Update: The proposed workshop went ahead. Details can be found at socal2012.thatcamp.org/2012/09/05/text-mining-workshop/.
Would anyone be interested in a workshop on easy text mining/computational analysis? I would be happy to run a workshop focusing on one or both of two approaches: Lexomics and topic modelling. Both can be used by literary and linguistic scholars, as well as for historical and cultural analysis based on texts or text corpora.
Lexomics is a technique (and a suite of tools) for comparing texts (or chunks of texts) for similarities based on word frequency analysis. The tools allow you to divide texts or groups of texts into chunks and then perform hierarchical clustering on the chunks to compare word frequencies. The output is a tree diagram (dendrogram) which displays similarities between chunks. Here is an example of a dendrogram comparing the Old English poems Daniel and Azarias. It reveals that chunks 4-6 of Daniel have a marked similarity to Azarias (the result, it turns out, of a common source).
Lexomics is easy to learn and easy to teach to students.
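To make the chunk-and-cluster idea concrete, here is a minimal Python sketch of the general technique. It is not the code behind the Lexomics tools: the function names, the toy texts, the chunk size, the Euclidean distance, and the average-linkage rule are all assumptions chosen for illustration. It returns the order in which chunks merge, which is the information a dendrogram draws.

```python
from collections import Counter
from itertools import combinations
import math


def chunk_words(text, size):
    """Split a text into consecutive chunks of `size` words (the last may be shorter)."""
    words = text.lower().split()
    return [words[i:i + size] for i in range(0, len(words), size)]


def relative_frequencies(words):
    """Map each word to its share of the chunk's total word count."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def distance(freq_a, freq_b):
    """Euclidean distance between two relative-frequency profiles."""
    vocab = set(freq_a) | set(freq_b)
    return math.sqrt(sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in vocab))


def average_linkage(a, b, profiles):
    """Mean distance between every chunk in cluster `a` and every chunk in cluster `b`."""
    total = sum(distance(profiles[x], profiles[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster(profiles):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Returns the merges in order as (members_a, members_b, distance) tuples.
    """
    clusters = [(label,) for label in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        merges.append((a, b, average_linkage(a, b, profiles)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Invented toy texts (not real poems): the first half of `poem_one`
# deliberately resembles `poem_two`.
poem_one = ("the king rode to the hall and the king spoke to the men "
            "fire burned bright while angels sang over fire and flame")
poem_two = "the king rode to the field and the king spoke to the host"

profiles = {}
for name, text in (("one", poem_one), ("two", poem_two)):
    for i, chunk in enumerate(chunk_words(text, 13), start=1):
        profiles[f"{name}-{i}"] = relative_frequencies(chunk)

merges = cluster(profiles)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

The first merge printed pairs the opening chunk of the first toy text with the second toy text, which is the same kind of cross-text similarity the Daniel and Azarias dendrogram shows. Real texts would need proper tokenisation and a much larger chunk size.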
Topic modelling uses a more complex algorithm to arrange words found in texts into thematic “topics” so that the frequency of these themes can be compared in different texts or different parts of texts. A good account of the technique can be found on Ted Underwood’s blog (and see the links supplied there).
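For readers curious about what that algorithm looks like, here is a rough Python sketch of one common approach, latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling. It is a teaching toy under stated assumptions, not MALLET's implementation: the function names, the hyperparameter values (`alpha`, `beta`), the iteration count, and the four tiny invented "documents" are all made up for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenised documents.

    Returns (doc_topic_counts, topic_word_counts) after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [Counter() for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight each topic by how much this document and this word favour it.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def topic_shares(counts, num_topics):
    """Proportion of a document's tokens assigned to each topic."""
    total = sum(counts.values())
    return [counts[k] / total for k in range(num_topics)]


# Invented toy documents with two obvious themes.
docs = [
    "king hall sword king shield".split(),
    "sword shield king hall hall".split(),
    "angel fire flame angel song".split(),
    "fire song angel flame flame".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)

for d, counts in enumerate(doc_topic):
    print("document", d, [round(s, 2) for s in topic_shares(counts, 2)])
for k, words in enumerate(topic_word):
    print("topic", k, [w for w, c in words.most_common(3) if c > 0])
```

The per-document shares are what let you compare how prominent a theme is across texts or parts of texts. On a corpus this small the result is unstable and depends on the random seed; real work needs far more text and a tested tool.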
Below is a list of examples of topic modeling taken from Clay Templeton’s Topic Modeling in the Humanities: An Overview:
Synchronic approaches (Unit of analysis is not time bound)
Matthew Jockers’ work on the Day of DH blog posts (2010).
Elijah Meeks’ work on self-definitions of digital humanists (2011).
Jeff Drouin’s work on Proust (2011).
Travis Brown’s work on Jane Austen’s Emma and Byron’s Don Juan (2011).
Diachronic approaches (Unit of analysis is a time slice)
Cameron Blevins’ work on Martha Ballard’s diary (2010).
Robert K. Nelson’s work on the Richmond Daily Dispatch corpus (2011).
Yang, Torget, and Mihalcea’s work on Texas newspapers (2011).
My own students’ work on the Middle English poem Havelok the Dane is another good example of the technique (especially in a pedagogical context).
The easiest way to do topic modelling is with the Machine Learning for Language Toolkit (MALLET). There is also a convenient GUI version that lets you run topic models without installing MALLET itself (though installing it is not very difficult either).
Very little prior knowledge is needed for the workshop, other than basic manipulation of text files and Microsoft Excel. Everything else can be picked up pretty quickly. Access to a laptop will allow participants to download and experiment with the topic modelling tools (the Lexomics tools are web-based).
I could probably cover both Lexomics and topic modelling quickly in an hour, but two hours would be preferable, both to address theory and method and to allow people to get some hands-on practice with their own texts.
11 comments
1 ping
thatcampsocal
June 1, 2012 at 12:53 pm (UTC -7) Link to this comment
Thanks for kicking off the workshop proposals, Scott! A few questions to help our THATCampers get a better feel for this workshop, if you wouldn’t mind clarifying:
1. A brief explanation of text mining?
2. Who is your recommended target audience?
3. What are a couple of use cases, or examples, that you can point people to?
Thanks again!
Colleen
Scott Kleinman
June 1, 2012 at 3:49 pm (UTC -7) Link to this comment
Good suggestions. I think I’ll update the main post shortly, rather than answer in comments.
Joanna
June 12, 2012 at 1:04 pm (UTC -7) Link to this comment
Yes!
Karyl Ketchum
June 20, 2012 at 12:53 pm (UTC -7) Link to this comment
I would be very interested in this from the perspective of cultural studies and historical analysis. I have played with the Google Ngram Viewer, and it sounds like this would be a more sophisticated version of that. Thanks for proposing!
Richenda
July 2, 2012 at 3:39 pm (UTC -7) Link to this comment
I’d be interested in both of these, but especially MALLET. Great suggestion!
Viola Lasmana
July 24, 2012 at 9:32 pm (UTC -7) Link to this comment
Yes, I’d be interested as well!
Colleen Greene
August 1, 2012 at 11:58 am (UTC -7) Link to this comment
Scott, I’d say this workshop is a keeper! Thank you for offering to teach.
Ryan N
September 1, 2012 at 7:36 am (UTC -7) Link to this comment
Scott, Yes, I’d love to attend this workshop! This will be my reason for flying down from British Columbia for the weekend. Ryan
Colleen Greene
September 4, 2012 at 10:40 am (UTC -7) Link to this comment
Ryan: We’ve definitely got this workshop scheduled.
Scott Kleinman
September 4, 2012 at 12:54 pm (UTC -7) Link to this comment
I will post some preparatory material for the workshop in the next week. I’m excited by all the enthusiasm!
Calling for #THATCampSoCal workshop proposals and requests! » THATCamp Southern California 2012
August 1, 2012 at 12:05 pm (UTC -7) Link to this comment
[…] Workshop Proposal: Text Mining Tools […]