Jun 01

Workshop on Text Mining Tools

Update: The proposed workshop went ahead. Details can be found at socal2012.thatcamp.org/2012/09/05/text-mining-workshop/.

Would anyone be interested in a workshop on easy text mining/computational analysis? I would be happy to run a workshop focusing on one or both of two approaches: Lexomics and Topic Modelling. Both can be used by literary and linguistic scholars, as well as for historical and cultural analysis based on texts or text corpora.

Lexomics is a technique (and a suite of tools) for comparing texts (or chunks of texts) for similarities based on word frequency analysis. The tools allow you to divide texts or groups of texts into chunks and then perform hierarchical clustering on the chunks to compare word frequencies. The output is a tree diagram (dendrogram) which displays similarities between chunks. Here is an example of a dendrogram comparing the Old English poems Daniel and Azarias. It reveals that chunks 4-6 of Daniel have a marked similarity to Azarias (the result, it turns out, of a common source).

Lexomics is easy to learn and easy to teach to students.
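
For anyone curious about what happens under the hood, the chunk-and-cluster procedure described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the Lexomics tools: the function names, the choice of Euclidean distance with average linkage, and the toy texts are all assumptions made for the example.

```python
import re
from collections import Counter
from math import sqrt


def chunk_words(text, size):
    """Split a text into consecutive chunks of `size` words (the last may be shorter)."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def relative_frequencies(chunk, vocabulary):
    """Express each vocabulary word's count as a share of the chunk's length."""
    counts = Counter(chunk)
    return [counts[word] / len(chunk) for word in vocabulary]


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_tree(texts, chunk_size):
    """Cluster all chunks bottom-up; return the tree as nested tuples of chunk labels."""
    labels, chunks = [], []
    for name, text in texts.items():
        for number, chunk in enumerate(chunk_words(text, chunk_size), start=1):
            labels.append(f"{name}{number}")
            chunks.append(chunk)
    vocabulary = sorted({word for chunk in chunks for word in chunk})
    vectors = [relative_frequencies(chunk, vocabulary) for chunk in chunks]

    # Each cluster is (subtree, indices of the chunks it contains).
    clusters = [(label, [i]) for i, label in enumerate(labels)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster chunk pairs.
                distances = [euclidean(vectors[a], vectors[b])
                             for a in clusters[i][1] for b in clusters[j][1]]
                average = sum(distances) / len(distances)
                if best is None or average < best[0]:
                    best = (average, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Invented toy data: chunk 2 of the long text shares its vocabulary with the short text.
long_text = "fire flame fire heat water wave water sea fire heat flame fire"
short_text = "wave water sea water"
tree = build_tree({"L": long_text, "S": short_text}, chunk_size=4)
print(tree)
```

With this toy input the script prints `(('L1', 'L3'), ('L2', 'S1'))`: the second chunk of the long text groups with the short text rather than with its own neighbours, which is the kind of pattern a dendrogram makes visible.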

Topic modelling uses a more complex algorithm to arrange words found in texts into thematic “topics” so that the frequency of these themes can be compared in different texts or different parts of texts. A good account of the technique can be found on Ted Underwood’s blog (and see the links supplied there).
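
To give a rough sense of what such an algorithm does, here is a toy Python sketch of latent Dirichlet allocation (LDA), the model that MALLET's topic modelling is built on, fitted with a simple collapsed Gibbs sampler. It is not MALLET's implementation: the function name, the hyperparameter values (`alpha`, `beta`), the iteration count, and the three miniature "documents" are assumptions chosen purely for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    `docs` is a list of token lists. Returns (topic_word, doc_topic): for each
    topic, how often each word was assigned to it, and for each document, how
    often each topic was assigned within it.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [Counter() for _ in docs]
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            z = rng.randrange(num_topics)
            topics.append(z)
            topic_word[z][word] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the token out of the counts, then resample its topic.
                z = assignments[d][i]
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                # A topic is likely if it is common in this document
                # and if this word is common in that topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                topic_word[z][word] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return topic_word, doc_topic


# Invented miniature corpus; real work needs far more text than this.
docs = [
    "ship sea sail ship wave sea".split(),
    "king hall sword king feast hall".split(),
    "sea wave ship king sword hall".split(),
]
topic_word, doc_topic = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(topic_word):
    print("topic", k, [w for w, n in words.most_common(3) if n > 0])
for d, topics in enumerate(doc_topic):
    print("document", d, dict(topics))
```

The per-document counts are what allow the comparison described above: a document whose tokens fall mostly into one topic is "about" that theme, and the proportions can be compared across texts or parts of texts.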

Below is a list of examples of topic modeling taken from Clay Templeton’s Topic Modeling in the Humanities: An Overview:

Synchronic approaches (Unit of analysis is not time-bound)
Matthew Jockers’ work on the Day of DH blog posts (2010).
Elijah Meeks’ work on self-definitions of digital humanists (2011).
Jeff Drouin’s work on Proust (2011).
Travis Brown’s work on Jane Austen’s Emma and Byron’s Don Juan (2011).

Diachronic approaches (Unit of analysis is a time slice)
Cameron Blevins’ work on Martha Ballard’s diary (2010).
Robert K. Nelson’s work on the Richmond Daily Dispatch corpus (2011).
Yang, Torget, and Mihalcea’s work on Texas newspapers (2011).

My own students’ work on the Middle English poem Havelok the Dane is another good example of the technique (especially in a pedagogical context).

The easiest way to do topic modelling is with the Machine Learning for Language Toolkit (MALLET). There is a convenient GUI version which allows you to run topic models without installing MALLET itself (though that is not very difficult either).

Very little prior knowledge is needed for the workshop, other than basic manipulation of text files and Microsoft Excel. Everything else can be picked up pretty quickly. Access to a laptop will allow participants to download and experiment with the topic modelling tools (the Lexomics tools are web-based).

I could probably cover both Lexomics and topic modelling quickly in an hour, but two hours would be preferable, both to address theory and method and to allow people to get some hands-on practice with their own texts.

About the author

Scott Kleinman

I am a professor of English at California State University, Northridge, where I recently started a Center for the Digital Humanities. I work on medieval English language and literature and am interested in digital editing and computational stylistic analysis. My current projects include the production of an Archive of Early Middle English Texts and the development of algorithms for the automatic lemmatization of Old and Middle English texts (using mostly PHP and TEI/XML).

Permanent link to this article: http://socal2012.thatcamp.org/2012/06/01/workshop-on-text-mining-tools/
