Jun 01

Profile photo of Scott Kleinman

Workshop on Text Mining Tools

Update: The proposed workshop went ahead. Details can be found at socal2012.thatcamp.org/2012/09/05/text-mining-workshop/.

Would anyone be interested in a workshop on easy text mining/computational analysis? I would be happy to run a workshop focusing on one or both of two approaches: Lexomics and Topic Modelling. Both can be used by literary and linguistic scholars, as well as for historical and cultural analysis based on texts or text corpora.

Lexomics is a technique (and a suite of tools) for comparing texts (or chunks of texts) for similarities based on word frequency analysis. The tools allow you to divide texts or groups of texts into chunks and then perform hierarchical clustering on the chunks to compare word frequencies. The output is a tree diagram (dendrogram) which displays similarities between chunks. Here is an example of a dendrogram comparing the Old English poems Daniel and Azarias. It reveals that chunks 4-6 of Daniel have a marked similarity to Azarias (the result, it turns out, of a common source).
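
As a rough illustration of the workflow just described (chunk the texts, count word frequencies, cluster the chunks), here is a minimal Python sketch. It is not the Lexomics tools' own code: the toy texts, the chunk size, the Euclidean distance, and the average-linkage rule are all assumptions chosen for brevity, and the real tools may use different settings. The list of merges it prints is the information a dendrogram draws as a tree.

```python
from collections import Counter
from itertools import combinations
import math

def chunk_words(text, size):
    """Split a text into consecutive chunks of `size` words (the last may be shorter)."""
    words = text.lower().split()
    return [words[i:i + size] for i in range(0, len(words), size)]

def relative_frequencies(chunk, vocabulary):
    """Proportion of the chunk taken up by each word in the shared vocabulary."""
    counts = Counter(chunk)
    return [counts[w] / len(chunk) for w in vocabulary]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster(labels, vectors):
    """Average-linkage agglomerative clustering; returns the merges in order."""
    clusters = {label: [vec] for label, vec in zip(labels, vectors)}
    merges = []
    while len(clusters) > 1:
        def avg_dist(pair):
            a, b = pair
            dists = [euclidean(x, y) for x in clusters[a] for y in clusters[b]]
            return sum(dists) / len(dists)
        # Join the two clusters whose members are closest on average.
        a, b = min(combinations(clusters, 2), key=avg_dist)
        merges.append((a, b, avg_dist((a, b))))
        clusters[f"({a}+{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges

# Invented toy texts (hypothetical data, not the Old English poems).
texts = {
    "A": "the king and the queen the king and the queen",
    "B": "a ship on a sea",
}

labels, chunks = [], []
for name, text in texts.items():
    for i, chunk in enumerate(chunk_words(text, 5), start=1):
        labels.append(f"{name}{i}")
        chunks.append(chunk)

vocabulary = sorted({w for chunk in chunks for w in chunk})
vectors = [relative_frequencies(chunk, vocabulary) for chunk in chunks]
merges = cluster(labels, vectors)

for a, b, dist in merges:
    print(f"merge {a} with {b} at distance {dist:.3f}")
```

In this toy case the two chunks of text A have identical word frequencies, so they are joined first, and the chunk from text B joins them last.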

Lexomics is easy to learn and easy to teach to students.

Topic modelling uses a more complex algorithm to arrange words found in texts into thematic “topics” so that the frequency of these themes can be compared in different texts or different parts of texts. A good account of the technique can be found on Ted Underwood’s blog (and see the links supplied there).
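
To give a feel for what such an algorithm does, here is a small Python sketch of one common approach, latent Dirichlet allocation fitted by collapsed Gibbs sampling. This is an assumption for illustration: the description above does not name a specific algorithm, and the function name `lda_gibbs`, the toy documents, and the parameter values are all hypothetical. A real project would use an established tool rather than this sketch.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimised)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]        # words per topic, per document
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by assigning every word token to a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed share of each topic in each document.
    proportions = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    # Most frequent words currently assigned to each topic.
    top_words = [
        [w for w in sorted(tw, key=tw.get, reverse=True) if tw[w] > 0][:5]
        for tw in topic_word
    ]
    return proportions, top_words

# Invented toy documents (hypothetical data).
docs = [
    "king queen castle king crown".split(),
    "queen crown castle throne king".split(),
    "ship sea sail wind ship".split(),
    "sea wind sail ship harbour".split(),
]
proportions, top_words = lda_gibbs(docs, num_topics=2)

for k, words in enumerate(top_words):
    print(f"topic {k}: {' '.join(words)}")
for d, row in enumerate(proportions):
    print(f"document {d}: " + ", ".join(f"{p:.2f}" for p in row))
```

The per-document proportions are what make comparison possible: each row shows how much of a text (or chunk of a text) is attributed to each topic.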

Below is a list of examples of topic modelling taken from Clay Templeton’s Topic Modeling in the Humanities: An Overview:

Synchronic approaches (Unit of analysis is not time bound)
Matthew Jockers’ work on the Day of DH blog posts (2010).
Elijah Meeks’ work on self-definitions of digital humanists (2011).
Jeff Drouin’s work on Proust (2011).
Travis Brown’s work on Jane Austen’s Emma and Byron’s Don Juan (2011).

Diachronic approaches (Unit of analysis is a time slice)
Cameron Blevins’ work on Martha Ballard’s diary (2010).
Robert K. Nelson’s work on the Richmond Daily Dispatch corpus (2011).
Yang, Torget, and Mihalcea’s work on Texas newspapers (2011).

My own students’ work on the Middle English poem Havelok the Dane is another good example of the technique (especially in a pedagogical context).

The easiest way to do topic modelling is with the Machine Learning for Language Toolkit (MALLET). There is a convenient GUI version which allows you to run topic models without having to install MALLET itself (though installing it is not very difficult either).

Very little prior knowledge is needed for the workshop, other than basic manipulation of text files and Microsoft Excel. Everything else can be picked up pretty quickly. Access to a laptop will allow participants to download and experiment with the topic modelling tools (the Lexomics tools are web-based).

I could probably cover both Lexomics and topic modelling quickly in an hour, but two hours would be preferable, both to address theory and method and to allow people to get some hands-on practice with their own texts.

About the author

Scott Kleinman

I am a professor of English at California State University, Northridge, where I recently started a Center for the Digital Humanities. I work on medieval English language and literature and am interested in digital editing and computational stylistic analysis. My current projects include the production of an Archive of Early Middle English Texts and the development of algorithms for the automatic lemmatization of Old and Middle English texts (mostly using PHP and TEI/XML).

Permanent link to this article: http://socal2012.thatcamp.org/2012/06/01/workshop-on-text-mining-tools/

  • thatcampsocal

    Thanks for kicking off the workshop proposals, Scott! A few questions to help our THATCampers get a better feel for this workshop, if you wouldn’t mind clarifying:

    1. A brief explanation of text mining?
    2. Who is your recommended target audience?
    3. What are a couple of use cases, or examples, that you can point people to?

    Thanks again!


    • Good suggestions. I think I’ll update the main post shortly, rather than answer in comments.

  • Joanna

    I would be very interested in this from the perspective of cultural studies and historical analysis. I have played with the Google Ngram Viewer, and it sounds like this would be a more sophisticated version of that. Thanks for proposing!

  • Richenda

    I’d be interested in both of these, but especially MALLET. Great suggestion!

  • Yes, I’d be interested as well! 

  • Scott, I’d say this workshop is a keeper! Thank you for offering to teach.

  • Ryan N

    Scott, Yes, I’d love to attend this workshop! This will be my reason for flying down from British Columbia for the weekend. Ryan

    • Ryan: We’ve definitely got this workshop scheduled.

      • I will post some preparatory material for the workshop in the next week. I’m excited by all the enthusiasm!
