Jun 01

Workshop on Text Mining Tools

June 1, 2012

Update: The proposed workshop went ahead. Details can be found at socal2012.thatcamp.org/2012/09/05/text-mining-workshop/.

Would anyone be interested in a workshop on easy text mining/computational analysis? I would be happy to run a workshop focusing on one or both of two approaches: Lexomics and Topic Modelling. Both can be used by literary and linguistic scholars, as well as for historical and cultural analysis based on texts or text corpora.

Lexomics is a technique (and a suite of tools) for comparing texts (or chunks of texts) for similarities based on word frequency analysis. The tools allow you to divide texts or groups of texts into chunks and then perform hierarchical clustering on the chunks to compare word frequencies. The output is a tree diagram (dendrogram) which displays similarities between chunks. Here is an example of a dendrogram comparing the Old English poems Daniel and Azarias. It reveals that chunks 4-6 of Daniel have a marked similarity to Azarias (the result, it turns out, of a common source).

Lexomics is easy to learn and easy to teach to students.
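
For anyone who would like to see the idea in code, here is a minimal sketch of the chunk, count, and cluster workflow. It is not the Lexomics tools' own implementation: the function names, the Euclidean distance, the average-linkage rule, and the three toy texts are all assumptions chosen for illustration. Each toy text is short enough to be treated as a single chunk.

```python
import re
from collections import Counter
from itertools import combinations


def chunk_words(text, size):
    """Split a text into consecutive chunks of `size` words (the last may be shorter)."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def relative_frequencies(words):
    """Map each word to its share of the chunk's total word count."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def distance(f1, f2):
    """Euclidean distance between two relative-frequency profiles."""
    vocab = set(f1) | set(f2)
    return sum((f1.get(w, 0.0) - f2.get(w, 0.0)) ** 2 for w in vocab) ** 0.5


def cluster(profiles):
    """Average-linkage agglomerative clustering of labelled profiles.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of chunk labels. Read bottom-up, the
    history is the information a dendrogram displays.
    """
    def linkage(pair):
        a, b = pair
        total = sum(distance(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(label,) for label in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical toy texts, each treated as one chunk.
texts = {
    "A": "the king rode to the hall and the king spoke to the men",
    "B": "the king spoke in the hall and the men heard the king",
    "C": "fish swim in cold rivers while birds fly over warm fields",
}
profiles = {
    label: relative_frequencies(chunk_words(text, 100)[0])
    for label, text in texts.items()
}
merges = cluster(profiles)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

The first merge joins A and B, the two texts with similar vocabulary, and C joins last. With longer texts you would call chunk_words with a realistic chunk size and give each chunk its own label.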

Topic modelling uses a more complex algorithm to arrange words found in texts into thematic “topics” so that the frequency of these themes can be compared in different texts or different parts of texts. A good account of the technique can be found on Ted Underwood’s blog (and see the links supplied there).
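
To give a feel for what the algorithm does, here is a rough sketch of latent Dirichlet allocation (LDA) with collapsed Gibbs sampling, the kind of model that MALLET trains. It is a toy under stated assumptions, not MALLET's implementation: the function name lda_gibbs, the hyperparameter values, and the four tiny documents are all hypothetical, and real corpora need far more text and many more iterations.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents.

    Returns (doc_topic, topic_word): how many words in each document are
    assigned to each topic, and how often each word occurs in each topic.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by assigning every word to a random topic.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            z = rng.randrange(num_topics)
            topics.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the word's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # Weight each topic by how common it is in this document
                # and how strongly it is associated with this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical toy corpus: two "battle" documents and two "seafaring" ones.
docs = [
    "king sword battle king shield".split(),
    "sword shield battle warrior".split(),
    "ship sea sail wind ship".split(),
    "sea wind wave sail".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    # Unary plus drops words whose count has fallen to zero.
    print("topic", k, [word for word, _ in (+counts).most_common(3)])
for d, row in enumerate(doc_topic):
    print("document", d, row)
```

The per-document rows are what make comparison possible: each one says how much of a text belongs to each "topic", so texts or parts of texts can be set side by side.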

Below is a list of examples of topic modeling taken from Clay Templeton’s Topic Modeling in the Humanities: An Overview:

Synchronic Approaches (Unit of analysis is not time bound)
Matthew Jockers’ work on the Day of DH blog posts (2010).
Elijah Meeks’ work on self-definitions of digital humanists (2011).
Jeff Drouin’s work on Proust (2011).
Travis Brown’s work on Jane Austen’s Emma and Byron’s Don Juan (2011).

Diachronic Approaches (Unit of analysis is a time slice)
Cameron Blevins’ work on Martha Ballard’s diary (2010).
Robert K. Nelson’s work on the Richmond Daily Dispatch corpus (2011).
Yang, Torget, and Mihalcea’s work on Texas newspapers (2011).

My own students’ work on the Middle English poem Havelok the Dane is another good example of the technique (especially in a pedagogical context).

The easiest way to do topic modelling is with the Machine Learning for Language Toolkit (MALLET). There is a convenient GUI version that allows you to run it without having to install MALLET itself (though installation is not very difficult either).
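
For those who do install MALLET, a typical run has two steps: import a folder of text files, then train the topics. The sketch below only assembles and prints the two commands; it does not run anything. The folder name my_texts, the output file names, the topic count of 20, and the bin/mallet path are assumptions for illustration, so adjust them to your own installation.

```python
import shlex


def mallet_commands(corpus_dir, num_topics, mallet="bin/mallet"):
    """Build the import and training commands for a basic MALLET run.

    All file names here are hypothetical placeholders.
    """
    import_cmd = [
        mallet, "import-dir",
        "--input", corpus_dir,
        "--output", "corpus.mallet",
        "--keep-sequence",        # keep word order, which topic training needs
        "--remove-stopwords",     # drop common English function words
    ]
    train_cmd = [
        mallet, "train-topics",
        "--input", "corpus.mallet",
        "--num-topics", str(num_topics),
        "--output-topic-keys", "topic_keys.txt",   # top words for each topic
        "--output-doc-topics", "doc_topics.txt",   # topic proportions per document
    ]
    return import_cmd, train_cmd


import_cmd, train_cmd = mallet_commands("my_texts", 20)
for cmd in (import_cmd, train_cmd):
    print(shlex.join(cmd))
```

Both output files are tab-delimited text, so they can be opened in a spreadsheet program for sorting and charting.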

Very little prior knowledge is needed for the workshop, other than basic manipulation of text files and Microsoft Excel. Everything else can be picked up pretty quickly. Access to a laptop will allow participants to download and experiment with the topic modelling tools (the Lexomics tools are web-based).

I could probably cover both Lexomics and topic modelling quickly in an hour, but two hours would be preferable, both to address theory and method and to allow people to get some hands-on practice with their own texts.

About the author

Scott Kleinman

I am a professor of English at California State University, Northridge, where I recently started a Center for the Digital Humanities. I work on medieval English language and literature and am interested in digital editing and computational stylistic analysis. My current projects include the production of an Archive of Early Middle English Texts and the development of algorithms for the automatic lemmatization of Old and Middle English texts (using mostly PHP and TEI/XML).

Permanent link to this article: http://socal2012.thatcamp.org/2012/06/01/workshop-on-text-mining-tools/

11 comments

1 ping

thatcampsocal

June 1, 2012 at 12:53 pm (UTC -7)

Thanks for kicking off the workshop proposals, Scott! A few questions to help our THATCampers get a better feel for this workshop, if you wouldn’t mind clarifying:

1. A brief explanation of text mining?
2. Who is your recommended target audience?
3. What are a couple of use cases, or examples, that you can point people to?

Thanks again!

Colleen
1. Scott Kleinman
  
  June 1, 2012 at 3:49 pm (UTC -7)
  
  Good suggestions. I think I’ll update the main post shortly, rather than answer in comments.
Joanna

June 12, 2012 at 1:04 pm (UTC -7)

Yes!
Karyl Ketchum

June 20, 2012 at 12:53 pm (UTC -7)

I would be very interested in this from the perspective of cultural studies and historical analysis. I have played with the Google Ngram Viewer, and it sounds like this would be a more sophisticated version of that. Thanks for proposing!
Richenda

July 2, 2012 at 3:39 pm (UTC -7)

I’d be interested in both of these, but especially MALLET. Great suggestion!
Viola Lasmana

July 24, 2012 at 9:32 pm (UTC -7)

Yes, I’d be interested as well!
Colleen Greene

August 1, 2012 at 11:58 am (UTC -7)

Scott, I’d say this workshop is a keeper! Thank you for offering to teach.
Ryan N

September 1, 2012 at 7:36 am (UTC -7)

Scott, Yes, I’d love to attend this workshop! This will be my reason for flying down from British Columbia for the weekend. Ryan
1. Colleen Greene
  
  September 4, 2012 at 10:40 am (UTC -7)
  
  Ryan: We’ve definitely got this workshop scheduled.
  1. Scott Kleinman
    
    September 4, 2012 at 12:54 pm (UTC -7)
    
    I will post some preparatory material for the workshop in the next week. I’m excited by all the enthusiasm!