Text Mining Workshop

Scott Kleinman — Wed, 05 Sep 2012 20:00:42 +0000

The text mining workshop will take place on Friday, 14 September from 1:30-3:30 in Pavilion C (Main Room).

Workshop Description

This workshop will introduce the basic concept of text mining: the discovery of knowledge through the analysis of digital texts using computational approaches. The workshop will cover the stages of text mining from preparing the texts, to performing analyses, to visualising the results. We will focus on two emerging methods of text mining that are easy for the novice to learn but sophisticated enough to produce real results.

Lexomics is a method for clustering texts or parts of texts based on their word frequencies. The technique allows users to examine similarities and differences between texts in ways that can point to interpretive insights or directions of further enquiry into the style, authorship, and origin of the texts. Topic modelling is a technique for using word frequencies to extract individual units of discourse (called “topics”) from texts. Texts can then be compared based on the presence of certain topics, or the proportion of certain topics can be traced across a corpus over time (or by other criteria).
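As a rough illustration of the lexomics idea (clustering texts by their word frequencies), here is a minimal Python sketch. It is not the code behind the Lexomics tools: the choice of Euclidean distance and average linkage, the function names, and the three tiny sample texts are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def relative_frequencies(text):
    """Return each word's share of the total word count in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def distance(freq_a, freq_b):
    """Euclidean distance between two relative-frequency profiles."""
    vocabulary = set(freq_a) | set(freq_b)
    return math.sqrt(
        sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in vocabulary)
    )


def cluster(texts):
    """Agglomerative clustering with average linkage (an assumed choice).

    `texts` maps a label to a text. Returns the merges in the order they
    happen, as (cluster_a, cluster_b, distance), where each cluster is a
    tuple of labels.
    """
    profiles = {label: relative_frequencies(t) for label, t in texts.items()}
    clusters = [(label,) for label in sorted(profiles)]
    merges = []
    while len(clusters) > 1:

        def linkage(pair):
            # Average distance between every member of one cluster
            # and every member of the other.
            a, b = pair
            dists = [distance(profiles[x], profiles[y]) for x in a for y in b]
            return sum(dists) / len(dists)

        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Invented sample texts, for illustration only.
texts = {
    "A": "the king rode to the hall and the king spoke",
    "B": "the king rode to the hall and the queen spoke",
    "C": "fish swim in rivers and fish swim in seas",
}

merges = cluster(texts)
for left, right, dist in merges:
    print(left, "+", right, f"(distance {dist:.3f})")
```

The order of the merges is what a dendrogram draws: texts that join early have similar word-frequency profiles, and texts that join late differ more.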

There will be a hands-on component to the workshop to allow participants to learn the software tools for exploring these methods. We will also discuss the epistemological and hermeneutic issues raised by the use of text mining approaches to the analysis of texts in the Humanities.

Advance Preparation for the Workshop

No prior experience with computational text analysis is necessary. The tools for performing lexomics analysis are web based, so you do not need to download them in advance. These tools may be found on the Lexomics web site: wheatoncollege.edu/lexomics/tools/.

There are many tools for performing topic modelling, but we will use the GUI Topic Modeling Tool, which may be downloaded at code.google.com/p/topic-modeling-tool/. Please download it in advance of the workshop. Note that in order to run the GUI Topic Modeling Tool, you will need to have Java installed on your computer. You can test whether Java is working and find out how to install it at www.java.com/en/download/testjava.jsp.

Please feel free to download the sample texts for use during the hands-on session.

Finally, please have a copy of Google Chrome or Firefox installed on your computer, as the lexomics tools have not been tested with Internet Explorer.

Background Reading:

The Lexomics web site contains background and tutorials, and there is further information on the publications and grants page. All the tools are accessible from the tools page on that site.
Topic Modeling for Humanists: A Guided Tour: The most recent account by Scott Weingart. It contains links to most of the readings below and more. It is organised into sections which go deeper and deeper into the complexities of topic modelling so that you are not thrown into the deep end.
Topic modeling made just simple enough: Ted Underwood’s very coherent account of LDA and some best practices for humanists.
What kinds of “topics” does topic modeling actually produce: This post, also by Ted Underwood, also addresses best practices, including issues of topic coherence and the scale of the corpus.
The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors: Matt Jockers’ allegory of the LDA Buffet.
Topic Modeling and Network Analysis: Scott Weingart’s description of topic modelling and its relationship with network analysis provides links to the major bibliography for topic modelling.
Topic Modeling in the Humanities: An overview: Clay Templeton’s introduction.
Why use visualizations to study poetry?: Lisa Rhody’s account of her experiments on ekphrasis (scroll to the bottom to see my comments).
Reading Tea Leaves: How Humans Interpret Topic Models: An interesting study of the way humans interpret topic models. A video presentation of the project is also available.
Some Assembly Required: Understanding and Interpreting Topics in LDA Models of Figurative Language: Lisa Rhody discusses the tendency of topic modelling to produce “semantically opaque topics” (and what they can tell us).
Topic Modeling for Humanists: A Guided Tour: Scott Weingart’s version of this collection of links. He provides slightly more explanation and a few more links (in addition to the ones here).

Tools:

GUI Topic Modeling Tool (uses MALLET as its back end)
MALLET (useful installation instructions here)

For convenience, here are some basic commands for operating the command-line version of MALLET. The first command imports the data and the second generates the topics:

bin\mallet import-dir --input data --output filename.mallet --keep-sequence --remove-stopwords

bin\mallet train-topics --input filename.mallet --num-topics 20 --output-state topic-state.gz --output-topic-keys filename.txt --output-doc-topics filename_composition

Update: A fuller set of instructions for using MALLET can be found at programminghistorian.org/lessons/topic-modeling-and-mallet.
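Once the second command has run, the topic keys file (filename.txt in the commands above) can be read into Python for further work. The sketch below assumes each line holds three tab-separated fields: a topic number, a weight, and a space-separated list of the topic's top words. Check your own file before relying on this, as the layout may differ between MALLET versions. The function name and the two sample lines are invented for the example.

```python
import os
import tempfile


def read_topic_keys(path):
    """Parse a topic keys file into a list of dictionaries.

    Assumed layout per line: topic number <TAB> weight <TAB> top words.
    """
    topics = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            number, weight, words = line.split("\t", 2)
            topics.append(
                {
                    "topic": int(number),
                    "weight": float(weight),
                    "words": words.split(),
                }
            )
    return topics


# Write an invented two-topic file so the example runs without MALLET.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "filename.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("0\t2.5\tking hall sword\n1\t2.5\tsea ship wave\n")
    topics = read_topic_keys(sample_path)

for entry in topics:
    print(entry["topic"], " ".join(entry["words"]))
```

To use it on real output, call read_topic_keys with the path to the file you named in --output-topic-keys.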

Visualisation:

Still a challenge. I am working on a PHP-based topic browser that improves on the GUI Topic Modeling Tool output, but right now it only lives on my hard drive, so I can’t link to it. Elijah Meeks has made good use of Gephi, but it does not like my graphics card, so I haven’t tried it. Gephi seems best suited to network analysis.

Right now, the easiest visualisation option seems to be opening CSV data for topic models in Excel and generating graphs there.
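For those who would rather script this step than use Excel, here is a minimal Python sketch that averages one topic's proportion per year and prints a crude text bar chart. The column names (doc, year, topic, proportion) and the sample rows are hypothetical; the CSV files your tool produces will almost certainly be laid out differently, so treat this as a pattern to adapt.

```python
import csv
import io
from collections import defaultdict

# Invented sample data: one row per document and topic.
SAMPLE_CSV = """doc,year,topic,proportion
a,1850,0,0.20
a,1850,1,0.80
b,1850,0,0.40
b,1850,1,0.60
c,1860,0,0.60
c,1860,1,0.40
"""


def mean_proportion_by_year(rows, topic):
    """Average the proportion of `topic` across the documents of each year."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row["topic"] == topic:
            totals[row["year"]] += float(row["proportion"])
            counts[row["year"]] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}


# csv.DictReader returns every field as a string, so the topic is "0".
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
trend = mean_proportion_by_year(rows, topic="0")

for year, value in trend.items():
    # One '#' per 2.5 percentage points, followed by the exact figure.
    print(year, "#" * round(value * 40), f"{value:.2f}")
```

To read a real file, replace io.StringIO(SAMPLE_CSV) with an open file handle and adjust the column names to match.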

That said, I’m really impressed with Matt Jockers’ theme viewer, presented in anticipation of the publication of his book Macroanalysis: Digital Methods and Literary History (UIUC Press, 2013). It’s really just a set of individually generated bar and line graphs, combined with word clouds and stuck in a database, but it’s effective. Also worthy of mention is Elijah Meeks’ use of D3 to create a word cloud “topography”.

Workshop Presentation:

I’m going to re-work it into a blog post during the week after the conference. My blog is scottkleinman.net/.

Text Mining – THATCamp Southern California 2012