Text Mining Workshop

The text mining workshop will take place on Friday, 14 September from 1:30-3:30 in Pavilion C (Main Room).

Workshop Description

This workshop will introduce the basic concept of text mining: the discovery of knowledge through the analysis of digital texts using computational approaches. The workshop will cover the stages of text mining from preparing the texts, to performing analyses, to visualising the results. We will focus on two emerging methods of text mining that are easy for the novice to learn but sophisticated enough to produce real results.

Lexomics is a method for clustering texts or parts of texts based on their word frequencies. The technique allows users to examine similarities and differences between texts in a way that can point to interpretive insights or directions of further enquiry into the style, authorship, and origin of the texts. Topic modelling is a technique for using word frequencies to extract individual units of discourse (called “topics”) from texts. Texts can then be compared based on the presence of certain topics, or the proportion of certain topics can be traced across a corpus over time (or according to other criteria).
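To make the idea of clustering by word frequency concrete, here is a minimal Python sketch. It is not the code behind the Lexomics tools. It assumes Euclidean distance between relative word frequencies and average-linkage merging, and the function names and sample texts are invented for illustration.

```python
import math
import re
from collections import Counter


def relative_frequencies(text):
    """Map each word in the text to its share of the total word count."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def distance(freq_a, freq_b):
    """Euclidean distance between two relative-frequency profiles."""
    vocabulary = set(freq_a) | set(freq_b)
    return math.sqrt(
        sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in vocabulary)
    )


def cluster(texts):
    """Average-linkage agglomerative clustering (an assumed choice).

    `texts` maps a label to a string. Returns the merges in the order they
    happen, each as (cluster_a, cluster_b, distance); clusters are tuples
    of labels.
    """
    profiles = {label: relative_frequencies(t) for label, t in texts.items()}
    clusters = [(label,) for label in sorted(texts)]
    merges = []

    def linkage(a, b):
        # Mean distance over every pair of texts drawn from the two clusters.
        pairs = [(x, y) for x in a for y in b]
        return sum(distance(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the two clusters that are currently closest together.
        d, i, j = min(
            (linkage(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical sample texts: "a" and "b" share most of their vocabulary.
texts = {
    "a": "the cat sat on the mat",
    "b": "the cat sat on the rug",
    "c": "colourless green ideas sleep furiously",
}
for left, right, d in cluster(texts):
    print(left, "+", right, f"at distance {d:.3f}")
```

The order of the merges is what a dendrogram draws: texts that join early have similar word-frequency profiles, and texts that join late stand apart.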

There will be a hands-on component to the workshop to allow participants to learn the software tools for exploring these methods. We will also have a discussion about the epistemological and hermeneutic issues raised by the use of text mining approaches to the analysis of texts in the Humanities.

Advance Preparation for the Workshop

No prior experience with computational text analysis is necessary. The tools for performing lexomics analysis are web-based, so you do not need to download them in advance. These tools may be found on the Lexomics web site: wheatoncollege.edu/lexomics/tools/.

There are many tools for performing topic modelling, but we will use the GUI Topic Modeling Tool, which may be downloaded at code.google.com/p/topic-modeling-tool/. Please download it in advance of the workshop. Note that in order to run the GUI Topic Modeling Tool, you will need to have Java installed on your computer. You can test whether Java is working and find out how to install it at www.java.com/en/download/testjava.jsp.

Please feel free to download the sample texts for use during the hands-on session.

Finally, please have a copy of Google Chrome or Firefox installed on your computer, as the lexomics tools have not been tested with Internet Explorer.

Background Reading:


For convenience, here are some basic commands for operating the command-line version of MALLET. The first command imports the data and the second generates the topics:


bin\mallet import-dir --input data --output filename.mallet --keep-sequence --remove-stopwords

bin\mallet train-topics --input filename.mallet --num-topics 20 --output-state topic-state.gz --output-topic-keys filename.txt --output-doc-topics filename_composition

Update: A fuller set of instructions for using MALLET can be found at programminghistorian.org/lessons/topic-modeling-and-mallet.
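
Once MALLET has written the topic keys file, the topics can be read into Python for further work. The sketch below assumes that each line of the file holds a topic number, a weight, and a space-separated list of top words, separated by tabs. Please check this against your own output, since I have not verified the layout for every MALLET version. The function name and the sample lines are invented for illustration.

```python
def parse_topic_keys(lines):
    """Parse topic key lines into a dictionary keyed by topic number.

    Assumed layout per line: topic number, weight, and top words,
    separated by tab characters.
    """
    topics = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = {"weight": float(weight), "words": words.split()}
    return topics


# Invented sample lines standing in for the contents of filename.txt.
sample = [
    "0\t2.5\tking knight sword battle\n",
    "1\t2.5\tlove heart rose garden\n",
]
for number, topic in parse_topic_keys(sample).items():
    print(number, " ".join(topic["words"][:3]))

# To read a real file instead, pass the open file handle:
# with open("filename.txt", encoding="utf-8") as handle:
#     topics = parse_topic_keys(handle)
```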


Visualising the results of topic modelling is still a challenge. I am working on a PHP-based topic browser that improves on the GUI Topic Modeling Tool output, but right now it only lives on my hard drive, so I can’t link to it. Elijah Meeks has made good use of Gephi, but it does not like my graphics card, so I haven’t tried it. Gephi seems to be best suited to types of network analysis.

Right now, the easiest visualisation option seems to be opening CSV data for topic models in Excel and generating graphs there.
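
If you would rather not use Excel, a few lines of Python can produce a rough equivalent of a bar graph. The sketch below assumes a hypothetical CSV layout: a header row, a first column naming each document, and one numeric column per topic. Real topic model output may be laid out differently, so treat the column handling as an assumption to adjust.

```python
import csv
import io


def mean_topic_proportions(csv_file):
    """Average each topic column over all documents in the CSV."""
    reader = csv.reader(csv_file)
    header = next(reader)
    topic_names = header[1:]  # first column is assumed to name the document
    totals = [0.0] * len(topic_names)
    documents = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        for index, value in enumerate(row[1:]):
            totals[index] += float(value)
        documents += 1
    if documents == 0:
        return {}
    return {name: total / documents for name, total in zip(topic_names, totals)}


def text_bar_chart(proportions, width=40):
    """Render one line per topic with a bar proportional to its value."""
    lines = []
    for name, value in proportions.items():
        bar = "#" * round(value * width)
        lines.append(f"{name:<10} {bar} {value:.2f}")
    return "\n".join(lines)


# Invented sample data standing in for a real CSV file.
sample = io.StringIO(
    "document,topic_0,topic_1\n"
    "text_a.txt,0.75,0.25\n"
    "text_b.txt,0.25,0.75\n"
)
print(text_bar_chart(mean_topic_proportions(sample)))
```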

That said, I’m really impressed with Matt Jockers’ theme viewer, presented in anticipation of the publication of his book Macroanalysis: Digital Methods and Literary History (UIUC Press, 2013). It’s really just a combination of individually generated bar and line graphs, combined with word clouds, and stuck in a database, but it’s effective. Also worthy of mention is Elijah Meeks’ use of D3 to create a word cloud “topography”.

Workshop Presentation:

I’m going to re-work it into a blog post during the week after the conference. My blog is scottkleinman.net/.

About the author

Scott Kleinman

I am a professor of English at California State University, Northridge, where I recently started a Center for the Digital Humanities. I work on medieval English language and literature and am interested in digital editing and computational stylistic analysis. My current projects include the production of an Archive of Early Middle English Texts and the development of algorithms for the automatic lemmatization of Old and Middle English texts (using mostly PHP and TEI/XML).

