Scott Kleinman – THATCamp Southern California 2012 (http://socal2012.thatcamp.org)

Text Mining Workshop
Wed, 05 Sep 2012


The text mining workshop will take place on Friday, 14 September from 1:30-3:30 in Pavilion C (Main Room).

Workshop Description

This workshop will introduce the basic concept of text mining: the discovery of knowledge through the analysis of digital texts using computational approaches. The workshop will cover the stages of text mining from preparing the texts, to performing analyses, to visualising the results. We will focus on two emerging methods of text mining that are easy for the novice to learn but sophisticated enough to produce real results.

Lexomics is a method for clustering texts or parts of texts based on their word frequencies. The technique allows users to examine similarities and differences between texts in a way that can point to interpretive insights or directions for further enquiry into the style, authorship, and origin of the texts. Topic modelling is a technique that uses word frequencies to extract individual units of discourse (called “topics”) from texts, so that texts can be compared based on the presence of certain topics, or so that the proportion of certain topics can be traced across a corpus over time (or by other criteria).
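For those curious what this looks like in practice, below is a minimal Python sketch of the topic modelling idea, using scikit-learn's LDA implementation rather than the MALLET-based tools we will use in the workshop; the sample documents and the number of topics are placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus; in practice each string would be a full text.
documents = ["first text ...", "second text ...", "third text ..."]

# Build a document-term matrix of raw word counts.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Fit a model with 20 topics; each topic is a distribution over words.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(counts)  # rows: documents; columns: topic proportions

# The top words of each topic suggest the unit of discourse it captures.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-10:][::-1]]
    print(f"Topic {i}: {' '.join(top)}")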

There will be a hands-on component to the workshop to allow participants to learn the software tools for exploring these methods. We will also discuss the epistemological and hermeneutic issues raised by the use of text mining approaches to the analysis of texts in the Humanities.

Advance Preparation for the Workshop

No prior experience with computational text analysis is necessary. The tools for performing lexomics analysis are web-based, so you do not need to download them in advance. These tools may be found on the Lexomics web site: wheatoncollege.edu/lexomics/tools/.

There are many tools for performing topic modelling, but we will use the GUI Topic Modeling Tool, which may be downloaded at code.google.com/p/topic-modeling-tool/. Please download it in advance of the workshop. Note that in order to run the GUI Topic Modeling Tool, you will need to have Java installed on your computer. You can test whether Java is working and find out how to install it at www.java.com/en/download/testjava.jsp.
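If you prefer the command line, you can also check your Java installation by running the following command in a terminal; it should print a version string if Java is installed:

java -version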

Please feel free to download the sample texts for use during the hands-on session.

Finally, please have a copy of Google Chrome or Firefox installed on your computer, as the lexomics tools have not been tested with Internet Explorer.

Background Reading:

Tools:

For convenience, here are some basic commands for operating the command-line version of MALLET. The first command imports the data and the second generates the topics:


bin\mallet import-dir --input data --output filename.mallet --keep-sequence --remove-stopwords


bin\mallet train-topics --input filename.mallet --num-topics 20 --output-state topic-state.gz --output-topic-keys filename.txt --output-doc-topics filename_composition

Update: A fuller set of instructions for using MALLET can be found at programminghistorian.org/lessons/topic-modeling-and-mallet.
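Once the commands above have run, the --output-topic-keys file is plain text and easy to inspect programmatically. Here is a small Python sketch for reading it; it assumes the tab-separated format (topic number, weight, top words) that recent MALLET versions write, so adjust if your version differs.

# Read MALLET's topic keys file: topic<TAB>weight<TAB>top words.
with open("filename.txt", encoding="utf-8") as f:
    for line in f:
        topic, weight, words = line.rstrip("\n").split("\t")
        print(f"Topic {topic} (weight {weight}): {words}")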

Visualisation:

Still a challenge. I am working on a PHP-based topic browser that improves on the GUI Topic Modeling Tool output, but right now it only lives on my hard drive, so I can’t link to it. Elijah Meeks has made good use of Gephi, but it does not like my graphics card, so I haven’t tried it. It seems best suited to certain types of network analysis.

Right now, the easiest visualisation option seems to be opening CSV data for topic models in Excel and generating graphs there.
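If you would rather script the graphs than build them in Excel, the same CSV data can be plotted with pandas and matplotlib, as in this sketch; the file name and column names ("doc", "topic_0") are placeholders for whatever your export contains.

import pandas as pd
import matplotlib.pyplot as plt

# Load the document-topic composition CSV and chart one topic's proportions.
df = pd.read_csv("filename_composition.csv")
df.plot(x="doc", y="topic_0", kind="bar", legend=False)
plt.ylabel("Proportion of topic 0")
plt.tight_layout()
plt.show()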

That said, I’m really impressed with Matt Jockers’ theme viewer, presented in anticipation of the publication of his book Macroanalysis: Digital Methods and Literary History (UIUC Press, 2013). It’s really just individually generated bar and line graphs, combined with word clouds and stuck in a database, but it’s effective. Also worthy of mention is Elijah Meeks’ use of D3 to create a word cloud “topography”.

Workshop Presentation:

I’m going to re-work the presentation into a blog post during the week after the conference. My blog is at scottkleinman.net/.

Workshop on Text Mining Tools
Fri, 01 Jun 2012


Update: The proposed workshop went ahead. Details can be found at socal2012.thatcamp.org/2012/09/05/text-mining-workshop/.

Would anyone be interested in a workshop on easy text mining/computational analysis? I would be happy to run a workshop focusing on one or both of two approaches: Lexomics and Topic Modelling. Both can be used by literary and linguistic scholars, as well as for historical and cultural analysis based on texts or text corpora.

Lexomics is a technique (and a suite of tools) for comparing texts (or chunks of texts) for similarities based on word frequency analysis. The tools allow you to divide texts or groups of texts into chunks and then perform hierarchical clustering on the chunks to compare word frequencies. The output is a tree diagram (dendrogram) which displays similarities between chunks. One example is a dendrogram comparing the Old English poems Daniel and Azarias: it reveals that chunks 4-6 of Daniel have a marked similarity to Azarias (the result, it turns out, of a common source).
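For the technically inclined, the core of the method can be sketched in a few lines of Python with scipy; this is only an illustration of the idea, not the Lexomics tools themselves, and the file name and chunk size are placeholders.

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Cut the text into equal-sized chunks of words.
words = open("daniel.txt", encoding="utf-8").read().lower().split()
size = 500  # words per chunk
chunks = [words[i:i + size] for i in range(0, len(words), size)]

# Build a relative word-frequency vector for each chunk.
vocab = sorted(set(words))
index = {w: j for j, w in enumerate(vocab)}
matrix = np.zeros((len(chunks), len(vocab)))
for i, chunk in enumerate(chunks):
    for w, n in Counter(chunk).items():
        matrix[i, index[w]] = n / len(chunk)

# Cluster the chunks hierarchically and draw the dendrogram.
labels = [f"chunk {i + 1}" for i in range(len(chunks))]
dendrogram(linkage(matrix, method="average"), labels=labels)
plt.show()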

Lexomics is easy to learn and easy to teach to students.

Topic modelling uses a more complex algorithm to arrange words found in texts into thematic “topics” so that the frequency of these themes can be compared in different texts or different parts of texts. A good account of the technique can be found on Ted Underwood’s blog (and see the links supplied there).

Below is a list of examples of topic modeling taken from Clay Templeton’s Topic Modeling in the Humanities: An Overview:

Synchronic approaches (unit of analysis is not time bound)
Matthew Jockers’ work on the Day of DH blog posts (2010).
Elijah Meeks’ work on self-definitions of digital humanists (2011).
Jeff Drouin’s work on Proust (2011).
Travis Brown’s work on Jane Austen’s Emma and Byron’s Don Juan (2011).

Diachronic approaches (unit of analysis is a time slice)
Cameron Blevins’ work on Martha Ballard’s diary (2010).
Robert K. Nelson’s work on the Richmond Daily Dispatch corpus (2011).
Yang, Torget, and Mihalcea’s work on Texas newspapers (2011).

My own students’ work on the Middle English poem Havelok the Dane is another good example of the technique (especially in a pedagogical context).

The easiest way to do topic modelling is with the Machine Learning for Language Toolkit (MALLET). There is a convenient GUI version which allows you to run MALLET without having to install it (though installation is not very difficult either).

Very little prior knowledge is needed for the workshop, other than basic manipulation of text files and Microsoft Excel. Everything else can be picked up pretty quickly. Access to a laptop will allow participants to download and experiment with the topic modelling tools (the Lexomics tools are web-based).

I could probably cover both Lexomics and topic modelling quickly in an hour, but two hours would be preferable to address both theory and method, as well as to allow people to get some hands-on practice with their own texts.
