The nora project
I came across the nora project through a mention on the Text Analytics mailing list. The goal of the project is to support humanities scholars in the interpretation of literary works. I had a look at the demo video, which shows the noraVis prototype data mining and visualization software in use.
The demo video can be viewed here.
The demo takes a sample of 300 of Emily Dickinson's letters and attempts to locate every letter in the sample that contains erotic language. The software requires the user to manually rate a small subset on a scale of 1 to 5, with 5 being most erotic or "hot" and 1 being least erotic or "not hot". In this demo, the subset consisted of 40 letters in total, 20 of which were manually tagged as "hot" and 20 as "not hot". This manually tagged subset serves as the "training set" used by the classification algorithm that drives the data mining process. Learning from examples labelled by a person is called "supervised learning", because the user supervises the training by supplying the correct answers. Some algorithms need no labelled examples; these are called "unsupervised". Once the 40 letters in the training set have been tagged, the classification algorithm can go to work, and when it is done it provides, for each of the 300 letters, a statistical probability of its level of erotic language on the same 1 to 5 scale.
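The supervised workflow described above can be sketched in a few lines of Python. This is only an illustration and not the algorithm nora actually uses, which the demo does not name: it assumes a simple Naive Bayes classifier, collapses the 1 to 5 scale to the two tags "hot" and "not hot", and trains on a handful of invented sentences rather than on Dickinson's letters. The names `tokenize`, `train`, `prob_hot` and `TRAINING` are hypothetical.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def train(labelled_docs):
    """Count words per tag from (text, tag) pairs tagged by a person."""
    word_counts = {}        # tag -> Counter of word frequencies
    doc_counts = Counter()  # tag -> number of training documents
    vocab = set()
    for text, tag in labelled_docs:
        doc_counts[tag] += 1
        counts = word_counts.setdefault(tag, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return word_counts, doc_counts, vocab

def prob_hot(model, text):
    """Return the estimated probability that `text` is "hot"."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for tag, n_docs in doc_counts.items():
        counts = word_counts[tag]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # prior for this tag
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        log_scores[tag] = score
    # Turn log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {tag: math.exp(s - top) for tag, s in log_scores.items()}
    return weights.get("hot", 0.0) / sum(weights.values())

# Invented toy training set standing in for the 40 hand-tagged letters.
TRAINING = [
    ("my heart burns for you tonight", "hot"),
    ("come to me my darling", "hot"),
    ("the garden needs rain this week", "not hot"),
    ("father bought a new horse", "not hot"),
]

model = train(TRAINING)
for letter in ["my darling heart", "the horse needs rain"]:
    print(letter, "->", round(prob_hot(model, letter), 2))
```

With a real training set, the same `prob_hot` call would be applied to each of the 300 letters to rank them, which is roughly what the prediction column in the demo shows.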
Classification tools like this are nothing new, but what is new and particularly interesting is the way the noraVis application displays the data and lays out its user interface components. It provides the user with a coordinated display of a word index with relevancy indicators, document titles, document predictions, document ratings and the full text. Clicking on an item in one of these views changes what is displayed in the others, allowing for a novel way to read, rate and analyze a set of documents.
If the documents carry metadata (in the demo, this included date of composition, word count per document, recipient name, etc.), these metadata values can be used in a scatter plot to view correlations between them and the classified letters.
One can play with the different variables to see, for example, whether Dickinson's earlier letters were "more erotic" by plotting the classified letters against their date of composition.
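The kind of question the scatter plot answers by eye can also be put as a number. The sketch below computes a Pearson correlation coefficient between the year of composition and the predicted rating. The `(year, rating)` pairs are made up purely for illustration and are not results from the demo, and `pearson` is a hypothetical helper, not part of noraVis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (spread_x * spread_y)

# Invented (year of composition, predicted rating on the 1 to 5 scale) pairs.
letters = [(1850, 4.0), (1855, 3.5), (1860, 3.0), (1870, 2.0), (1880, 1.5)]
years = [year for year, _ in letters]
ratings = [rating for _, rating in letters]

r = pearson(years, ratings)
print("correlation between year and rating:", round(r, 2))
```

A value of `r` near -1 would mean that earlier letters tend to be rated higher, a value near 0 that there is no linear relationship, and a value near +1 that later letters are rated higher. A correlation on its own says nothing about why the pattern exists.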
This is a very useful tool, and very well thought out.






