What is person-oriented correspondence?
One example that we are working with is the letters between Elizabeth Barrett and Robert Browning (seen in the logo at right) in the 20 months before they got married. (They never wrote letters to each other after they got married since they were never again separated.) There are a variety of questions that we can ask, and hopefully answer, about these letters, and specifically the language used in them. For example, we can ask how their language changed over time as their relationship changed from strangers to spouses. Since the letters form a type of dialogue, we can also ask to what extent aspects of spoken dialogue occur in the letters. For example, do they adapt their speech to each other, especially in response to the previous letter(s)?
Another example of person-oriented correspondence that we are working with is the letters from Ambrose Bierce. Here, we do not have the dialogue, but we can still look at how his language changes over a much longer period than the 20 months of the Barrett-Browning letters. We can also ask whether his language varies according to his correspondent. (The answer is probably yes.)
Of course, the questions that we ask about the Barrett-Browning letters are the same ones we could ask about any corpus of letters between two people. Similarly, the questions we ask about the Bierce letters could be asked about any corpus of letters from a single individual, for example letters from Michelangelo (see here for an example). What matters is not the individuals, but the defining characteristic of the corpora. Stepping back a bit, it is also clear that letters are only one type of correspondence. Telegrams, email, and SMS are all types of correspondence, and corpora consisting of these types of correspondence lend themselves to the same kinds of questions, as long as they have the same defining characteristic.
Thinking about corpora in terms of defining characteristics lets us better conceptualize our inquiries, allowing us to find commonalities across seemingly disparate corpora. At the same time, visualizations can be extremely valuable in exploring and understanding corpora, and data more generally, and visualizations that are appropriate for one corpus with a given defining characteristic will be appropriate for another corpus with the same defining characteristic. Thus, on a very practical level, by exploring particular defining characteristics in depth, as in the case of person-oriented correspondence, we can identify, create, and reuse visualizations for these corpora.
Work in progress
Corpora with defining characteristics are an instance of what I have called dataset genres in a series of talks. Refinements and further development of the notion of dataset genres are ongoing.
We are preparing other corpora of correspondence. Letters by Michelangelo (in Italian) were recently released.
On the linguistic analysis side, we have some preliminary results suggesting that Bierce wrote to women differently than he did to men, in subtle ways. Other analyses of the Barrett-Browning letters are still in the preliminary phase. There are also some surprising results concerning Michelangelo's use of the familiar and formal pronouns. You can discover these yourself using our visualization.
A variety of student projects and B.A. theses have involved person-oriented correspondence. Several projects have concerned automatically detected topics, e.g. visualizing relations between letters and topics, visualizing topics over time, similarity of letters by topic, etc. Other projects have looked at formality and sentiment analysis.
We have three small preliminary annotated corpora:
- Letters between Elizabeth Barrett and Robert Browning (1845-1846)
- Letters from Ambrose Bierce to a variety of people (1892-1913)
- Letters from Michelangelo Buonarroti to a variety of people (1497-1524)
All corpora are annotated with information about the letters (author and/or addressee, date, etc.), as well as with token, lemma, and part of speech information (all automatically generated). The Barrett-Browning letters also have (some) named entities annotated, while the Bierce and Michelangelo letters have some additional structural annotations (e.g. salutation and closing, paragraphs, etc.).
All corpora come in an XML version. The Barrett-Browning letters are in Text-Corpus Format (TCF), while the Bierce and Michelangelo letters are in custom formats (DTDs provided). In addition, the Barrett-Browning and Michelangelo letters come in other formats: individual letters as XML, and all the letters as a tab-delimited "vertical file".
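As an illustration of how a tab-delimited vertical file might be read, here is a minimal Python sketch. It assumes one token per line with exactly three columns (token, lemma, part of speech); the actual column layout of the project's files may differ, so check the distributed files before relying on it. The `Token` type, the `read_vertical` function, and the sample lines are all hypothetical and invented for this example; they are not taken from the corpora.

```python
import csv
import io
from typing import NamedTuple


class Token(NamedTuple):
    """One row of a vertical file (assumed column order)."""
    form: str
    lemma: str
    pos: str


def read_vertical(stream):
    """Parse tab-delimited (token, lemma, POS) lines into Token records.

    Blank lines and lines without exactly three fields (for example,
    structural markup lines) are skipped.
    """
    tokens = []
    # QUOTE_NONE keeps quotation marks in the text as ordinary characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue
        tokens.append(Token(*row))
    return tokens


# Invented sample data, not quoted from any of the letters.
SAMPLE = "I\tI\tPP\nwrote\twrite\tVVD\n\nletters\tletter\tNNS\n"

tokens = read_vertical(io.StringIO(SAMPLE))
lemmas = [t.lemma for t in tokens]
```

To read a real file, one would pass an open file object instead of the `io.StringIO` wrapper, for example `open(path, encoding="utf-8", newline="")`.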
The letters are freely available under a Creative Commons License, from the letters corpora page.
We use a variety of software tools in the project:
For text visualization, we use DoubleTreeJS, an interactive, compact way of viewing concordance-like information. An example is here, where Elizabeth Barrett's letters are in purple on the left and Robert Browning's letters are in green on the right. For a more standard, though modernized, concordance view, we have KWICis.
For querying and visualizing n-gram distributions over time, we use Slash/A, a visualization developed by Velislava Todorova and Maria Chinkina. For more information, please see the Slash/A page, where you can download it (the download includes the Barrett-Browning corpus). The Michelangelo letters can also be used with Slash/A.
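To make the idea of an n-gram distribution over time concrete, here is a minimal Python sketch that computes the relative frequency of one n-gram per time period. It is not how Slash/A is implemented; it only illustrates the kind of quantity such a visualization could plot. The function names, the (period, tokens) input shape, and the sample tokens are assumptions made up for this example and are not quoted from the letters.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequency_by_period(letters, target):
    """Relative frequency of `target` among all n-grams in each period.

    `letters` is an iterable of (period, tokens) pairs; `target` is a
    tuple of tokens whose length determines n.
    """
    n = len(target)
    hits = Counter()
    totals = Counter()
    for period, tokens in letters:
        grams = ngrams(tokens, n)
        totals[period] += len(grams)
        hits[period] += sum(1 for g in grams if g == target)
    # Periods with no n-grams at all are left out to avoid division by zero.
    return {p: hits[p] / totals[p] for p in sorted(totals) if totals[p]}


# Invented sample data: (year-month, lowercased tokens).
LETTERS = [
    ("1845-01", ["dear", "miss", "barrett", "dear", "miss"]),
    ("1845-01", ["dear", "friend"]),
    ("1846-09", ["my", "own", "dear", "friend"]),
]

freq = ngram_frequency_by_period(LETTERS, ("dear", "miss"))
```

Normalizing by the number of n-grams in each period, rather than reporting raw counts, keeps periods with many letters comparable to periods with few.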
For topic visualization, we have developed a tool to help explore and refine topics generated by MALLET (see below). Now available for download, TMT Assistant was written by Eyal Schejter and Sabrina Galasso. We have also done some ad hoc visualizations of topics in the Barrett-Browning letters, as in this example.
We also have another demonstration visualization, comparing letters by Michelangelo according to their addressee.
For linguistic analysis, we have used WebLicht and TreeTagger, as well as MALLET for topic modeling. We have also developed an Italian part-of-speech tagger, based on Pattern, which is included with the Michelangelo letters.
An additional excellent resource is Niall O'Leary's website with many visualizations of a wide variety of correspondence, using their metadata.
Chris Culy (lead)
Agnia Barsukova, Maria Chinkina, Daniël de Kok, Valentin Deyringer, Corina Dima, Emanuel Dima, Sabrina Galasso, Julia Hancke, Lars Horber, L. Lee McIntyre, Zeeshan Mustafa, Eran Raveh, Eyal Schejter, Velislava Todorova
Part of this work has been supported by the German Federal Ministry for Education and Research (BMBF) as part of the grant CLARIN-D. We would also like to thank Professor Hilary Nesi, Emma Moreton, and Siân Alsop at Coventry University for helpful discussion and access to their correspondence corpora.