Quotatives in the Land of Oz and beyond
"Oh, yes;" said the girl, ... "No," he said, ...
Last time I showed a simple tool to explore the language of the books by L. Frank Baum about the Land of Oz. While that post focused on words, we can also use the tool to explore some aspects of grammar as well. In this post, I'll look at quotatives (direct quotations), as in the subtitle of the post. (BTW, I've updated the tool to fix a couple problems, and to add the ability to see the examples in a larger context — click on an example to see its context.)
The first thing to notice from the examples in the subtitle is that the position of the subject with respect to the verb said is different: the girl follows the verb while he precedes it. To make a long story short, that is the typical pattern, not only in the Land of Oz, but in fiction more generally as well as in newspaper articles. But to get there we need to make a journey. To Oz and beyond!
As a shorthand, we can use S for subject, V for verb and Q for quotation (similar to the SVO shorthand of linguistic typology). Using this shorthand, the two examples are QVS and QSV. The QVS order is often called "quotative inversion" since the verb precedes the subject, rather than following it as in more typical sentences. There are, of course, 6 logically possible orders of S, V, and Q:
(1) SVQ, (2) SQV, (3) VSQ, (4) VQS, (5) QSV, (6) QVS
Of these orders, I haven't found any examples of (2) and (4), where the subject and verb are separated by the quotation. That doesn't mean they don't exist, just that I haven't found them. Good luck with your hunt! As well, (3) is somewhat rare, but here are a couple examples:
Said the Scarecrow to this personage: "Show us at once to your master, the Emperor."
Marvelous Land of Oz
Said she: "My friend, I reward you for your swiftness by proclaiming you Prince of Horses, ..."
Dorothy and the Wizard in Oz
It turns out that VSQ order is fairly rare, and difficult to search for in the other corpora I'm using, so I'll set it aside, which is a bit of a shame, but there you have it.
Speaking of other corpora, people have noted that quotative inversion (QVS) occurs in fiction and in news articles more than in other genres, so I'll be using the Google ngram viewer to look for quotatives (not just QVS) in books (mainly fiction), and a commercial newspaper archive (Newspapers.com) for, well, news articles. The Library of Congress has a smaller, freely available newspaper archive, which shows similar trends, but which generally seems to have a greater skew towards pronouns and towards males (he and man) than Newspapers.com.
|Query or ratio||Library of Congress||Newspapers.com|
|he : she||18.20||3.23|
|the man went||34,432||160,802|
|the woman went||8,494||44,214|
|man : woman||4.05||3.63|
|he+she : man+woman||90.26||20.45|
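The man : woman row can be recomputed from the two count rows in the table. Here is a minimal sketch of that arithmetic; the dictionary layout and the function name are just illustrative choices, and only the four counts come from the table.

```python
# Page-hit counts copied from the table above.
counts = {
    "Library of Congress": {"the man went": 34_432, "the woman went": 8_494},
    "Newspapers.com": {"the man went": 160_802, "the woman went": 44_214},
}


def man_woman_ratio(hits):
    """Ratio of 'the man went' hits to 'the woman went' hits."""
    return hits["the man went"] / hits["the woman went"]


for archive, hits in counts.items():
    print(f"{archive}: man : woman = {man_woman_ratio(hits):.3f}")
```

The Library of Congress value rounds to the 4.05 shown in the table. The Newspapers.com value works out to roughly 3.637, so the 3.63 in the table may have been truncated rather than rounded.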
Getting back to what I'll be looking for with quotatives, to keep things from turning into a dissertation, I'll only be considering the verb said, with subjects she, he, the X, the woman, and the man.
Getting our hands dirty with the data
In order to compare the Land of Oz books with general fiction and newspapers, there are a lot of things we have to consider, given not only the corpora but also the tools we have to search them. This section lays out the choices I have made. If you don't want to get your hands dirty, you can go straight to the results.
Constraints of the tools
For the Land of Oz books (I'll call them WOZ for Wizard of Oz), we can do case (in)sensitive searches using regular expressions over words, spaces and punctuation (excepting commas), but we do not have any other linguistic information, like part of speech. For Newspapers.com, we can only do case insensitive word and phrase searches, with no punctuation, and high frequency words (like she and he) are not searchable on their own. Finally, for Google ngrams, we can search for words and phrases and punctuation (except commas), along with some basic part of speech information. While case insensitive searches are possible for some searches, for our purposes, case sensitive searches are the way to go.
Approximating what we'd really like
Ideally, we'd like to be able to search for subjects, verbs and quotations, but none of the tools allows us to do that, so we have to construct searches that approximate what we'd really like. Part of that approximation is limiting our attention just to the verb said. Another approximation is using only she and he instead of all pronouns (though in fact the other pronouns occur only rarely in quotatives in WOZ).
Non-pronominal noun phrases are essentially impossible to find generally using these text-based tools. However, we can find limited examples and hope that they are representative of noun phrases more generally. For Newspapers.com and Google ngrams, I have used the woman and the man and combined the results. For WOZ, I have used the regular expression the \w+ to find noun phrases starting with the (how long they are depends on the broader query). Names are truly impossible to find generally in Newspapers.com and Google ngrams. For WOZ, I've used [A-Z]\w+ combined with broader searches to limit the applicability.
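To make the two WOZ subject patterns concrete, here is a small sketch using Python's re module rather than the tool itself. The sample text is invented for illustration (it is not from the Oz books), and the \b word-boundary anchors are my addition. The last step mirrors the "[A-Z]\w+ went - He went - She went" query listed in the appendix, where sentence-initial pronouns are subtracted from the capitalized-word hits.

```python
import re

# Invented sample text, not taken from the Oz books.
text = "Dorothy went to the door. He went home. Then the man went away. She went too."

# Noun phrases starting with "the", modeled on the query: the \w+ went
np_hits = re.findall(r"\bthe \w+ went\b", text)

# Capitalized words, modeled on the query: [A-Z]\w+ went
capitalized_hits = re.findall(r"\b[A-Z]\w+ went\b", text)

# Remove sentence-initial pronouns, mirroring "- He went - She went".
name_hits = [hit for hit in capitalized_hits if hit not in ("He went", "She went")]

print(np_hits)           # ['the man went']
print(capitalized_hits)  # ['Dorothy went', 'He went', 'She went']
print(name_hits)         # ['Dorothy went']
```

Any other capitalized sentence-initial word directly before went would still slip through as a "name", which is one reason these searches are only approximations.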
Quotations are also tricky to find. Newspapers.com does not allow us to search for punctuation, so quotations are impossible to find directly. The crude approximation I use is yes, as in yes she said. For WOZ and Google ngrams, we can look for quotation marks, but "split" quotations will trip us up, and I do not try to account for them.
"A balloon," said Oz, "is made of silk, ..."
Wonderful Wizard of Oz
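Here is a rough sketch of how patterns like the WOZ ones behave on the examples quoted in this post, again using Python's re module as a stand-in for the tool. Two things are assumptions on my part: the labels are ad hoc, and commas are stripped before matching to imitate a search that cannot see them.

```python
import re

# Example sentences quoted in this post.
samples = [
    '"Oh, yes;" said the girl',
    '"No," he said',
    '"A balloon," said Oz, "is made of silk, ..."',
]

# Patterns modeled on the WOZ queries; the labels are ad hoc.
PATTERNS = {
    "QVS, the X": r"""[”'"] said the \w+""",
    "QVS, name": r"""[”'"] said [A-Z]\w+""",
    "QSV, he": r"""[”'"] he said""",
}


def contexts(sentence):
    """Return the labels of all patterns found in the sentence."""
    # Assumption: commas are invisible to the search, so drop them first.
    stripped = sentence.replace(",", "")
    return [label for label, pattern in PATTERNS.items() if re.search(pattern, stripped)]


for sentence in samples:
    print(contexts(sentence), sentence)
```

The split quotation is simply reported as "QVS, name": the pattern sees the closing quotation mark before said Oz, and nothing records that the quotation continues afterwards.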
We'd also like to compare the distribution of subjects in quotatives (pronouns vs. noun phrases) with subjects of other verbs. Again, we can't do this generally, so I use went as a representative verb.
In the appendix I've listed the queries I used.
Finally, there is the question of counting the results. As mentioned above, I combine the pronouns and the nouns separately. For WOZ, the counts are for all the books. For Newspapers.com, the counts are the page hits from 1900-1920 (the publication years of WOZ), where a page may have multiple instances (and there may also be duplicates, since articles get distributed by wire services). For Google ngrams, the counts are the average of the counts for 1900 and those for 1920. This means that we have the actual counts only for WOZ; for the other two corpora, we have approximations.
Reaching the first stop towards our destination
So what did I find with all these searches? The chart below shows the results, representing the relative magnitudes of the frequencies of the different categories in the three corpora. The frequencies indicated are not to scale, but give a fair representation of the relationships within each corpus, which is what I'm interested in.
We can see a couple things from the chart. One is that for NPs (the X/woman/man), WOZ and news share the same frequency ordering: QSV < SVQ < QVS, while fiction reverses QSV and SVQ. However, all 3 corpora show quotative inversion (QVS) as the most frequent order for NPs.
A second thing we can see from the chart is that for she/he, WOZ and fiction share the same frequency ordering: SVQ < QVS < QSV, while news reverses SVQ and QVS. However, once again all 3 corpora have the same most frequent order for she/he, namely QSV, with the quotation first but no inversion of subject and verb, unlike with NPs.
I can also add that while we can search for names only in WOZ, at least there they seem to pattern with NPs in WOZ, rather than she/he, having the frequency ordering: QSV < SVQ < QVS.
All of the frequency ordering patterns are highly significant (χ², p < 0.00001).
It is also the case that she/he occur more often than NPs in each quotative context, with the exception of QVS in WOZ, where NPs are more common. While the preponderance of pronouns in and of itself is not surprising, since pronouns are more common than (simple) NPs, what is notable is that when we compare the quotative contexts with X went, we find that this comparison is also highly significant (again, χ², p < 0.00001), except for SVQ in WOZ, which is not significant (p < 0.97). In other words, the skew towards pronouns is significantly more pronounced in quotative contexts than in the non-quotative context. WOZ is exceptional, however, for the reversed preference in QVS, as well as in the non-significant distribution in SVQ.
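For readers who want to try this kind of comparison themselves, here is a minimal sketch of a Pearson chi-squared test of independence on a 2x2 table (context by subject type), written with only the standard library. It is not necessarily how the tests reported here were computed, the function name is hypothetical, and the counts are placeholders rather than figures from any of the corpora.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value (1 degree of freedom)
    for a 2x2 table of counts, without continuity correction."""
    (a, b), (c, d) = table
    total = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if context and subject type were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Placeholder counts, NOT from any of the corpora:
# rows = context (quotative, "went"), columns = subject (she/he, NP).
placeholder = [[900, 100], [700, 300]]
stat, p = chi_squared_2x2(placeholder)
print(f"chi-squared = {stat:.2f}, p = {p:.3g}")
```

A table whose rows have the same pronoun-to-NP proportions gives a statistic of 0 and a p-value of 1, which is the "no difference between contexts" case.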
It would be nice to compare fiction and non-fiction books, but Google ngrams does not separate out the two. However, we can compare fiction with all books ("General" in the chart below), using just she and the woman for simplicity.
What we see is that the two corpora have the same orderings for both she and for the woman. Since the general category includes fiction, it's not clear what else we can make of this.
However, there are other interesting things that we can see in Google ngrams, especially if we expand our view beyond the time frame of the Land of Oz. For example, we see an interesting change in quotative contexts with NPs. Around 1925, QSV for the woman and the man starts taking off (in fiction), and by 1975 the two orders have switched, so that QVS < QSV. In other words, quotative inversion (QVS) seems to be in decline. A quick look at Newspapers.com suggests a similar decline in quotative inversion in news articles.
We can also compare American and British English for quotative contexts. For example, with she, there are large variations in QSV for both varieties over time, and in their relative proportions. In addition, the QVS and SVQ orders with she have declined dramatically, becoming vanishingly infrequent by 2000.
To sum up, there are large consistencies across the corpora in the period 1900-1920, with quotative inversion (QVS) predominating with NPs and the other quotation-initial order (QSV) predominating with she/he. We saw that WOZ sometimes patterns with fiction and sometimes with news. Finally, we saw interesting differences across time and space, especially in the decline of quotative inversion with NPs.
For all the searches, there are still many things that deserve (more) attention. At the level of the phenomena being investigated we have:
- VSQ order
- split quotes, e.g. "..." said X "..."
- coordinated verb phrases: ... and said Q
- other verbs besides said (e.g. ask, reply, ...)
In addition, we could look at other properties of the components of quotative constructions besides pronoun vs NP subjects, such as:
- tense/aspect of the verb, e.g. say/said/saying
- length of the quotation
- length of NP subject
- "topicality" of the subject
- indications of formality: are certain quotative orders more/less formal?
- topics: are certain quotative orders used more/less with certain kinds of topics?
Also interesting would be a more detailed examination of differences across time and space.
Finally, the elephant in the room: an analysis. What I've presented here are observations and descriptions, but no analysis, which requires more kinds of information, as suggested above. However, these textual queries are not really adequate to gather those kinds of information. Rather, annotated data is needed, with the tools to search those annotations in addition to the text. The tools exist, but doing the annotation is the hard part.
Maybe for someone's dissertation...
But now it's time to follow the road of yellow brick(s) (in the books) or the yellow brick road (from the MGM movie) to seek new adventures ...
Appendix: the queries
|Context||Subject||WOZ||Newspapers.com||Google ngrams|
|SVQ||she||she said ['"“] - [”'"] she said ['"“]||"she said yes"||she said " + she said ' + She said " + She said '|
|SVQ||he||he said ['"“] - [”'"] he said ['"“]||"he said yes"||he said " + he said ' + He said " + He said '|
|SVQ||woman||the \w+ said ['"“] - [”'"] the \w+ said ['"“]||"the woman said yes"||the woman said " + the woman said ' + The woman said " + The woman said '|
|SVQ||man||NA||"the man said yes"||the man said " + the man said ' + The man said " + The man said '|
|SVQ||name||[A-Z]\w+ said ['"“] - [”'"] [A-Z]\w+ said ['"“]||NA||NA|
|QSV||she||[”'"] she said||"yes she said"||" she said + ' she said|
|QSV||he||[”'"] he said||"yes he said"||" he said + ' he said|
|QSV||woman||[”'"] the \w+ said||"yes the woman said"||" the woman said + '|
|QSV||man||NA||"yes the man said"||" the man said + ' the man said|
|QSV||name||\w+?. [”'"]\s[A-Z]\w+ said||NA||NA|
|QVS||she||[”'"] said she||"yes said she"||" said she + ' said she|
|QVS||he||[”'"] said he||"yes said he"||" said he + ' said he|
|QVS||woman||[”'"] said the \w+||"yes said the woman"||" said the woman + ' said the woman|
|QVS||man||NA||"yes said the man"||" said the man + ' said the man|
|QVS||name||[”'"] said [A-Z]\w+||NA||NA|
|went||she||she went||she went||she went|
|went||he||he went||he went||he went|
|went||woman||the \w+ went||the woman went||the woman went|
|went||man||NA||the man went||the man went|
|went||name||[A-Z]\w+ went - He went - She went||NA||NA|