Watch your balance!
For the past few years I've been helping with a project on the social history of early women artisan photographers. In looking for examples in newspapers or on eBay, phrases like "woman photographer", "female photographer", and "lady photographer" can be very useful. Being the recidivist linguist that I am, I decided to look for those phrases in a standard balanced corpus of historical English, namely the Corpus of Historical American English (COHA). Imagine my surprise when I found only a handful of results!
One reason I was surprised is that those phrases are very useful. For example, here are the tabulated search results from the commercial service Newspapers.com.
Google's ngram viewer, whose data is derived from books, gives similarly helpful results:
You're probably wondering what the handful of results from COHA are. Well, here they are, all 5 of them, literally a handful. And since in the project we're interested primarily in the period from 1840 to 1930, these results aren't very useful, though it is fun to know that there is a female photographer in the Katharine Hepburn and Spencer Tracy movie Adam's Rib.
|Phrase|Year|Genre|Title|
|lady photographer|1956|Fiction|Last Angry Man|
|woman photographer|1978|Non Fiction|Wild Wild Woman|
|female photographer|1949|Movie|Adam's Rib|
So what's going on with COHA? How does it miss the cultural phenomenon of women photographers, which shows up not only in the newspaper and book mentions, but also in the Census results:
I have to say that I'm not completely sure. It is certainly true that the absolute number of search results is not huge, and even as a proportion of the hits for "photographer", we're talking less than 1% at best for any decade.
It may be that, as big as COHA is (400 million words, of which about 100 million are from newspapers), it just isn't big enough to include many female/woman/lady photographers, by luck of the draw.
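To put a rough number on "luck of the draw", here is a back-of-the-envelope sketch. It treats hits as independent rare events (a Poisson model), which real text does not strictly obey because mentions tend to cluster within documents. The 400 million word figure is the one quoted above; the rate of 0.02 hits per million words is a made-up value for illustration only, not a measurement from COHA, Newspapers.com, or Google Books, and the function names are my own.

```python
import math


def expected_hits(corpus_words: int, rate_per_million: float) -> float:
    """Expected number of hits in a corpus, given a rate per million words."""
    return corpus_words / 1_000_000 * rate_per_million


def prob_at_most(k: int, expected: float) -> float:
    """Poisson probability of seeing k or fewer hits when `expected` are expected."""
    return sum(
        math.exp(-expected) * expected**i / math.factorial(i)
        for i in range(k + 1)
    )


COHA_WORDS = 400_000_000   # corpus size quoted in the post
HYPOTHETICAL_RATE = 0.02   # hits per million words; invented for illustration

lam = expected_hits(COHA_WORDS, HYPOTHETICAL_RATE)
print(f"expected hits: {lam:.1f}")
print(f"P(5 or fewer hits): {prob_at_most(5, lam):.2f}")
```

Under those assumptions a phrase that turns up thousands of times in a multi-billion-word newspaper archive could still be expected to appear only a few times in a 400 million word sample, so a handful of hits is not out of line with chance alone.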
The moral of the story is that a balanced corpus, even a large one, is not necessarily the best source of information about language usage. Obvious, but easy to forget.