One of the things I'm interested in is how techniques that work in one context might work in other contexts, and what we can learn about those techniques when we go beyond their typical applications.
Word embeddings, aka word vectors, are typically used with large corpora, such as Wikipedia, Common Crawl web pages, or massive numbers of tweets. One paper said something to the effect of "As long as your corpora have 100 million words, this technique will work."
But what if your corpus doesn't have 100 million words? What if you are interested in how an author uses words in just one book?
That question has prompted me to look at word embeddings and see how they might be used with corpora that don't come anywhere near 100 million words. This is work in progress, and my ideas keep changing as I find out more. However, some of my preliminary write-ups can be found here. In them, I propose a new evaluation measure that takes small vocabularies into account. I also discuss alternative ways to explore word similarities when the standard test sets are not relevant. Of course, some of these ways involve visualization.
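To make the small-corpus setting concrete, here is a minimal sketch of one classic way to get word vectors from just a handful of sentences: raw co-occurrence counts reduced with SVD. This is a count-based illustration of the general idea, not the specific method from my write-ups; the toy corpus, window size, and dimensionality are all made up for the example.

```python
# Sketch: word vectors from a tiny corpus via co-occurrence counts + SVD.
# The corpus and all parameters below are illustrative placeholders.
import numpy as np

sentences = [
    "the whale swam in the sea".split(),
    "the captain watched the whale".split(),
    "the sea was calm and the whale was near".split(),
]

vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a +/-2 word window.
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - window), min(len(s), i + window + 1)):
            if j != i:
                counts[idx[w], idx[s[j]]] += 1

# Reduce the count matrix to dense vectors with a truncated SVD.
u, sing, _ = np.linalg.svd(counts)
dim = 5
vectors = u[:, :dim] * sing[:dim]

def most_similar(word, topn=3):
    """Rank other vocabulary words by cosine similarity to `word`."""
    v = vectors[idx[word]]
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(v) + 1e-9
    sims = vectors @ v / norms
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != word][:topn]

print(most_similar("whale"))
```

With a corpus this small, the neighbors are driven by a few co-occurrence counts rather than stable statistics, which is exactly why evaluation and exploration need different tools than they do at the 100-million-word scale.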
Although my interest is sparked by small corpora, many of the issues and ideas are relevant for large corpora as well.
Since those first explorations, I have gone in some different directions, which I hope to be writing up soon. Stay tuned!