Word vectors with small corpora

© 2018 Chris Culy, March 2018



In this series of posts, I'll explore some questions concerning how, and to what extent, the word vector approach to meaning can be usefully applied to small corpora. I'll look at issues of randomization and evaluation, as well as how to explore word similarities. While the focus here is on small corpora, many of the points are relevant to large corpora as well.


One important view about word meaning is that the meaning of a word is determined by (or manifested in) the word contexts in which it occurs. Word vectors are a computational way to concretize this distributional idea of meaning. While there are many different specific approaches to word vectors, they all have at their heart the following idea: for each word in our corpus, we count how many times other words occur in its context. The approaches vary in terms of what counts as the context, how the counting is done, and how the raw counts are transformed (e.g. normalization by the total number of words). The "vectors" of word vectors are simply the lists of the transformed counts.
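To make the counting idea concrete, here is a minimal sketch in Python. It is only an illustration, not the method of any particular tool: the function names, the toy sentence, and the choice to normalize by the total number of context words seen for the target word are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2):
    """For each word, count the words found within `window` positions on either side."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                counts[word][tokens[j]] += 1
    return counts

def to_vector(context_counts, vocabulary):
    """Turn one word's context counts into relative frequencies, ordered by vocabulary."""
    total = sum(context_counts.values())
    return [context_counts[w] / total if total else 0.0 for w in vocabulary]

# Toy corpus (hypothetical), already tokenized.
tokens = "the whale swam and the whale dived".split()
counts = cooccurrence_counts(tokens, window=1)
vocabulary = sorted(set(tokens))
whale_vector = to_vector(counts["whale"], vocabulary)
```

Here whale_vector has one entry per vocabulary word; the approaches discussed in these posts differ in how they define the context and transform the counts, and they reduce the result to far fewer dimensions.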

There are various choices to be made when analyzing a corpus with word vectors. Some of them are specific to the approach, but the main choices across approaches are the following:

  • the kind of preprocessing of the text (e.g. lowercasing, removing stop words, etc.)
  • minimum count: the lowest frequency a word can have and still be included (counted after the preprocessing)
  • window: the size of the context
  • dimension: the length of the vectors to use for the final representations
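The minimum count matters especially for small corpora, since many words occur only once or twice. As a rough sketch (the function name and toy sentence are hypothetical, and real tools may apply the threshold differently), the filter could look like this in Python:

```python
from collections import Counter

def apply_min_count(tokens, min_count):
    """Drop every token whose corpus frequency is below min_count."""
    frequencies = Counter(tokens)
    kept = [t for t in tokens if frequencies[t] >= min_count]
    return kept, set(kept)

# Toy corpus (hypothetical), already tokenized.
tokens = "the whale swam and the whale dived".split()
kept, vocabulary = apply_min_count(tokens, min_count=2)
```

With min_count=2, only "the" and "whale" survive in this toy example: three of the five word types disappear, which hints at how quickly a high threshold can shrink the vocabulary of a small corpus.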

Word vectors have become a popular technique for constructing word meaning computationally, partly because they are simple, but mainly because they are effective at helping to do other tasks, whether directly related to meaning (e.g. finding similar words) or indirectly (e.g. parsing, summarization, translation, etc.). A large part of the success of word vectors is due to their use with large corpora of tens of millions of words or more (one word vector tool says that as long as there are at least 10 million words in the corpus, the tool should work well).

However, in some domains, especially in digital humanities, the corpus of interest might be much smaller. For example, the combined letters of Elizabeth Barrett and Robert Browning to each other have fewer than 500,000 tokens, including punctuation (see http://chrisculy.net/lx/resources/). The longish novel Moby Dick has just over 200,000 words, excluding punctuation. The shortish novel Heart of Darkness has under 40,000 words, excluding punctuation.

These lengths are a far cry from tens of millions of words, or even one million. There are thus questions about how, and to what extent, the word vector approaches can be usefully applied to these smaller corpora. In this series of posts, I'll explore some of those questions, including issues of randomization, evaluation, and how we can explore word similarities.

Technical details


For simplicity, I'll use a very simple preprocessing step, slightly elaborated from the hyperwords package [2]. The preprocessing converts the input to ASCII, lowercases it, and removes punctuation except for the apostrophes of contractions and possessive 's. Tokens are words, contractions (except that n't is not split off as a separate token), and the possessive 's. The actual command is the following:

iconv -c -f utf-8 -t ascii $1 | tr '[A-Z]' '[a-z]' | sed -E "s/[^a-z0-9']+/ /g" | sed -E "s/ '/ /g" | sed -E "s/' / /g" | sed -E "s/^'//g" | sed -E "s/'$//g" | sed -E "s/'v/ 'v/g" | sed -E "s/'ll/ 'll/g" | sed -E "s/'s/ 's/g" | sed -E "s/'d/ 'd/g" | sed -E "s/i'm/i 'm/g" | sed -E "s/'re/ 're/g" | sed -E "s/  +/ /g"
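For readers who want to follow the pipeline step by step, here is an approximate Python port of the command above. It is a sketch rather than the code used for these posts: the function name is made up, it handles one line at a time (as sed does), and it assumes that silently dropping non-ASCII characters matches what iconv -c does with the input.

```python
import re

def preprocess_line(line):
    """Approximate, step-by-step port of the shell pipeline for a single line."""
    # iconv -c: drop characters that have no ASCII equivalent
    line = line.encode("ascii", errors="ignore").decode("ascii")
    # tr: lowercase
    line = line.lower()
    # replace runs of anything other than letters, digits and apostrophes by a space
    line = re.sub(r"[^a-z0-9']+", " ", line)
    # remove apostrophes at the start or end of a word (i.e. quotation marks)
    line = line.replace(" '", " ")
    line = line.replace("' ", " ")
    line = re.sub(r"^'", "", line)
    line = re.sub(r"'$", "", line)
    # split off contractions and possessive 's, in the same order as the sed calls
    for suffix in ("'v", "'ll", "'s", "'d"):
        line = line.replace(suffix, " " + suffix)
    line = line.replace("i'm", "i 'm")
    line = line.replace("'re", " 're")
    # collapse runs of spaces
    return re.sub(r"  +", " ", line)

tokens = preprocess_line("Don't stop; I'm sure it's Ahab's ship.").split()
# tokens: don't, stop, i, 'm, sure, it, 's, ahab, 's, ship
```

Note that n't stays attached to its word ("don't"), while 'm and 's become separate tokens, as described above.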

For some experiments, I have used spaCy to split the input into sentences before doing the preprocessing.

Word vector approaches

I will use three approaches to word vectors:

  • word2vec
  • FastText
  • SVD reduction of Positive Pointwise Mutual Information (ppmi_svd)

word2vec has become a standard reference point for word vectors; FastText is an extension of word2vec that incorporates sub-word information. I use the gensim [1] implementations of these.

The ppmi_svd approach was chosen in part in connection with the randomization issues, and in part because it has been reported to work better than word2vec and others for smaller corpora (though those corpora are still large by the standards here). I use the hyperwords package [2] implementation, which I ported to Python 3.
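As a reminder of what the PPMI part computes, the positive pointwise mutual information of a word w and a context c is max(0, log(P(w,c) / (P(w) P(c)))). The sketch below computes it from raw pair counts in plain Python. It is a simplified illustration with made-up counts and a hypothetical function name, not the hyperwords implementation, which has further options that are not shown; the subsequent SVD step, which reduces the PPMI matrix to the chosen dimension, is also left out.

```python
import math
from collections import Counter

def ppmi(pair_counts):
    """pair_counts maps (word, context) to a co-occurrence count; returns PPMI per pair."""
    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n
    scores = {}
    for (word, context), n in pair_counts.items():
        # P(w,c) / (P(w) P(c)) simplifies to n * total / (count(w) * count(c))
        pmi = math.log((n * total) / (word_totals[word] * context_totals[context]))
        scores[(word, context)] = max(0.0, pmi)  # negative associations are clipped to 0
    return scores

# Made-up counts, for illustration only.
pair_counts = {
    ("whale", "sea"): 4,
    ("whale", "the"): 2,
    ("ship", "sea"): 1,
    ("ship", "the"): 3,
}
scores = ppmi(pair_counts)
```

In this toy example, "whale" gets a positive score with "sea" because the pair occurs more often than the individual frequencies would predict, while its score with "the" is clipped to 0.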


This exploration could not have been done without the tools and research of others, especially the work behind gensim and hyperwords.

[1] Gensim: https://radimrehurek.com/gensim/, published as "Software Framework for Topic Modelling with Large Corpora" Radim Řehůřek and Petr Sojka, Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45-50, 22 May 2010.

[2] Hyperwords: https://bitbucket.org/omerlevy/hyperwords, published as "Improving Distributional Similarity with Lessons Learned from Word Embeddings" Omer Levy, Yoav Goldberg, and Ido Dagan. TACL 2015.