Word embeddings (word vectors) are meant to encode meaning without regard to how frequent the words are, but in practice word frequencies do affect the vectors and, more importantly, their similarities.
This matters because word vectors are used as input to other models; if the word vectors are distorted, that distortion can have a negative effect on the downstream components.
In this series of posts, I will look at a range of distributional and frequency effects in word embeddings, many of which are presented here for the first time. I consider four different models (skip-gram with negative sampling (sgns), FastText (ft), Glove, and positive pointwise mutual information (ppmi)) and show how they are similar and where they differ. The fact that they have different properties strongly suggests that the effects are not due to properties of language, but to properties of the models.
In a subsequent series, I will set out a framework for exploring mitigation strategies, that is, strategies whose goal is to improve performance on some metric, and show how those strategies perform on a couple of metrics, including word similarities.
This post is a high-level overview only; more details appear in the sections that follow.
My point of departure is the observation that the distribution of word similarities in a corpus does not span the full range that is theoretically possible, and that it is skewed. Introductory discussions of word embeddings present the cosine of the angle between two vectors as the most common measure of similarity and note that the cosine ranges from -1 to 1. While that is mathematically correct, the observed cosine similarities in a corpus may cover only part of that range. Furthermore, the mean of the distribution is positive, not 0 as we might expect. Here is an example, using the Stanford Glove vectors, which are derived from Wikipedia and Gigaword. The figure shows the density estimate of the distribution (the curved line) and similarities sampled uniformly from all the similarities (the small vertical lines along the x axis).
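To make this concrete, here is a minimal sketch of how such a sample of similarities can be computed; the file name (glove.6B.100d.txt) and the sample size are placeholders for illustration, not necessarily the exact setup behind the figure.

import random
import numpy as np

# Read the Glove vectors from their plain-text format:
# each line is a word followed by the components of its vector.
vectors = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)

# Sample word pairs uniformly and record their cosine similarities.
words = list(vectors)
sims = []
for _ in range(100000):
    a, b = random.sample(words, 2)
    va, vb = vectors[a], vectors[b]
    sims.append(float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))))

sims = np.array(sims)
print("min %.3f  max %.3f  mean %.3f" % (sims.min(), sims.max(), sims.mean()))

On vectors like these, the printed summary shows the shift described above: a range well inside [-1, 1] and a positive mean.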
I am particularly interested in small corpora, such as a single book, and the skew is even more striking with these small corpora. Here are examples comparing the four embedding methods mentioned above (skip-gram with negative sampling (sgns), FastText (ft), Glove, and positive pointwise mutual information with SVD (ppmi)) on Thackeray's Vanity Fair.
This starting point of shifted distributions will lead to a series of other related phenomena:
In each case I will compare the four models (as above) and point out similarities and differences across them. The differences show that the phenomena are not intrinsic to the corpus the vectors are derived from, but rather are connected to the models. More generally, we want to know which of the phenomena (or which parts of them) are due to the nature of vector spaces (e.g. hubs), which are due to the particulars of the models (e.g. the direction of the stratification correlation), and which are due to the nature of language (e.g. perhaps the existence of frequency effects).
Although the distribution of vectors is distorted, it could be that these distortions are not a problem, or even that they are a positive factor. In a subsequent series, I will propose a conceptual framework for exploring how we can address the issues of distortion, and then show that these techniques can have a positive effect on some intrinsic evaluations, including word similarities.
In a nutshell, the conceptual framework is simply the observation that we can make three different kinds of modifications to our models (broadly construed):
Of course, we can also do all of those in combination.
Finally, in addition to the types of modification techniques, some methodological recommendations arise from this work, namely:
For simplicity, I use a minimal preprocessing step, slightly elaborated from the one in the hyperwords package [5]. The preprocessing converts the input to ASCII, lowercases it, and removes punctuation except for the apostrophes of contractions and the possessive 's. Tokens are words, contractions (except that n't is not split off as a separate token), and the possessive 's. The actual command is the following:
iconv -c -f utf-8 -t ascii $1 |
  tr '[A-Z]' '[a-z]' |
  sed -E "s/[^a-z0-9']+/ /g" |
  sed -E "s/ '/ /g" |
  sed -E "s/' / /g" |
  sed -E "s/^'//g" |
  sed -E "s/'$//g" |
  sed -E "s/'v/ 'v/g" |
  sed -E "s/'ll/ 'll/g" |
  sed -E "s/'s/ 's/g" |
  sed -E "s/'d/ 'd/g" |
  sed -E "s/i'm/i 'm/g" |
  sed -E "s/'re/ 're/g" |
  sed -E "s/ +/ /g"
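To illustrate, a hypothetical input line such as

I'm sure Becky's friends aren't here.

comes out of the pipeline as

i 'm sure becky 's friends aren't here

with the possessive 's and the contracted 'm split off as their own tokens, and n't left attached to its verb.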
I have used spacy to split the input into sentences before doing the preprocessing.
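That step can be sketched roughly as follows; the model name (en_core_web_sm) and the file names are assumptions for illustration, not necessarily what was used here.

import spacy

# Any English pipeline whose parser (or sentencizer) sets sentence boundaries will do.
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 3_000_000  # a whole novel can exceed spacy's default character limit

with open("vanity_fair.txt", encoding="utf-8") as f:
    text = f.read()

# Write one sentence per line; the shell pipeline above is then applied to this file.
with open("vanity_fair.sents.txt", "w", encoding="utf-8") as out:
    for sent in nlp(text).sents:
        out.write(sent.text.strip() + "\n")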
In calculating the word vectors, I have used the gensim [3] implementations of skip-gram with negative sampling (sgns) and FastText. For Glove vectors, I use the original implementation [4]. For PPMI with SVD I use the hyperwords package [5] implementation, ported by me to Python 3.
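For the gensim models, the training calls look roughly like the sketch below (gensim 4.x API); the file name and the hyperparameter values are illustrative, not necessarily the settings used for the experiments.

from gensim.models import FastText, Word2Vec

# One tokenized sentence per line of the preprocessed file.
with open("vanity_fair.clean.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# sg=1 selects skip-gram, negative=5 turns on negative sampling.
sgns = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1, negative=5)
ft = FastText(sentences, vector_size=100, window=5, min_count=5, sg=1, negative=5)

# Cosine similarity between two word vectors (words chosen for illustration).
print(sgns.wv.similarity("becky", "amelia"))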
Thanks to John Bear for helpful suggestions. All faults are mine.
[1] Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
[2] David Mimno and Laure Thompson. 2017. The strange geometry of skip-gram with negative sampling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2873-2878.
[3] Gensim: https://radimrehurek.com/gensim/. Published as: Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45-50.
[4] Glove: https://nlp.stanford.edu/projects/glove/. Published as: Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.
[5] Hyperwords: https://bitbucket.org/omerlevy/hyperwords. Published as: Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics (TACL).