This is one of a series of posts on using word vectors with small corpora. In this post I propose a technique to address the issue of the instability of word vector models due to their random aspects.
Most word vector approaches (e.g. word2vec, FastText, GloVe, etc.) will give different results when trained multiple times on the same corpus with the same parameter settings. This is due to the random sampling done as part of the approach. For word2vec and related approaches, the random sampling can occur in at least 3 places:
- the random initialization of the word vectors before training,
- the downsampling of high-frequency words (gensim's sample parameter), and
- negative sampling, which draws random "noise" words for each target/context pair.
With high-frequency items in large corpora, the difference in results across runs (for the rest of this post, "runs" will mean runs with the same parameter settings) may well be negligible or even non-existent. However, for lower-frequency items, and for small corpora, the differences can be striking. We might say that these approaches are unstable.
Here's an example, where we use Vanity Fair as our corpus, with just under 311,000 words. We'll look for the 5 words most similar to man (high frequency), woman and pen (medium frequency), and fox (low frequency), and we'll compare them across 5 runs of word2vec with identical parameters.
# imports
from gensim import models
from gensim.models.fasttext import FastText
from scipy.stats import spearmanr
# for tables in Jupyter
from IPython.display import HTML, display
import tabulate
# read sentences
fname = 'vanity_fair_pg599.txt-sents-clean.txt'
with open(fname) as f:
    sents = [line.strip().split() for line in f.readlines()]
print('%s: %d sentences' % (fname, len(sents)))
%%bash
# Info about the text and the target words
f=vanity_fair_pg599.txt-sents-clean.txt
wc $f
for w in man woman pen fox happy; do  # happy is used later
    echo "$w:"
    grep -c $w $f
done
# make some word2vec models
num_runs = 5
sg = 1  # 1 = skip-gram
(min_count, window, size, workers, downsample) = (2, 5, 20, 2, 0.001)
wmodels = [models.Word2Vec(sents, sg=sg, min_count=min_count, window=window,
                           sample=downsample, size=size, workers=workers)
           for i in range(num_runs)]
def compare_tops(vmodels, item, topn=5):
    """Show an HTML table of the topn most similar words to item for each model."""
    tops = []
    for i, m in enumerate(vmodels):
        tops.append(["Model %d" % i]
                    + ["<b>%s</b> %0.4f" % x
                       for x in m.wv.similar_by_word(item, topn=topn, restrict_vocab=None)])
    display(HTML(tabulate.tabulate(tops, tablefmt='html',
                                   headers=[item] + list(range(1, topn + 1)))))
def compare_all(vmodels, item):
    """Show a table of pairwise Spearman rho between models, comparing the ranking
    of all vocabulary words by similarity to item."""
    sims = [m.wv.similar_by_word(item, topn=False) for m in vmodels]
    rhos = []
    for i, sim in enumerate(sims):
        row = ["<b>Model %d</b>" % (i + 1)]
        for j in range(0, i + 1):
            row.append("%0.4f" % spearmanr(sims[i], sims[j])[0])
        rhos.append(row)
    display(HTML(tabulate.tabulate(rhos, tablefmt='html',
                                   headers=[item] + ["Model %d" % i
                                                     for i in range(1, len(vmodels) + 1)])))
topn = 5
for wd in ['man', 'woman', 'pen', 'fox']:
    display(HTML("<h4>%d most similar words</h4>" % topn))
    compare_tops(wmodels, wd, topn=topn)
    display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
    compare_all(wmodels, wd)
From these examples, we can see that although different runs of word2vec with the same parameters are similar (Spearman rho > 0.99), they are not identical. Not only do the similarity numbers differ, but even the relative ranking of the items may differ from one run to the next. For pen, even the most similar word differs across runs. These differences are problematic when we are trying to get an idea of how particular words are used in a small corpus: why should we choose one run over another?
There are a few different ways we can avoid the instability of word2vec et al. One way is to use an approach that does not use randomization, e.g. the ppmi_svd approach, which we will return to in the next post. Another way is to fix a random seed, so that even though randomness is used, it is the same randomness every time we run the model. Yet another way is to eliminate (as much as possible) the use of randomness in an approach, e.g. by not downsampling, not using negative sampling, etc. (See [1] for a much more detailed discussion.) However, these "tricks" defeat the purpose of using randomness in the first place, and are thus unsatisfying.
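As a minimal sketch of the fixed-seed trick (my own illustration, not code used later in this post): gensim's Word2Vec accepts a seed parameter, and for runs to be fully reproducible you also need a single worker thread (and, with gensim 3.x under Python 3, a fixed PYTHONHASHSEED in the environment).

# Sketch of the fixed-seed workaround, reusing sents and the parameter
# values from above.  A seed alone is not enough: thread scheduling with
# workers > 1 still introduces run-to-run differences.
from gensim import models

seeded = models.Word2Vec(sents, sg=1, min_count=2, window=5, size=20,
                         sample=0.001, seed=42, workers=1)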
However, we can use the small size of the corpora to our advantage. We can come up with a "consensus" model by taking the average of a number of models, i.e. finding the centroid of the models. We can either use a fixed number of models, or we can iteratively update the centroid model by setting a threshold for similarity across iterations.
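For the fixed-number variant, here is a minimal sketch (my own illustration, not the code used below); it assumes all models were trained on the same corpus with the same min_count, so their vocabularies line up index for index:

import numpy as np

def centroid_of_models(model_list):
    # average the word vectors of several identically-parameterized models
    # (assumes identical vocabularies in identical index order across models)
    base = model_list[0]
    base.wv.vectors = np.mean([m.wv.vectors for m in model_list], axis=0)
    base.wv.vectors_norm = None  # force the normalized vectors to be recomputed
    base.wv.init_sims()
    return base

# e.g. centroid = centroid_of_models(wmodels)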
In the example below, we will iteratively update a centroid model as follows:
1. Train an initial model on the corpus.
2. At each iteration, train a new model with the same parameters and average its vectors into the running centroid.
3. Compare the previous centroid with the updated one by computing, for each word in a fixed set of comparison words, the Spearman rho between the two models' rankings of its n most similar words.
4. Stop when every comparison word's rho exceeds the threshold, or when we hit the maximum number of runs.
def compare2models(m1, m2, wd, n=10):
    """
    compute spearman rho for the n closest words to wd
    """
    if wd not in m1.wv.vocab or wd not in m2.wv.vocab:
        return((float("NaN"), float("NaN")))  # oov
    tops1 = m1.wv.similar_by_word(wd, topn=n, restrict_vocab=None)
    tops2 = m2.wv.similar_by_word(wd, topn=n, restrict_vocab=None)
    # need the intersection since there might be differences in vocab size
    tops_both = set.intersection(set([x[0] for x in tops1]), set([x[0] for x in tops2]))
    ranks1 = [m1.wv.rank(wd, w) for w in tops_both]
    ranks2 = [m2.wv.rank(wd, w) for w in tops_both]
    return spearmanr(ranks1, ranks2)
def compare_models_words(m1, m2, wds, n=10):
    """
    compute spearman rho for the n closest words to each wd in wds
    """
    return([(wd, compare2models(m1, m2, wd, n)[0]) for wd in wds])
def update_centroid(m1, m2, n):
    """
    return m2 *modified* to be the weighted average (n*m1 + m2)/(n+1)
    i.e. we're using this to iteratively update an average
    """
    m2.wv.vectors = (n*m1.wv.vectors + m2.wv.vectors)/(n+1)
    m2.init_sims()
    return(m2)
def iterate_centroid(sents, wds, params=(2,5,20,2,0.001), sg=1, n=10, runs=5,
                     threshold=0.99, method="word2vec", show_progress=True):
    """
    iterate runs with the same parameters, comparing the n most similar words for each wd in wds
    params is: (min_count,window,size,workers,sample) = (2,5,20,2,0.001)
    after each run, average with the previous iteration to create a modified model
    threshold is the spearman rho that each of the wds must exceed between iterations
    method is either "word2vec" or "FastText"
    return a tuple of: the centroid model, whether we converged, and the first model
    """
    if method is None or method == "word2vec":
        modeler = models.Word2Vec
    elif method == "FastText":
        modeler = FastText
    else:
        raise ValueError("Unknown method: %s" % method)
    (min_count, window, size, workers, sample) = params
    firstm = modeler(sents, sg=sg, min_count=min_count, window=window,
                     sample=sample, size=size, workers=workers)
    prevm = firstm
    for i in range(runs):
        currm = modeler(sents, sg=sg, min_count=min_count, window=window,
                        sample=sample, size=size, workers=workers)
        # currm *must* be the second arg, since it gets modified by update_centroid
        newm = update_centroid(prevm, currm, i)
        total = 0
        met_thresh = True
        for x in compare_models_words(m1=prevm, m2=newm, wds=wds, n=n):
            total += x[1]
            met_thresh = met_thresh and (x[1] > threshold)
        if show_progress:
            print("Mean rho:\t%0.20f" % (total/len(wds)))
        if met_thresh:
            break
        prevm = newm
    return((newm, met_thresh, firstm))
def show_iterated_centroid(sents, wds, n=10, runs=40, threshold=0.99, method="word2vec"):
    """
    show the results of constructing an iterated centroid, using wds for the convergence check
    return the centroid, or the first model if it didn't converge
    """
    (m, OK, firstm) = iterate_centroid(sents, wds, n=n, runs=runs, threshold=threshold, method=method)
    if OK:
        print("\n%s centroid model" % method)
        vecs = m.wv
        for w in wds:
            print(w)
            if w not in vecs.vocab:
                print("\tOOV")
                continue
            for x in vecs.similar_by_word(w, topn=n, restrict_vocab=None):
                print("\t%s\t%f" % x)
            print()
    else:
        print("Didn't reach a spearman rho of %0.6f for every word" % threshold)
        print("\n%s first model" % method)
        vecs = firstm.wv
        for w in wds:
            print(w)
            if w not in vecs.vocab:
                print("\tOOV")
                continue
            for x in vecs.similar_by_word(w, topn=n, restrict_vocab=None):
                print("\t%s\t%f" % x)
            print()
        m = firstm
    return(m)
words = ["man", "woman", "pen", "fox", "sit", "about"]
word2 = "happy"
n = 5
nruns = 40
thresh = 0.99
method = "word2vec"
m = show_iterated_centroid(sents, words, n=n, runs=nruns, threshold=thresh, method=method)
# save it for future use
fnamev = fname + "-" + method + ".vecs"
m.wv.save(fnamev)
# we'll make another one for future use as well
(m2, _, _) = iterate_centroid(sents, words, params=(10,10,100,2,0.001), sg=1, n=10,
                              runs=5, threshold=0.99, method="word2vec", show_progress=False)
fnamev2 = fname + "-" + method + "-win10-dim100-thresh10.vecs"
m2.wv.save(fnamev2)
print("New word:", word2)
for x in m.wv.similar_by_word(word2, topn=n, restrict_vocab=None):
    print("\t%s\t%f" % x)
method="FastText"
m = show_iterated_centroid(sents,words, n=n, runs=nruns, threshold=thresh, method=method)
#save it for future use
fnamev = fname + "-" + method + ".vecs"
m.wv.save(fnamev)
print("New word:",word2)
for x in m.wv.similar_by_word(word2, topn=n, restrict_vocab=None):
print("\t%s\t%f" % x)
The differences between word2vec and FastText are quite striking, but they are not surprising, given that FastText is designed to find similarities among morphologically related words (e.g. happy and unhappy). Another thing to note is that in informal testing, FastText seems to converge to the desired threshold for the centroid in fewer iterations than word2vec. However, since FastText takes longer to run, there is no clear-cut speed advantage.
I should note that the convergence procedure used here is not guaranteed to converge, in particular when successive models remain further from the running centroid than the threshold allows. In practice, using uncommon words as the comparison words can lead to non-convergence, but using medium- to very common words seems to work well. An alternative to using the closest n words would be to compare the given word(s) to the entire vocabulary. In addition, instead of having a fixed set of comparison words, we could pick a random set of words, perhaps a different one at each iteration.
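A rough sketch of these two alternatives (my own illustration, using the same gensim API as above; compare2models_allvocab and random_probe_words are hypothetical helpers, not functions used elsewhere in this post):

import random
from scipy.stats import spearmanr

def compare2models_allvocab(m1, m2, wd):
    # variant of compare2models: rank-correlate wd's similarity to every
    # shared vocabulary word, rather than only the n closest words
    if wd not in m1.wv.vocab or wd not in m2.wv.vocab:
        return float("NaN")
    shared = [w for w in m1.wv.vocab if w in m2.wv.vocab and w != wd]
    sims1 = [m1.wv.similarity(wd, w) for w in shared]
    sims2 = [m2.wv.similarity(wd, w) for w in shared]
    return spearmanr(sims1, sims2)[0]

def random_probe_words(model, k=10, seed=None):
    # draw k random vocabulary words to use as the comparison words,
    # e.g. a fresh draw at each iteration
    rng = random.Random(seed)
    return rng.sample(list(model.wv.vocab), k)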
To sum up, using the (approximate) centroid model is a way to have the best of both worlds: we can keep the random aspects of the word vector approaches and still find a model that smooths over their instability. Given that it may take several iterations to find the centroid model, this approach may not be feasible for large corpora, but it is definitely feasible for small corpora.
[1] Johannes Hellrich and Udo Hahn. 2016. Bad Company—Neighborhoods in Neural Embedding Spaces Considered Harmful. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2785–2796, Osaka, Japan, December 11-17 2016.