# Stabilizing randomness

## Summary

This is one of a series of posts on using word vectors with small corpora. In this post I propose a technique to address the issue of the instability of word vector models due to their random aspects.


## Background

Most word vector approaches (e.g. word2vec, FastText, GloVe) give different results when trained multiple times on the same corpus with the same parameter settings. This is due to the random sampling done as part of the approach. For word2vec and related approaches, random sampling can occur in at least three places:

• within a window: words in the context may be sampled (uniformly or not) rather than exhaustively enumerated
• negative sampling: to create examples of a non-existent ("negative") context, words from outside the context are sampled
• downsampling: high frequency words may be sampled rather than exhaustively enumerated
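As a rough illustration of the third source, here is a minimal sketch of downsampling frequent words. It assumes the discard rule from the original word2vec paper, P(discard) = 1 - sqrt(t / f), where f is a word's relative frequency and t is the threshold (the role played by the `sample` parameter later in this post). Real implementations may use a somewhat different formula, and the function names and toy corpus below are made up for the example.

```python
import math
import random
from collections import Counter

def discard_prob(word_count, total_count, t=0.001):
    """P(discard) = 1 - sqrt(t / f), clipped at 0; f is the relative frequency."""
    f = word_count / total_count
    return max(0.0, 1.0 - math.sqrt(t / f))

def downsample(tokens, t=0.001, seed=None):
    """Randomly drop tokens of frequent words; sufficiently rare words are always kept."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    return [w for w in tokens
            if rng.random() >= discard_prob(counts[w], total, t)]

# Toy corpus (hypothetical): "the" is very frequent, "fox" is rare.
tokens = ["the"] * 900 + ["man"] * 95 + ["fox"] * 5

run_a = downsample(tokens, t=0.01, seed=1)
run_b = downsample(tokens, t=0.01, seed=2)  # different seed: a different sample
run_c = downsample(tokens, t=0.01, seed=1)  # same seed as run_a: the same sample

print(len(tokens), len(run_a), len(run_b), len(run_c))
```

With a different seed the training data itself changes from run to run, which is one reason the resulting vectors differ; with the same seed the sample is reproduced exactly.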

With high frequency items in large corpora, the difference in results across runs (by "runs" I mean runs with the same parameter settings for the rest of this post) may well be negligible or even non-existent. However, for lower frequency items, and for small corpora, the differences can be striking. We might say that these approaches are unstable.

Here's an example, where we use Vanity Fair as our corpus, with just under 311,000 words. We'll look for the 5 words that are most similar to man (high frequency), woman and pen (medium frequency), and fox (low frequency), and we'll compare them across 5 runs of word2vec with identical parameters.

In [1]:
# imports
from gensim import models
from gensim.models.fasttext import FastText
from scipy.stats import spearmanr

#for tables in Jupyter
from IPython.display import HTML, display
import tabulate

In [2]:
# read sentences
fname = 'vanity_fair_pg599.txt-sents-clean.txt'
with open(fname) as f:
    sents = [line.strip().split() for line in f.readlines()]
print('%s: %d sentences' % (fname, len(sents)))

vanity_fair_pg599.txt-sents-clean.txt: 13689 sentences

In [3]:
%%bash
# Info about the text and the target words

f=vanity_fair_pg599.txt-sents-clean.txt
wc $f
# note: grep -c counts matching lines, and also matches substrings (e.g. "man" inside "woman")
w=man
echo "$w:"
grep -c $w $f
w=woman
echo "$w:"
grep -c $w $f
w=pen
echo "$w:"
grep -c $w $f
w=fox
echo "$w:"
grep -c $w $f
w=happy #used later
echo "$w:"
grep -c $w $f

   13689  310721 1671293 vanity_fair_pg599.txt-sents-clean.txt
man:
1799
woman:
346
pen:
415
fox:
12
happy:
162

In [4]:
# make some word2vec models
num_runs = 5
sg = 1 #skip-gram
(min_count,window,size,workers,downsample) = (2,5,20,2,0.001)

wmodels = [models.Word2Vec(sents, sg=sg, min_count=min_count, window=window, sample=downsample, size=size, workers=workers) for i in range(num_runs)]

In [5]:
def compare_tops(vmodels,item,topn=5):
    # one row per model: the topn most similar words with their similarities
    tops = []
    for i,m in enumerate(vmodels):
        tops.append( ["Model %d" %i] + ["<b>%s</b> %0.4f" % x  for x in m.wv.similar_by_word(item, topn=topn, restrict_vocab=None)] )

    display(HTML(tabulate.tabulate(tops, tablefmt='html', headers=[item]+list(range(1,topn+1)))))

def compare_all(vmodels,item):
    sims = [m.wv.similar_by_word(item, topn=False) for m in vmodels]

    # lower triangle of pairwise Spearman correlations between models
    rhos = []
    for i,sim in enumerate(sims):
        row = ["<b>Model %d</b>" % (i+1)]
        for j in range(0,i+1):
            row.append("%0.4f" % spearmanr(sims[i],sims[j])[0])
        rhos.append(row)

    display(HTML(tabulate.tabulate(rhos, tablefmt='html', headers=[item]+["Model %d" % i for i in range(1,len(vmodels)+1)])))


### word2vec comparison for high frequency word: man

In [6]:
topn=5
wd = 'man'

display(HTML("<h4>%d most similar words</h4>" %topn))
compare_tops(wmodels,wd,topn=topn)

display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
compare_all(wmodels,wd)


#### 5 most similar words

| man | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Model 0 | gentleman 0.8581 | devil 0.8446 | character 0.8424 | fellow 0.8402 | nobleman 0.8380 |
| Model 1 | gentleman 0.8539 | nobleman 0.8519 | devil 0.8506 | character 0.8501 | sense 0.8466 |
| Model 2 | gentleman 0.8537 | devil 0.8529 | character 0.8503 | nobleman 0.8491 | true 0.8455 |
| Model 3 | devil 0.8532 | gentleman 0.8511 | character 0.8481 | nobleman 0.8474 | sense 0.8461 |
| Model 4 | devil 0.8511 | gentleman 0.8510 | character 0.8497 | nobleman 0.8474 | true 0.8435 |

#### Spearman rho correlation for ranking of all words compared to man

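| man | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| Model 1 | 1 | | | | |
| Model 2 | 0.9983 | 1 | | | |
| Model 3 | 0.9987 | 0.9988 | 1 | | |
| Model 4 | 0.9991 | 0.9987 | 0.9991 | 1 | |
| Model 5 | 0.9985 | 0.9985 | 0.9987 | 0.9994 | 1 |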

### word2vec comparison for medium frequency word: woman

In [7]:
topn=5
wd = 'woman'

display(HTML("<h4>%d most similar words</h4>" %topn))
compare_tops(wmodels,wd,topn=topn)

display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
compare_all(wmodels,wd)


#### 5 most similar words

| woman | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Model 0 | girl 0.9211 | heart 0.8999 | creature 0.8954 | soul 0.8808 | simple 0.8760 |
| Model 1 | girl 0.9217 | creature 0.8988 | heart 0.8985 | soul 0.8818 | simple 0.8813 |
| Model 2 | girl 0.9217 | heart 0.8981 | creature 0.8962 | soul 0.8816 | simple 0.8793 |
| Model 3 | girl 0.9221 | heart 0.8983 | creature 0.8957 | soul 0.8819 | simple 0.8816 |
| Model 4 | girl 0.9221 | heart 0.9034 | creature 0.8990 | soul 0.8879 | simple 0.8848 |

#### Spearman rho correlation for ranking of all words compared to woman

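| woman | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| Model 1 | 1 | | | | |
| Model 2 | 0.999 | 1 | | | |
| Model 3 | 0.999 | 0.9993 | 1 | | |
| Model 4 | 0.9994 | 0.9991 | 0.9993 | 1 | |
| Model 5 | 0.9991 | 0.9991 | 0.9992 | 0.9996 | 1 |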

### word2vec comparison for medium frequency word: pen

In [8]:
topn=5
wd = 'pen'

display(HTML("<h4>%d most similar words</h4>" %topn))
compare_tops(wmodels,wd,topn=topn)

display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
compare_all(wmodels,wd)


#### 5 most similar words

| pen | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Model 0 | permission 0.9862 | villain 0.9861 | console 0.9854 | conscience 0.9813 | strength 0.9810 |
| Model 1 | permission 0.9867 | console 0.9854 | villain 0.9853 | seek 0.9814 | conscience 0.9810 |
| Model 2 | permission 0.9859 | villain 0.9859 | console 0.9849 | conscience 0.9823 | seek 0.9809 |
| Model 3 | permission 0.9861 | villain 0.9857 | console 0.9856 | conscience 0.9815 | seek 0.9807 |
| Model 4 | villain 0.9859 | permission 0.9858 | console 0.9848 | conscience 0.9815 | seek 0.9805 |

#### Spearman rho correlation for ranking of all words compared to pen

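| pen | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| Model 1 | 1 | | | | |
| Model 2 | 0.9988 | 1 | | | |
| Model 3 | 0.9989 | 0.9992 | 1 | | |
| Model 4 | 0.9992 | 0.9989 | 0.9991 | 1 | |
| Model 5 | 0.9991 | 0.9989 | 0.9992 | 0.9997 | 1 |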

### word2vec comparison for low frequency word: fox

In [9]:
topn=5
wd = 'fox'

display(HTML("<h4>%d most similar words</h4>" %topn))
compare_tops(wmodels,wd,topn=topn)

display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
compare_all(wmodels,wd)


#### 5 most similar words

| fox | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Model 0 | reconcilement 0.9888 | pew 0.9887 | milor 0.9879 | balance 0.9878 | mild 0.9875 |
| Model 1 | reconcilement 0.9890 | pew 0.9884 | observation 0.9879 | diplomatist 0.9878 | evidently 0.9876 |
| Model 2 | reconcilement 0.9891 | pew 0.9884 | balance 0.9882 | observation 0.9881 | diplomatist 0.9880 |
| Model 3 | reconcilement 0.9884 | pew 0.9883 | balance 0.9873 | diplomatist 0.9872 | evidently 0.9871 |
| Model 4 | reconcilement 0.9887 | pew 0.9882 | milor 0.9873 | balance 0.9873 | observation 0.9872 |

#### Spearman rho correlation for ranking of all words compared to fox

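| fox | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| Model 1 | 1 | | | | |
| Model 2 | 0.9983 | 1 | | | |
| Model 3 | 0.9984 | 0.9989 | 1 | | |
| Model 4 | 0.9988 | 0.9987 | 0.9988 | 1 | |
| Model 5 | 0.9985 | 0.9984 | 0.9986 | 0.9995 | 1 |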

From these examples, we can see that although different runs of word2vec with the same parameters are similar (Spearman rho > 0.99), they are not identical. Not only do the similarity numbers differ, but even the relative ranking of the items may differ from one run to the next. For pen, even the most similar word differs across runs. These differences are problematic when we are trying to get an idea of how particular words are used in a small corpus: why should we choose one run over another?
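To get a feel for how Spearman's rho responds to such ranking differences, here is a small sketch that computes it by hand for two orderings of the same items, using the standard no-ties formula rho = 1 - 6 Σd² / (n(n² - 1)). The helper name is made up for the example; the tables in this post were computed with `scipy.stats.spearmanr`. The two lists are the top-5 neighbours of pen in Model 3 and Model 4 above, which differ only in the order of the first two words.

```python
def spearman_rho(order_a, order_b):
    """Spearman's rho for two orderings of the same distinct items (no ties)."""
    if len(set(order_a)) != len(order_a) or set(order_a) != set(order_b):
        raise ValueError("orderings must contain the same distinct items")
    n = len(order_a)
    if n < 2:
        raise ValueError("need at least two items")
    rank_b = {w: i for i, w in enumerate(order_b)}
    # sum of squared rank differences
    d2 = sum((i - rank_b[w]) ** 2 for i, w in enumerate(order_a))
    return 1 - 6 * d2 / (n * (n * n - 1))

model3 = ["permission", "villain", "console", "conscience", "seek"]
model4 = ["villain", "permission", "console", "conscience", "seek"]

rho = spearman_rho(model3, model4)
print(round(rho, 4))  # 0.9
```

A single swap among five items already lowers rho to 0.9, so short top-n lists are a much stricter test than the full-vocabulary correlations shown in the tables.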

## A simple idea

There are a few different ways we can avoid the instability of word2vec et al. One way is to use an approach that does not use randomization, e.g. the ppmi_svd approach, which we will return to in the next post. Another way is to fix a random seed, so that even though randomness is used, it is the same randomness every time we run the model. Yet another way is to eliminate (as much as possible) the use of randomness in an approach, e.g. by not downsampling, not using negative sampling, etc. (See [1] for a much more detailed discussion.) However, these "tricks" defeat the purpose of using randomness in the first place, and are thus unsatisfying.

Instead, we can use the small size of the corpora to our advantage. We can come up with a "consensus" model by taking the average of a number of models, i.e. finding the centroid of the models. We can either use a fixed number of models, or we can iteratively update the centroid model until its similarity across iterations reaches a threshold.

In the example below, we will iteratively update a centroid model as follows:

• We fix the parameters for our model.
• We pick some set of words as our words of interest.
• We pick some number n to find the most similar words for each of the words of interest.
• We compare models' similarity on the n most similar words to each of the words of interest using Spearman's rho on the ranks of those words.
• We iterate, updating the average of all the models each time. We compare the current average with the previous average, until the Spearman rho for every word is above a threshold, or we reach a limit on the number of iterations. NB: this is the average (centroid) of the whole model, not just of the words of interest, which are used only to judge convergence on the centroid.
• The final average model is what we can then use for further investigations.
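The averaging step in this procedure can be sketched as a running mean. This is a toy illustration on plain lists rather than on real models, and it assumes that every run produces vectors of the same shape with rows in the same vocabulary order. The function name and the numbers are made up.

```python
def running_mean(centroid, new_vectors, n_seen):
    """Given the mean of n_seen runs, return the mean of n_seen + 1 runs."""
    return [
        [(n_seen * c + x) / (n_seen + 1) for c, x in zip(c_row, x_row)]
        for c_row, x_row in zip(centroid, new_vectors)
    ]

# Three hypothetical runs: 2 words x 2 dimensions each.
runs = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 4.0]],
    [[5.0, 3.0], [3.0, 6.0]],
]

centroid = runs[0]
for n_seen, vectors in enumerate(runs[1:], start=1):
    centroid = running_mean(centroid, vectors, n_seen)

print(centroid)  # [[3.0, 1.0], [1.0, 4.0]]
```

The result equals the plain average of the three runs, but only the current centroid and the newest run need to be kept in memory, which is what makes it practical to keep iterating until the convergence test passes.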

## Iteratively update the centroid model

In [10]:
def compare2models(m1,m2,wd,n=10):
    """
    compute spearman r for the n closest words to wd
    """

    if wd not in m1.wv.vocab or wd not in m2.wv.vocab:
        return((float("NaN"),float("NaN"))) #oov

    tops1 = m1.wv.similar_by_word(wd, topn=n, restrict_vocab=None)
    tops2 = m2.wv.similar_by_word(wd, topn=n, restrict_vocab=None)

    tops_both = set.intersection(set([x[0] for x in tops1]), set([x[0] for x in tops2])) #need intersection since there might be differences in vocab size

    ranks1 = [m1.wv.rank(wd,w) for w in tops_both]
    ranks2 = [m2.wv.rank(wd,w) for w in tops_both]

    return spearmanr(ranks1, ranks2)

def compare_models_words(m1,m2,wds,n=10):
    """
    compute spearman r for the n closest words to each wd in wds
    """

    return( [(wd,compare2models(m1,m2,wd,n)[0]) for wd in wds] )

def update_centroid(m1,m2,n):
    """
    return m2 *modified* to be the weighted average of (n*m1 + m2)/(n+1)
    i.e. we're using this to iteratively update an average
    """

    m2.wv.vectors = (n*m1.wv.vectors + m2.wv.vectors)/(n+1)
    m2.init_sims()
    return(m2)

def iterate_centroid(sents,wds, params=(2,5,20,2,0.001), sg=1, n=10, runs=5, threshold=0.99, method="word2vec", show_progress=True):
    """
    iterate runs with same parameters on words with n most similar
    params is: (min_count,window,size,workers,sample) = (2,5,20,2,0.001)
    after each run, average with previous iteration to create modified model

    threshold is the spearman rho that every word in wds must exceed

    method is either "word2vec" or "FastText"

    return a tuple of: the centroid model, whether we converged, and the first model
    """

    # compare strings with ==, not "is" (identity of string literals is not guaranteed)
    if method is None or method == "word2vec":
        modeler = models.Word2Vec
    elif method == "FastText":
        modeler = FastText
    else:
        raise ValueError("Unknown method: %s" % method)

    (min_count,window,size,workers,sample) = params

    firstm = modeler(sents, sg=sg, min_count=min_count, window=window, sample=sample, size=size, workers=workers)
    prevm = firstm

    for i in range(runs):
        currm = modeler(sents, sg=sg, min_count=min_count, window=window, sample=sample, size=size, workers=workers)
        newm = update_centroid(prevm,currm,i) #currm *must* be second arg, since it gets modified by update_centroid

        total = 0
        met_thresh = True
        for x in compare_models_words(m1=prevm,m2=newm,wds=wds,n=n):
            total += x[1]
            met_thresh = met_thresh and (x[1] > threshold)
        if show_progress:
            print("Mean rho:\t%0.20f" % (total/len(wds)))

        if met_thresh:
            break

        prevm = newm
    return((newm,met_thresh, firstm))

def show_iterated_centroid(sents, wds, n=10, runs=40, threshold=0.99, method="word2vec"):
    """
    show the results of constructing an iterated centroid using wds for the convergence

    return the centroid, or the first model if it didn't converge
    """

    # use the function's own arguments rather than notebook globals
    (m,OK,firstm) = iterate_centroid(sents,wds, n=n, runs=runs, threshold=threshold, method=method)
    if OK:
        print("\n%s centroid model" % method)
        vecs = m.wv
        for w in wds:
            print(w)
            if w not in vecs.vocab:
                print("\tOOV")
                continue
            for x in vecs.similar_by_word(w, topn=n, restrict_vocab=None):
                print("\t%s\t%f" % x)
            print()
    else:
        print("Didn't get to each spearman rho of %0.6f" % threshold)
        print("\n%s first model" % method)
        vecs = firstm.wv
        for w in wds:
            print(w)
            if w not in vecs.vocab:
                print("\tOOV")
                continue
            for x in vecs.similar_by_word(w, topn=n, restrict_vocab=None):
                print("\t%s\t%f" % x)
            print()
        m=firstm

    return(m)

In [11]:
words=["man","woman","pen","fox","sit","about"]
word2="happy"
n = 5
nruns = 40
thresh = 0.99


### Example of centroid model with word2vec

In [12]:
method="word2vec"
m = show_iterated_centroid(sents, words, n=n, runs=nruns, threshold=thresh, method=method)

#save it for future use
fnamev = fname + "-" + method + ".vecs"
m.wv.save(fnamev)

#we'll make another one for future use as well
(m2,_,_) = iterate_centroid(sents,words, params=(10,10,100,2,0.001), sg=1, n=10, runs=5, threshold=0.99, method="word2vec", show_progress=False)
fnamev2 = fname + "-" + method + "-win10-dim100-thresh10.vecs"
m2.wv.save(fnamev2)

print("New word:",word2)
for x in m.wv.similar_by_word(word2, topn=n, restrict_vocab=None):
    print("\t%s\t%f" % x)

Mean rho:	0.75000000000000000000
Mean rho:	0.93333333333333312609
Mean rho:	0.96666666666666645202
Mean rho:	0.98333333333333328152
Mean rho:	0.96666666666666667407
Mean rho:	0.96666666666666645202
Mean rho:	0.99999999999999988898

word2vec centroid model
man
devil	0.853418
gentleman	0.852735
character	0.850795
nobleman	0.847673
sense	0.846414

woman
girl	0.921476
heart	0.902369
creature	0.897813
soul	0.883578
simple	0.880575

pen
permission	0.986083
villain	0.985740
console	0.985627
seek	0.981510
conscience	0.981340

fox
pew	0.988788
reconcilement	0.988715
balance	0.987476
observation	0.987269
diplomatist	0.987256

sit
dine	0.968200
fetch	0.967894
carry	0.957296
ride	0.954534
wait	0.953989

about
companions	0.863566
kindly	0.860787
doing	0.857813
talking	0.856786
whispered	0.855696

New word: happy
quiet	0.955341
possible	0.952116
thinking	0.948815
pleasant	0.948248
thoughts	0.939583


### Example of centroid model with FastText

In [13]:
method="FastText"
m = show_iterated_centroid(sents,words, n=n, runs=nruns, threshold=thresh, method=method)

#save it for future use
fnamev = fname + "-" + method + ".vecs"
m.wv.save(fnamev)

print("New word:",word2)
for x in m.wv.similar_by_word(word2, topn=n, restrict_vocab=None):
    print("\t%s\t%f" % x)

Mean rho:	0.94999999999999984457
Mean rho:	0.99999999999999988898

FastText centroid model
man
woman	0.941212
human	0.931187
nobleman	0.928819
irishwoman	0.921156

woman
womanhood	0.956260
human	0.953708
womankind	0.951916
kinswoman	0.951080
gentlewoman	0.949496

pen
pays	0.986974
risen	0.986107
hasten	0.985363
beaten	0.984248
forsaken	0.983837

fox
naivete	0.989489
fowl	0.989349
combat	0.987885
desks	0.987567
cot	0.987088

sit
wait	0.982080
drop	0.978920
sell	0.972062
stanhope	0.970823
run	0.968761

about
abode	0.907996
overhear	0.902718
tallow	0.902074
jabotiere	0.901981
howl	0.899920

New word: happy
unhappy	0.978972
possibly	0.970741
cleverly	0.969068
far	0.968470
probably	0.967524


## Discussion and Conclusion

The differences between word2vec and FastText are quite striking, but they are not surprising, given that FastText is designed to find similarities among morphologically related words (e.g. happy and unhappy). Another thing to note is that in informal testing, FastText seems to converge to the desired threshold for the centroid in fewer iterations than word2vec. However, since FastText takes longer to run, there is no clear-cut speed advantage.

I should note that the convergence procedure used here is not guaranteed to converge, in particular in the case when the generated models are further from the centroid than the threshold. In practice, using uncommon words can lead to non-convergence, but using medium to very common words seems to work well. An alternative to using the closest n words would be to compare the given word(s) to the entire vocabulary. In addition, instead of having a fixed set of comparison words, we could pick a random set of words instead, maybe at each iteration.
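The last suggestion could look something like the following sketch. The helper name and the frequency bounds are hypothetical; the idea is only to draw a fresh sample of medium to high frequency words to test convergence on at each iteration, leaving out the rare words that tend to prevent convergence. The toy counts are the line counts reported by `grep -c` earlier in this post.

```python
import random

def sample_probe_words(counts, k, min_count=50, max_count=None, rng=None):
    """Pick k distinct words whose count lies within [min_count, max_count]."""
    rng = rng or random.Random()
    eligible = sorted(
        w for w, c in counts.items()
        if c >= min_count and (max_count is None or c <= max_count)
    )
    if k > len(eligible):
        raise ValueError("only %d eligible words, asked for %d" % (len(eligible), k))
    return rng.sample(eligible, k)

counts = {"man": 1799, "woman": 346, "pen": 415, "fox": 12, "happy": 162}

# Seeded here only so that the example is repeatable.
probe = sample_probe_words(counts, k=2, min_count=100, rng=random.Random(0))
print(probe)
```

In the iterative procedure, a call like this would replace the fixed list of words of interest at the start of each iteration.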

To sum up, using the (approximate) centroid model is a way to have the best of both worlds: we can use the random aspects of the word vector approaches and still find a model that smooths over their instability. Given that it may take several iterations to find the centroid model, this approach may not be feasible for large corpora, but it is definitely feasible for small corpora.


## References

[1] Johannes Hellrich and Udo Hahn. 2016. Bad Company—Neighborhoods in Neural Embedding Spaces Considered Harmful. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2785–2796, Osaka, Japan, December 11-17 2016.