Word vectors with small corpora:

Stabilizing randomness

© 2018 Chris Culy, March 2018

chrisculy.net

Summary

This is one of a series of posts on using word vectors with small corpora. In this post I propose a technique to address the issue of the instability of word vector models due to their random aspects.

Background

Most word vector approaches (e.g. word2vec, FastText, GloVe) give different results when trained multiple times on the same corpus with the same parameter settings. This is due to the random sampling done as part of the approach. For word2vec and related approaches, the random sampling can occur in at least 3 places:

  • within a window: words in the context may be sampled (uniformly or not) rather than exhaustively enumerated
  • negative sampling: to create examples of a non-existent ("negative") context, words from outside the context are sampled
  • downsampling: occurrences of high frequency words may be randomly sampled rather than exhaustively enumerated
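
To make the third kind of sampling concrete, here is a minimal sketch of randomly thinning out high frequency words. It assumes the discard probability 1 - sqrt(t / f) from the original word2vec paper, where f is a word's relative frequency and t is the sampling threshold; actual implementations may use a variant, and the toy corpus and the function names are made up for illustration.

```python
import math
import random
from collections import Counter

def discard_prob(freq, t=0.001):
    """Probability of dropping a token whose relative corpus frequency is freq.

    Assumes 1 - sqrt(t / freq), clipped at 0; real implementations may differ.
    """
    return max(0.0, 1.0 - math.sqrt(t / freq))

def downsample(tokens, t=0.001, seed=None):
    """Return the tokens that survive one random downsampling pass."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    return [w for w in tokens
            if rng.random() >= discard_prob(counts[w] / total, t)]

# Toy corpus (made up): one very frequent word and 100 words that occur once
toy = ["the"] * 900 + ["fox%d" % i for i in range(100)]
run1 = downsample(toy, seed=1)
run2 = downsample(toy, seed=2)

# How many occurrences of "the" survive depends on the seed
print(run1.count("the"), run2.count("the"))
# The rare words have discard probability 0, so they are always kept
print(sum(w != "the" for w in run1))
```

Each training run sees a slightly different version of the corpus, which is one source of the differences across runs discussed below.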

With high frequency items in large corpora, the difference in results across runs (for the rest of this post, "runs" means runs with the same parameter settings) may well be negligible or even non-existent. However, for lower frequency items, and for small corpora, the differences can be striking. We might say that these approaches are unstable.

Here's an example, where we use Vanity Fair as our corpus, with just under 311,000 words. We'll look for the 5 words that are the most similar to man (high frequency), woman and pen (medium frequency), and fox (low frequency), and we'll compare them across 5 runs of word2vec with identical parameters.

In [1]:
# imports
from gensim import models
from gensim.models.fasttext import FastText
from scipy.stats import spearmanr

#for tables in Jupyter
from IPython.display import HTML, display
import tabulate
In [2]:
# read sentences
fname = 'vanity_fair_pg599.txt-sents-clean.txt'
with open(fname) as f:
    sents = [line.strip().split() for line in f.readlines()]
print('%s: %d sentences' % (fname, len(sents)))
vanity_fair_pg599.txt-sents-clean.txt: 13689 sentences
In [3]:
%%bash
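# Info about the text and the target words
# NB: grep -c counts the lines that contain the string anywhere, so e.g. "pen" also
# matches "open" and "spend"; the numbers below are only rough guides to word frequency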

f=vanity_fair_pg599.txt-sents-clean.txt
wc $f
w=man
echo "$w:"
grep -c $w $f
w=woman
echo "$w:"
grep -c $w $f
w=pen
echo "$w:"
grep -c $w $f
w=fox
echo "$w:"
grep -c $w $f
w=happy #used later
echo "$w:"
grep -c $w $f
   13689  310721 1671293 vanity_fair_pg599.txt-sents-clean.txt
man:
1799
woman:
346
pen:
415
fox:
12
happy:
162
In [4]:
# make some word2vec models
num_runs = 5
sg = 1 #skip-gram
# NB: `size` is the gensim 3.x parameter name (gensim 4 renamed it to `vector_size`)
(min_count,window,size,workers,downsample) = (2,5,20,2,0.001) 

wmodels = [models.Word2Vec(sents, sg=sg, min_count=min_count, window=window, sample=downsample, size=size, workers=workers) for i in range(num_runs)]
In [5]:
def compare_tops(vmodels,item,topn=5):    
    tops = []
    for i,m in enumerate(vmodels):
        tops.append( ["Model %d" %i] + ["<b>%s</b> %0.4f" % x  for x in m.wv.similar_by_word(item, topn=topn, restrict_vocab=None)] )

    display(HTML(tabulate.tabulate(tops, tablefmt='html', headers=[item]+list(range(1,topn+1)))))

def compare_all(vmodels,item):
    sims = [m.wv.similar_by_word(item, topn=False) for m in vmodels]
    
    rhos = []
    for i,sim in enumerate(sims):
        row = ["<b>Model %d</b>" % (i+1)]
        for j in range(0,i+1):
            row.append("%0.4f" % spearmanr(sims[i],sims[j])[0])
        rhos.append(row)
      
    # one header column per model (rather than relying on a global topn)
    display(HTML(tabulate.tabulate(rhos, tablefmt='html', headers=[item]+["Model %d" % i for i in range(1,len(vmodels)+1)])))

word2vec comparison for high frequency word: man

In [6]:
topn=5
wd = 'man'

display(HTML("<h4>%d most similar words</h4>" %topn))
compare_tops(wmodels,wd,topn=topn)

display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
compare_all(wmodels,wd)

5 most similar words

man      1                 2                 3                 4                 5
Model 0  gentleman 0.8581  devil 0.8446      character 0.8424  fellow 0.8402     nobleman 0.8380
Model 1  gentleman 0.8539  nobleman 0.8519   devil 0.8506      character 0.8501  sense 0.8466
Model 2  gentleman 0.8537  devil 0.8529      character 0.8503  nobleman 0.8491   true 0.8455
Model 3  devil 0.8532      gentleman 0.8511  character 0.8481  nobleman 0.8474   sense 0.8461
Model 4  devil 0.8511      gentleman 0.8510  character 0.8497  nobleman 0.8474   true 0.8435

Spearman rho correlation for ranking of all words compared to man

man Model 1 Model 2 Model 3 Model 4 Model 5
Model 1 1
Model 2 0.9983 1
Model 3 0.9987 0.9988 1
Model 4 0.9991 0.9987 0.9991 1
Model 5 0.9985 0.9985 0.9987 0.9994 1

word2vec comparison for medium frequency word: woman

In [7]:
topn=5
wd = 'woman'

display(HTML("<h4>%d most similar words</h4>" %topn))
compare_tops(wmodels,wd,topn=topn)

display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
compare_all(wmodels,wd)

5 most similar words

woman    1                2                3                4                5
Model 0  girl 0.9211      heart 0.8999     creature 0.8954  soul 0.8808      simple 0.8760
Model 1  girl 0.9217      creature 0.8988  heart 0.8985     soul 0.8818      simple 0.8813
Model 2  girl 0.9217      heart 0.8981     creature 0.8962  soul 0.8816      simple 0.8793
Model 3  girl 0.9221      heart 0.8983     creature 0.8957  soul 0.8819      simple 0.8816
Model 4  girl 0.9221      heart 0.9034     creature 0.8990  soul 0.8879      simple 0.8848

Spearman rho correlation for ranking of all words compared to woman

woman Model 1 Model 2 Model 3 Model 4 Model 5
Model 1 1
Model 2 0.999 1
Model 3 0.999 0.9993 1
Model 4 0.9994 0.9991 0.9993 1
Model 5 0.9991 0.9991 0.9992 0.9996 1

word2vec comparison for medium frequency word: pen

In [8]:
topn=5
wd = 'pen'

display(HTML("<h4>%d most similar words</h4>" %topn))
compare_tops(wmodels,wd,topn=topn)

display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
compare_all(wmodels,wd)

5 most similar words

pen      1                  2                  3                  4                  5
Model 0  permission 0.9862  villain 0.9861     console 0.9854     conscience 0.9813  strength 0.9810
Model 1  permission 0.9867  console 0.9854     villain 0.9853     seek 0.9814        conscience 0.9810
Model 2  permission 0.9859  villain 0.9859     console 0.9849     conscience 0.9823  seek 0.9809
Model 3  permission 0.9861  villain 0.9857     console 0.9856     conscience 0.9815  seek 0.9807
Model 4  villain 0.9859     permission 0.9858  console 0.9848     conscience 0.9815  seek 0.9805

Spearman rho correlation for ranking of all words compared to pen

pen Model 1 Model 2 Model 3 Model 4 Model 5
Model 1 1
Model 2 0.9988 1
Model 3 0.9989 0.9992 1
Model 4 0.9992 0.9989 0.9991 1
Model 5 0.9991 0.9989 0.9992 0.9997 1

word2vec comparison for low frequency word: fox

In [9]:
topn=5
wd = 'fox'

display(HTML("<h4>%d most similar words</h4>" %topn))
compare_tops(wmodels,wd,topn=topn)

display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
compare_all(wmodels,wd)

5 most similar words

fox      1                     2                     3                     4                     5
Model 0  reconcilement 0.9888  pew 0.9887            milor 0.9879          balance 0.9878        mild 0.9875
Model 1  reconcilement 0.9890  pew 0.9884            observation 0.9879    diplomatist 0.9878    evidently 0.9876
Model 2  reconcilement 0.9891  pew 0.9884            balance 0.9882        observation 0.9881    diplomatist 0.9880
Model 3  reconcilement 0.9884  pew 0.9883            balance 0.9873        diplomatist 0.9872    evidently 0.9871
Model 4  reconcilement 0.9887  pew 0.9882            milor 0.9873          balance 0.9873        observation 0.9872

Spearman rho correlation for ranking of all words compared to fox

fox Model 1 Model 2 Model 3 Model 4 Model 5
Model 1 1
Model 2 0.9983 1
Model 3 0.9984 0.9989 1
Model 4 0.9988 0.9987 0.9988 1
Model 5 0.9985 0.9984 0.9986 0.9995 1

From these examples, we can see that although different runs of word2vec with the same parameters are similar (Spearman rho > 0.99), they are not identical. Not only do the similarity numbers differ, but even the relative ranking of the items may differ from one run to the next. For man and pen, even the most similar word differs across runs. These differences are problematic when we are trying to get an idea of how particular words are used in a small corpus: why should we choose one run over another?
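
It may seem odd that rho can be so close to 1 while the top of the ranking changes. The sketch below illustrates why, using the textbook formula for Spearman's rho on tie-free rankings and a hypothetical vocabulary of 1000 words: a couple of swaps among the nearest neighbours barely move a correlation that is computed over the whole vocabulary. The numbers are invented for illustration and are not taken from the models above.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rho for two tie-free rankings of the same items."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Position i holds the rank of (hypothetical) word i in each run
run1 = list(range(1, 1001))
run2 = list(run1)
run2[0], run2[1] = run2[1], run2[0]   # the two nearest neighbours swap places
run2[4], run2[9] = run2[9], run2[4]   # the 5th neighbour drops to 10th place

print(spearman_rho(run1, run2))       # very close to 1
print(run1[:5] == run2[:5])           # False: the top of the ranking changed
```

This is one reason to judge stability on the n most similar words only, as in the procedure described later, rather than on the full ranking.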

A simple idea

There are a few different ways we can avoid the instability of word2vec et al. One way is to use an approach that does not use randomization, e.g. the ppmi_svd approach, which we will return to in the next post. Another way is to fix a random seed, so that even though randomness is used, it is the same randomness every time we run the model. Yet another way is to eliminate (as much as possible) the use of randomness in an approach, e.g. by not downsampling, not using negative sampling, etc. (See [1] for much more detailed discussion.) However, these "tricks" defeat the purpose of using randomness in the first place, and are thus unsatisfying.

Instead, we can use the small size of the corpora to our advantage. Since training on a small corpus is fast, we can come up with a "consensus" model by taking the average of a number of models, i.e. finding the centroid of the models. We can either use a fixed number of models, or we can iteratively update the centroid model by setting a threshold for similarity across iterations.

In the example below, we will iteratively update a centroid model as follows:

  • We fix the parameters for our model.
  • We pick some set of words as our words of interest.
  • We pick some number n to find the most similar words for each of the words of interest.
  • We compare models' similarity on the n most similar words to each of the words of interest using Spearman's rho on the ranks of those words.
  • We iterate, updating the average of all the models each time. We compare the current average with the previous average, until the Spearman rho for every word is above a threshold, or we reach a limit on the number of iterations. NB: this is the average (centroid) for the whole model, not just for the words of interest. Those words are only used to judge the convergence on the centroid.
  • The final average model is what we can then use for further investigations.
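
The averaging step in the procedure above can be sketched without any word vector library. The example below is only an illustration: the three "models" are made-up 2 word by 2 dimension matrices, update_mean is a hypothetical helper, and it assumes that every model lists the same words in the same row order. It shows that folding in one model at a time gives the same result as averaging all of the models at once.

```python
def update_mean(prev_mean, new_vecs, k):
    """Fold new_vecs into prev_mean, which already averages k models."""
    return [[(k * p + v) / (k + 1) for p, v in zip(prow, vrow)]
            for prow, vrow in zip(prev_mean, new_vecs)]

# Three made-up "models": one row per word, same word order in every model
runs = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 1.0]],
    [[2.0, 1.0], [1.0, 4.0]],
]

centroid = runs[0]
for k, vecs in enumerate(runs[1:], start=1):
    centroid = update_mean(centroid, vecs, k)

# Averaging all three models in one go, for comparison
batch = [[sum(r[w][d] for r in runs) / len(runs) for d in range(2)]
         for w in range(2)]

print(centroid)
print(batch)
```

Because the update only needs the previous average and a count, no earlier models have to be kept in memory while iterating.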

Iteratively update the centroid model

In [10]:
def compare2models(m1,m2,wd,n=10):
    """
    compute spearman r for the n closest words to wd
    """
    
    if wd not in m1.wv.vocab or wd not in m2.wv.vocab:
        return((float("NaN"),float("NaN"))) #oov
    
    tops1 = m1.wv.similar_by_word(wd, topn=n, restrict_vocab=None)
    tops2 = m2.wv.similar_by_word(wd, topn=n, restrict_vocab=None)
    
    tops_both = set.intersection(set([x[0] for x in tops1]), set([x[0] for x in tops2])) #need intersection since there might be differences in vocab size
    
    ranks1 = [m1.wv.rank(wd,w) for w in tops_both]
    ranks2 = [m2.wv.rank(wd,w) for w in tops_both]
    
    return spearmanr(ranks1, ranks2)
    
def compare_models_words(m1,m2,wds,n=10):
    """
    compute spearman r for the n closest words to each wd in wds
    """
    
    return( [(wd,compare2models(m1,m2,wd,n)[0]) for wd in wds] )
    
    
def update_centroid(m1,m2,n):
    """
    return m2 *modified* to be the weighted average of (n*m1 + m2)/(n+1)
    i.e. we're using this to iteratively update an average
    """
    
    m2.wv.vectors = (n*m1.wv.vectors + m2.wv.vectors)/(n+1)
    m2.init_sims()
    return(m2)
    
def iterate_centroid(sents,wds, params=(2,5,20,2,0.001), sg=1, n=10, runs=5, threshold=0.99, method="word2vec", show_progress=True):
    """
    iterate runs with same parameters on words with n most similar
        params is: (min_count,window,size,workers,sample) = (2,5,20,2,0.001) 
    after each run, average with previous iteration to create modified model
    
    threshold is the spearman rho that every wd in wds must exceed for convergence
    
    method is either "word2vec" or "FastText"
    
    return a tuple of: the centroid model, whether we converged, and the first model
    """
    
    # compare strings with ==, not "is" (identity of equal strings is not guaranteed)
    if method is None or method == "word2vec":
        modeler = models.Word2Vec
    elif method == "FastText":
        modeler = FastText
    else:
        raise ValueError("Unknown method: %s" % method)
    
    (min_count,window,size,workers,sample) = params
    
    firstm = modeler(sents, sg=sg, min_count=min_count, window=window, sample=sample, size=size, workers=workers)
    prevm = firstm
    
    for i in range(runs):
        currm = modeler(sents, sg=sg, min_count=min_count, window=window, sample=sample, size=size, workers=workers)
        # NB: the weight i is 0 on the first pass, so firstm only serves as the first
        # point of comparison and the centroid averages the later runs
        newm = update_centroid(prevm,currm,i) #currm *must* be second arg, since it gets modified by update_centroid
        
        total = 0
        met_thresh = True
        for x in compare_models_words(m1=prevm,m2=newm,wds=wds,n=n):
            total += x[1]
            met_thresh = met_thresh and (x[1] > threshold)
        if show_progress:
            print("Mean rho:\t%0.20f" % (total/len(wds)))

        if met_thresh:
            break
        
        prevm = newm
    return((newm,met_thresh, firstm))

def show_iterated_centroid(sents, wds, n=10, runs=40, threshold=0.99, method="word2vec"):
    """
    show the results of constructing an iterated centroid using wds for the convergence
    
    return the centroid, or the first model if it didn't converge
    """
    
    # use the function's own arguments rather than the notebook globals
    (m,OK,firstm) = iterate_centroid(sents,wds, n=n, runs=runs, threshold=threshold, method=method)
    if OK:
        print("\n%s centroid model" % method)
    else:
        print("Didn't get to each spearman rho of %0.6f" % threshold)
        print("\n%s first model" % method)
        m=firstm
    
    # show the n most similar words for each word of interest
    vecs = m.wv
    for w in wds:
        print(w)
        if w not in vecs.vocab:
            print("\tOOV")
            continue
        for x in vecs.similar_by_word(w, topn=n, restrict_vocab=None):
            print("\t%s\t%f" % x)
        print()
    
    return(m)
In [11]:
words=["man","woman","pen","fox","sit","about"]
word2="happy"
n = 5
nruns = 40
thresh = 0.99

Example of centroid model with word2vec

In [12]:
method="word2vec"
m = show_iterated_centroid(sents, words, n=n, runs=nruns, threshold=thresh, method=method)

#save it for future use
fnamev = fname + "-" + method + ".vecs"
m.wv.save(fnamev)

#we'll make another one for future use as well
(m2,_,_) = iterate_centroid(sents,words, params=(10,10,100,2,0.001), sg=1, n=10, runs=5, threshold=0.99, method="word2vec", show_progress=False)
fnamev2 = fname + "-" + method + "-win10-dim100-thresh10.vecs"
m2.wv.save(fnamev2)


print("New word:",word2)
for x in m.wv.similar_by_word(word2, topn=n, restrict_vocab=None):
    print("\t%s\t%f" % x)
Mean rho:	0.75000000000000000000
Mean rho:	0.93333333333333312609
Mean rho:	0.96666666666666645202
Mean rho:	0.98333333333333328152
Mean rho:	0.96666666666666667407
Mean rho:	0.96666666666666645202
Mean rho:	0.99999999999999988898

word2vec centroid model
man
	devil	0.853418
	gentleman	0.852735
	character	0.850795
	nobleman	0.847673
	sense	0.846414

woman
	girl	0.921476
	heart	0.902369
	creature	0.897813
	soul	0.883578
	simple	0.880575

pen
	permission	0.986083
	villain	0.985740
	console	0.985627
	seek	0.981510
	conscience	0.981340

fox
	pew	0.988788
	reconcilement	0.988715
	balance	0.987476
	observation	0.987269
	diplomatist	0.987256

sit
	dine	0.968200
	fetch	0.967894
	carry	0.957296
	ride	0.954534
	wait	0.953989

about
	companions	0.863566
	kindly	0.860787
	doing	0.857813
	talking	0.856786
	whispered	0.855696

New word: happy
	quiet	0.955341
	possible	0.952116
	thinking	0.948815
	pleasant	0.948248
	thoughts	0.939583

Example of centroid model with FastText

In [13]:
method="FastText"
m = show_iterated_centroid(sents,words, n=n, runs=nruns, threshold=thresh, method=method)

#save it for future use
fnamev = fname + "-" + method + ".vecs"
m.wv.save(fnamev)

print("New word:",word2)
for x in m.wv.similar_by_word(word2, topn=n, restrict_vocab=None):
    print("\t%s\t%f" % x)
Mean rho:	0.94999999999999984457
Mean rho:	0.99999999999999988898

FastText centroid model
man
	woman	0.941212
	madman	0.937547
	human	0.931187
	nobleman	0.928819
	irishwoman	0.921156

woman
	womanhood	0.956260
	human	0.953708
	womankind	0.951916
	kinswoman	0.951080
	gentlewoman	0.949496

pen
	pays	0.986974
	risen	0.986107
	hasten	0.985363
	beaten	0.984248
	forsaken	0.983837

fox
	naivete	0.989489
	fowl	0.989349
	combat	0.987885
	desks	0.987567
	cot	0.987088

sit
	wait	0.982080
	drop	0.978920
	sell	0.972062
	stanhope	0.970823
	run	0.968761

about
	abode	0.907996
	overhear	0.902718
	tallow	0.902074
	jabotiere	0.901981
	howl	0.899920

New word: happy
	unhappy	0.978972
	possibly	0.970741
	cleverly	0.969068
	far	0.968470
	probably	0.967524

Discussion and Conclusion

The differences between word2vec and FastText are quite striking, but they are not surprising, given that FastText is designed to find similarities among morphologically related words (e.g. happy and unhappy). Another thing to note is that in informal testing, FastText seems to converge to the desired threshold for the centroid in fewer iterations than word2vec. However, since FastText takes longer to run, there is no clear-cut speed advantage.
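
To see why FastText groups words like happy and unhappy, it helps to look at the character n-grams that the two words share, since FastText builds a word's vector from its subword n-grams. The sketch below is a simplified illustration rather than FastText's actual code: it assumes n-gram lengths of 3 to 6 and the use of < and > as word boundary markers, and char_ngrams is a hypothetical helper.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of word, with < and > marking the word boundaries."""
    marked = "<" + word + ">"
    return {marked[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(marked) - n + 1)}

# n-grams that "happy" and "unhappy" have in common
shared = char_ngrams("happy") & char_ngrams("unhappy")
print(sorted(shared))
```

The same mechanism would explain neighbours that share spelling rather than meaning, such as pen and hasten in the output above.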

I should note that the convergence procedure used here is not guaranteed to converge, in particular in the case when the generated models are further from the centroid than the threshold. In practice, using uncommon words can lead to non-convergence, but using medium to very common words seems to work well. An alternative to using the closest n words would be to compare the given word(s) to the entire vocabulary. In addition, instead of having a fixed set of comparison words, we could pick a random set of words instead, maybe at each iteration.
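
The last suggestion, picking a random set of comparison words, could look something like the sketch below. It is only one possible reading of the idea: the function name, the minimum frequency cutoff (used here to stay with medium to very common words) and the toy sentences are all assumptions made for illustration.

```python
import random
from collections import Counter

def sample_comparison_words(sents, k=6, min_count=100, seed=None):
    """Pick up to k random words that occur at least min_count times in sents."""
    counts = Counter(w for sent in sents for w in sent)
    frequent = sorted(w for w, c in counts.items() if c >= min_count)
    rng = random.Random(seed)
    return rng.sample(frequent, min(k, len(frequent)))

# Toy sentences (made up); with a real corpus, use the tokenized sentences
toy_sents = [["the", "man", "sat"], ["the", "woman", "sat"], ["the", "fox", "ran"]]

# Draw a fresh set of comparison words for each iteration
for iteration in range(3):
    print(sample_comparison_words(toy_sents, k=2, min_count=2, seed=iteration))
```

Whether resampling the words at each iteration helps or hinders convergence would need to be tested.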

To sum up, using the (approximate) centroid model is a way to have the best of both worlds: we can use the random aspects of the word vector approaches and find a model that smooths over the instability of the approaches. Given that it may take several iterations to find the centroid model, this approach may not be feasible for large corpora, but it is definitely feasible for small corpora.

References

[1] Johannes Hellrich and Udo Hahn. 2016. Bad Company—Neighborhoods in Neural Embedding Spaces Considered Harmful. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2785–2796, Osaka, Japan, December 11-17 2016.