This is one of a series of posts on using word vectors with small corpora. In this post I propose a technique to address the issue of the instability of word vector models due to their random aspects.
Most word vector approaches (e.g. word2vec, FastText, GloVe, etc.) will give different results when trained multiple times on the same corpus with the same parameter settings. This is due to the random sampling done as part of the approach. For word2vec and related approaches, the random sampling can occur in at least 3 places:
- the random initialization of the word vectors before training,
- the downsampling of high-frequency words (gensim's sample parameter), and
- negative sampling, which draws random "noise" words for each target/context pair.
With high-frequency items in large corpora, the difference in results across runs (for the rest of this post, "runs" will mean runs with the same parameter settings) may well be negligible or even non-existent. However, for lower-frequency items, and for small corpora, the differences can be striking. We might say that these approaches are unstable.
Here's an example, where we use Vanity Fair as our corpus, with just under 311,000 words. We'll look for the 5 words most similar to man (high frequency), woman and pen (medium frequency), and fox (low frequency), and we'll compare them across 5 runs of word2vec with identical parameters.
# imports
from gensim import models
from gensim.models.fasttext import FastText
from scipy.stats import spearmanr
# for tables in Jupyter
from IPython.display import HTML, display
import tabulate
# read sentences
fname = 'vanity_fair_pg599.txt-sents-clean.txt'
with open(fname) as f:
    sents = [line.strip().split() for line in f.readlines()]
print('%s: %d sentences' % (fname, len(sents)))
%%bash
# Info about the text and the target words
f=vanity_fair_pg599.txt-sents-clean.txt
wc $f
for w in man woman pen fox happy; do  # happy is used later
    echo "$w:"
    grep -c $w $f
done
# make some word2vec models
num_runs = 5
sg = 1  # 1 = skip-gram
(min_count, window, size, workers, downsample) = (2, 5, 20, 2, 0.001)
wmodels = [models.Word2Vec(sents, sg=sg, min_count=min_count, window=window,
                           sample=downsample, size=size, workers=workers)
           for i in range(num_runs)]
def compare_tops(vmodels, item, topn=5):
    """Show an HTML table of the topn most similar words to item for each model."""
    tops = []
    for i, m in enumerate(vmodels):
        tops.append(["Model %d" % i]
                    + ["<b>%s</b> %0.4f" % x
                       for x in m.wv.similar_by_word(item, topn=topn, restrict_vocab=None)])
    display(HTML(tabulate.tabulate(tops, tablefmt='html',
                                   headers=[item] + list(range(1, topn + 1)))))
def compare_all(vmodels, item):
    """Show a table of pairwise Spearman rho between models, comparing the ranking
    of all vocabulary words by similarity to item."""
    sims = [m.wv.similar_by_word(item, topn=False) for m in vmodels]
    rhos = []
    for i, sim in enumerate(sims):
        row = ["<b>Model %d</b>" % (i + 1)]
        for j in range(0, i + 1):
            row.append("%0.4f" % spearmanr(sims[i], sims[j])[0])
        rhos.append(row)
    display(HTML(tabulate.tabulate(rhos, tablefmt='html',
                                   headers=[item] + ["Model %d" % i
                                                     for i in range(1, len(vmodels) + 1)])))
topn = 5
for wd in ['man', 'woman', 'pen', 'fox']:
    display(HTML("<h4>%d most similar words</h4>" % topn))
    compare_tops(wmodels, wd, topn=topn)
    display(HTML("<h4>Spearman rho correlation for ranking of all words compared to %s</h4>" % wd))
    compare_all(wmodels, wd)
From these examples, we can see that although different runs of word2vec with the same parameters are similar (Spearman rho > 0.99), they are not identical. Not only do the similarity numbers differ, but even the relative ranking of the items may differ from one run to the next. For pen, even the most similar word differs across runs. These differences are problematic when we are trying to get an idea of how particular words are used in a small corpus: why should we choose one run over another?
There are a few different ways we can avoid the instability of word2vec et al. One way is to use an approach that does not use randomization, e.g. the ppmi_svd approach, which we will return to in the next post. Another way is to fix a random seed, so that even though randomness is used, it is the same randomness every time we run the model. Yet another way is to eliminate (as much as possible) the use of randomness in an approach, e.g. by not downsampling, not using negative sampling, etc. (See [1] for a much more detailed discussion.) However, these "tricks" defeat the purpose of using randomness in the first place, and are thus unsatisfying.
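As a minimal sketch of the fixed-seed trick (my own illustration, not code used later in this post): gensim's Word2Vec accepts a seed parameter, and for runs to be fully reproducible you also need a single worker thread (and, with gensim 3.x under Python 3, a fixed PYTHONHASHSEED in the environment).

# Sketch of the fixed-seed workaround, reusing sents and the parameter
# values from above.  A seed alone is not enough: thread scheduling with
# workers > 1 still introduces run-to-run differences.
from gensim import models

seeded = models.Word2Vec(sents, sg=1, min_count=2, window=5, size=20,
                         sample=0.001, seed=42, workers=1)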
However, we can use the small size of the corpora to our advantage. We can come up with a "consensus" model by taking the average of a number of models, i.e. finding the centroid of the models. We can either use a fixed number of models, or we can iteratively update the centroid model by setting a threshold for similarity across iterations.
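For the fixed-number variant, here is a minimal sketch (my own illustration, not the code used below); it assumes all models were trained on the same corpus with the same min_count, so their vocabularies line up index for index:

import numpy as np

def centroid_of_models(model_list):
    # average the word vectors of several identically-parameterized models
    # (assumes identical vocabularies in identical index order across models)
    base = model_list[0]
    base.wv.vectors = np.mean([m.wv.vectors for m in model_list], axis=0)
    base.wv.vectors_norm = None  # force the normalized vectors to be recomputed
    base.wv.init_sims()
    return base

# e.g. centroid = centroid_of_models(wmodels)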
In the example below, we will iteratively update a centroid model as follows:
1. Train an initial model on the corpus.
2. At each iteration, train a new model with the same parameters and average its vectors into the running centroid.
3. Compare the previous centroid with the updated one by computing, for each word in a fixed set of comparison words, the Spearman rho between the two models' rankings of its n most similar words.
4. Stop when every comparison word's rho exceeds the threshold, or when we hit the maximum number of runs.
def compare2models(m1, m2, wd, n=10):
    """
    compute spearman rho for the n closest words to wd
    """
    if wd not in m1.wv.vocab or wd not in m2.wv.vocab:
        return((float("NaN"), float("NaN")))  # oov
    tops1 = m1.wv.similar_by_word(wd, topn=n, restrict_vocab=None)
    tops2 = m2.wv.similar_by_word(wd, topn=n, restrict_vocab=None)
    # need the intersection since there might be differences in vocab size
    tops_both = set.intersection(set([x[0] for x in tops1]), set([x[0] for x in tops2]))
    ranks1 = [m1.wv.rank(wd, w) for w in tops_both]
    ranks2 = [m2.wv.rank(wd, w) for w in tops_both]
    return spearmanr(ranks1, ranks2)
def compare_models_words(m1, m2, wds, n=10):
    """
    compute spearman rho for the n closest words to each wd in wds
    """
    return([(wd, compare2models(m1, m2, wd, n)[0]) for wd in wds])
def update_centroid(m1, m2, n):
    """
    return m2 *modified* to be the weighted average (n*m1 + m2)/(n+1)
    i.e. we're using this to iteratively update an average
    """
    m2.wv.vectors = (n*m1.wv.vectors + m2.wv.vectors)/(n+1)
    m2.init_sims()
    return(m2)
def iterate_centroid(sents, wds, params=(2,5,20,2,0.001), sg=1, n=10, runs=5,
                     threshold=0.99, method="word2vec", show_progress=True):
    """
    iterate runs with the same parameters, comparing the n most similar words for each wd in wds
    params is: (min_count,window,size,workers,sample) = (2,5,20,2,0.001)
    after each run, average with the previous iteration to create a modified model
    threshold is the spearman rho that each of the wds must exceed between iterations
    method is either "word2vec" or "FastText"
    return a tuple of: the centroid model, whether we converged, and the first model
    """
    if method is None or method == "word2vec":
        modeler = models.Word2Vec
    elif method == "FastText":
        modeler = FastText
    else:
        raise ValueError("Unknown method: %s" % method)
    (min_count, window, size, workers, sample) = params
    firstm = modeler(sents, sg=sg, min_count=min_count, window=window,
                     sample=sample, size=size, workers=workers)
    prevm = firstm
    for i in range(runs):
        currm = modeler(sents, sg=sg, min_count=min_count, window=window,
                        sample=sample, size=size, workers=workers)
        # currm *must* be the second arg, since it gets modified by update_centroid
        newm = update_centroid(prevm, currm, i)
        total = 0
        met_thresh = True
        for x in compare_models_words(m1=prevm, m2=newm, wds=wds, n=n):
            total += x[1]
            met_thresh = met_thresh and (x[1] > threshold)
        if show_progress:
            print("Mean rho:\t%0.20f" % (total/len(wds)))
        if met_thresh:
            break
        prevm = newm
    return((newm, met_thresh, firstm))
def show_iterated_centroid(sents, wds, n=10, runs=40, threshold=0.99, method="word2vec"):
    """
    show the results of constructing an iterated centroid, using wds for the convergence check
    return the centroid, or the first model if it didn't converge
    """
    (m, OK, firstm) = iterate_centroid(sents, wds, n=n, runs=runs, threshold=threshold, method=method)
    if OK:
        print("\n%s centroid model" % method)
        vecs = m.wv
        for w in wds:
            print(w)
            if w not in vecs.vocab:
                print("\tOOV")
                continue
            for x in vecs.similar_by_word(w, topn=n, restrict_vocab=None):
                print("\t%s\t%f" % x)
            print()
    else:
        print("Didn't reach a spearman rho of %0.6f for every word" % threshold)
        print("\n%s first model" % method)
        vecs = firstm.wv
        for w in wds:
            print(w)
            if w not in vecs.vocab:
                print("\tOOV")
                continue
            for x in vecs.similar_by_word(w, topn=n, restrict_vocab=None):
                print("\t%s\t%f" % x)
            print()
        m = firstm
    return(m)
words = ["man", "woman", "pen", "fox", "sit", "about"]
word2 = "happy"
n = 5
nruns = 40
thresh = 0.99
method = "word2vec"
m = show_iterated_centroid(sents, words, n=n, runs=nruns, threshold=thresh, method=method)
# save it for future use
fnamev = fname + "-" + method + ".vecs"
m.wv.save(fnamev)
# we'll make another one for future use as well
(m2, _, _) = iterate_centroid(sents, words, params=(10,10,100,2,0.001), sg=1, n=10,
                              runs=5, threshold=0.99, method="word2vec", show_progress=False)
fnamev2 = fname + "-" + method + "-win10-dim100-thresh10.vecs"
m2.wv.save(fnamev2)
print("New word:", word2)
for x in m.wv.similar_by_word(word2, topn=n, restrict_vocab=None):
    print("\t%s\t%f" % x)
method="FastText"
m = show_iterated_centroid(sents,words, n=n, runs=nruns, threshold=thresh, method=method)
#save it for future use
fnamev = fname + "-" + method + ".vecs"
m.wv.save(fnamev)
print("New word:",word2)
for x in m.wv.similar_by_word(word2, topn=n, restrict_vocab=None):
print("\t%s\t%f" % x)
The differences between word2vec and FastText are quite striking, but they are not surprising, given that FastText is designed to find similarities among morphologically related words (e.g. happy and unhappy). Another thing to note is that in informal testing, FastText seems to converge to the desired threshold for the centroid in fewer iterations than word2vec. However, since FastText takes longer to run, there is no clear-cut speed advantage.
I should note that the convergence procedure used here is not guaranteed to converge, in particular when successive models remain further from the running centroid than the threshold allows. In practice, using uncommon words as the comparison words can lead to non-convergence, but using medium- to very common words seems to work well. An alternative to using the closest n words would be to compare the given word(s) to the entire vocabulary. In addition, instead of having a fixed set of comparison words, we could pick a random set of words, perhaps a different one at each iteration.
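A rough sketch of these two alternatives (my own illustration, using the same gensim API as above; compare2models_allvocab and random_probe_words are hypothetical helpers, not functions used elsewhere in this post):

import random
from scipy.stats import spearmanr

def compare2models_allvocab(m1, m2, wd):
    # variant of compare2models: rank-correlate wd's similarity to every
    # shared vocabulary word, rather than only the n closest words
    if wd not in m1.wv.vocab or wd not in m2.wv.vocab:
        return float("NaN")
    shared = [w for w in m1.wv.vocab if w in m2.wv.vocab and w != wd]
    sims1 = [m1.wv.similarity(wd, w) for w in shared]
    sims2 = [m2.wv.similarity(wd, w) for w in shared]
    return spearmanr(sims1, sims2)[0]

def random_probe_words(model, k=10, seed=None):
    # draw k random vocabulary words to use as the comparison words,
    # e.g. a fresh draw at each iteration
    rng = random.Random(seed)
    return rng.sample(list(model.wv.vocab), k)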
To sum up, using the (approximate) centroid model is a way to have the best of both worlds: we can keep the random aspects of the word vector approaches and still find a model that smooths over their instability. Given that it may take several iterations to find the centroid model, this approach may not be feasible for large corpora, but it is definitely feasible for small corpora.
[1] Johannes Hellrich and Udo Hahn. 2016. Bad Company—Neighborhoods in Neural Embedding Spaces Considered Harmful. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2785–2796, Osaka, Japan, December 11-17 2016.