# A new measure for evaluation

## Overview

This is one of a series of posts on using word vectors with small corpora. In this post I propose a simple measure for evaluating word vectors to take into account the limited vocabulary of small corpora.


## Background

One way of evaluating word vectors is to apply them to tasks that people do, and compare those results to people's results. Two common tasks are similarity (or relatedness) and analogies. For similarity, the task is to judge how similar (or related) two words are. For analogies, the task is to fill in the missing term in a series "A is to B as C is to ___". In this post, I will focus on similarity, but the measure proposed here can be applied to analogies as well, and to other types of evaluation. In the discussion below, following [1], I will use four standard testsets: ws353 [2], ws353_similarity [3], ws353_relatedness [4], and bruni_men [5].

Here are a couple of examples of applying the word vectors to the word similarity task, using the centroid models for Vanity Fair that we created previously.

In [1]:
from gensim import models

#for tables in Jupyter
from IPython.display import HTML, display
import tabulate

In [2]:
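vfair_w2v_f = 'vanity_fair_pg599.txt-sents-clean.txt-word2vec.vecs'
vfair_ft_f = 'vanity_fair_pg599.txt-sents-clean.txt-FastText.vecs'

# load the saved centroid models as KeyedVectors
vfair_w2v = models.KeyedVectors.load(vfair_w2v_f)
vfair_ft = models.KeyedVectors.load(vfair_ft_f)

def compare_pairs(vecs, pairs):
    # one row per pair: word1, word2, similarity of the two word vectors
    what = [list(p) + [vecs.similarity(p[0], p[1])] for p in pairs]
    display(HTML(tabulate.tabulate(what, headers=['word1', 'word2', 'similarity'], tablefmt='html')))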

In [3]:
our_pairs = (('woman','girl'),('woman','man'),('happy','sad'),('happy','unhappy'),('sad','unhappy'))

display(HTML('<b>Similarity of pairs of words, using word2vec centroid model of Vanity Fair</b>'))
compare_pairs(vfair_w2v,our_pairs)

Similarity of pairs of words, using word2vec centroid model of Vanity Fair

| word1 | word2 | similarity |
|---|---|---|
| woman | girl | 0.910268 |
| woman | man | 0.831149 |
| happy | unhappy | 0.897276 |

In [4]:
display(HTML('<b>Similarity of pairs of words, using FastText centroid model of Vanity Fair</b>'))
compare_pairs(vfair_ft,our_pairs)

Similarity of pairs of words, using FastText centroid model of Vanity Fair

| word1 | word2 | similarity |
|---|---|---|
| woman | girl | 0.918686 |
| woman | man | 0.92156 |
| happy | unhappy | 0.980426 |

We can see that there are differences between the word2vec model and the FastText model. For example, word2vec scores girl as more similar to woman than man is, while FastText reverses the order, though by a much smaller margin. Since absolute scores across models are not comparable, what is important is the relative ranking:

• word2vec for woman: girl > man
• FastText for woman: man > girl
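
Comparing orderings rather than raw scores is what a rank correlation does. The sketch below is a minimal, dependency-free illustration of that idea. The scores are made up for the example, the helper names `ranks` and `spearman_rho` are hypothetical, and the closed form used is only valid when there are no tied scores.

```python
def ranks(scores):
    """Return the rank (1 = lowest) of each score; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Made-up scores for four word pairs (illustration only, not from any testset)
human = [9.0, 7.5, 3.0, 1.0]
model_a = [0.91, 0.83, 0.90, 0.20]   # swaps the middle two pairs
model_b = [0.99, 0.98, 0.97, 0.96]   # different scale, same ordering as human

rho_a = spearman_rho(human, model_a)
rho_b = spearman_rho(human, model_b)
print(rho_a, rho_b)
```

Note that `model_b` agrees perfectly with the human ordering even though its scores are squeezed into a narrow band, which is why the absolute values need not be comparable across models.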

There has been considerable work in compiling human similarity judgments, using different criteria, including varying between "similar" (e.g. happy and glad) and "related" (e.g. happy and sad). The result is lists of pairs of words, each with a score reflecting human judgments of their similarity. The evaluation then consists of using the word vectors to compute similarities (as above) and comparing the ranking produced by the word vectors with the ranking produced by the humans, typically using the Spearman rho measure of rank correlation. Here's an example using the standard "ws353.txt" set of 353 word pairs.

In [5]:
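testsetdir = "testsets/ws/"

def eval_word_pairs_basic(vecs, testset):
    """
    return Spearman rho, recall
    """

    pairs = testsetdir + testset
    # count the test pairs: one per line, skipping blank lines and '#' comments
    with open(pairs) as f:
        num_tests = sum(1 for line in f if line.strip() and not line.startswith('#'))

    # triple (pearson, spearman, ratio of pairs with unknown words). [from documentation]
    results = vecs.evaluate_word_pairs(pairs, restrict_vocab=len(vecs.vocab), case_insensitive=True)

    rho = results[1][0]
    # results[2] is a percentage, so convert it back to a count of found pairs
    fnd = num_tests - round(num_tests * results[2] / 100)

    recall = fnd / num_tests

    return (rho, recall)
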


In [6]:
testset = "ws353.txt"

what = [["word2vec"] + list(eval_word_pairs_basic(vfair_w2v, testset)),
        ["FastText"] + list(eval_word_pairs_basic(vfair_ft, testset))]

display(HTML('<b>Correlation with %s similarity testset</b>' % testset))
display(HTML(tabulate.tabulate(what, headers=['', 'Spearman rho', 'recall'], tablefmt='html')))

Correlation with ws353.txt similarity testset

| | Spearman rho | recall |
|---|---|---|
| word2vec | -0.047289 | 0.362606 |
| FastText | 0.074689 | 0.362606 |

There is a slight negative correlation for word2vec and a slight positive correlation for FastText. However, we can also see that only 36.3% of the word pairs were found in the word vectors for Vanity Fair (the recall). The problem is that word vector testsets are designed to test vectors trained on large corpora of (mainly) contemporary language, while we have a small corpus of 19th century language.
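
The recall figure can be read as the share of test pairs for which both words have a vector. A minimal sketch of that bookkeeping follows, assuming a hypothetical toy vocabulary and pair list (neither is taken from Vanity Fair or ws353) and a hypothetical helper name `pair_recall`:

```python
def pair_recall(vocab, pairs):
    """Return (recall, found_pairs): a pair counts only if both words are in vocab."""
    found = [p for p in pairs if p[0] in vocab and p[1] in vocab]
    return len(found) / len(pairs), found


# Toy data for illustration only
toy_vocab = {"woman", "girl", "man", "happy", "unhappy", "carriage"}
toy_pairs = [("woman", "girl"), ("happy", "unhappy"),
             ("computer", "keyboard"), ("man", "internet")]

recall, found = pair_recall(toy_vocab, toy_pairs)
print(recall, found)
```

Only the found pairs contribute to the correlation, so rho on its own says nothing about how many pairs were skipped.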

To help better compare word vectors for small corpora, we might like to combine the Spearman rho and recall into a single measure. In other areas of computational linguistics, the F measure is used, which combines precision and recall:

$$F1 = \frac{2 * precision * recall}{precision + recall}$$

I propose using an analogy to the standard F measure. However, since similarity and relatedness are ranked measures, we don't have precision, but rather (typically) the Spearman $\rho$ measure of correlation. We can use this as a proxy for precision, with a slight adjustment. $\rho$ ranges over $[-1, 1]$, but for "precision" we need a value in the range $[0, 1]$. We can scale $\rho$ to get a new value $\rho'$ which is in that range:

$$\rho' = \frac{(1 + \rho)}{2}$$

(Another possibility, not explored here, would be to clip all negative Spearman values to 0.) Then we can formulate our new sF1 measure as:

$$sF1 = \frac{2 * \rho' * recall}{\rho' + recall}$$

Of course, we could have the usual family of F scores, but sF1 will suffice here.
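
As a sketch, the definition above can be written as a small standalone function. The name `s_f_score`, the `beta` parameter (following the usual F-beta weighting, with `beta=1` giving sF1), and the `clip_negative` switch for the alternative scaling are illustrative assumptions, not part of the evaluation code in this post.

```python
def s_f_score(rho, recall, beta=1.0, clip_negative=False):
    """Combine Spearman rho and recall in an F-style score.

    rho is mapped into [0, 1] either by (1 + rho) / 2 or, if
    clip_negative is True, by replacing negative values with 0.
    """
    rho_scaled = max(rho, 0.0) if clip_negative else (1 + rho) / 2
    denom = beta ** 2 * rho_scaled + recall
    if denom == 0:
        return 0.0  # both components are zero
    return (1 + beta ** 2) * rho_scaled * recall / denom


# ws353 figures for the word2vec model reported earlier in this post
sf1 = s_f_score(-0.047289, 0.362606)
sf1_clipped = s_f_score(-0.047289, 0.362606, clip_negative=True)
print(round(sf1, 4), sf1_clipped)
```

With the default scaling this gives roughly 0.41 for the word2vec model on ws353. With clipping, the slightly negative rho becomes 0 and so does the score, which shows how much harsher that alternative would be on weakly negative correlations.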

We can use the sF1 score to compare results across different sets of word vectors and different test sets.

In [7]:
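def eval_word_pairs_sF1(vecs, testset):
    """
    return s-F1, Spearman rho, recall
    """

    pairs = testsetdir + testset
    # count the test pairs: one per line, skipping blank lines and '#' comments
    with open(pairs) as f:
        num_tests = sum(1 for line in f if line.strip() and not line.startswith('#'))

    # triple (pearson, spearman, ratio of pairs with unknown words). [from documentation]
    results = vecs.evaluate_word_pairs(pairs, restrict_vocab=len(vecs.vocab), case_insensitive=True)

    rho = results[1][0]
    # results[2] is a percentage, so convert it back to a count of found pairs
    fnd = num_tests - round(num_tests * results[2] / 100)

    recall = fnd / num_tests
    # scale rho from [-1, 1] to [0, 1] so it can stand in for precision
    scorrelation = (1 + rho) / 2

    sF1 = 2 * scorrelation * recall / (scorrelation + recall)

    return (sF1, rho, recall)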

In [8]:
what = [["<b>Testset</b>","<b>sF1</b>","<b>Spearman</b>","<b>recall</b>","<b>sF1</b>","<b>Spearman</b>","<b>recall</b>"]]
for ts in ["ws353.txt", "ws353_similarity.txt", "ws353_relatedness.txt", "bruni_men.txt"]:
    r1 = [round(val,5) for val in eval_word_pairs_sF1(vfair_w2v,ts)]
    r2 = [round(val,5) for val in eval_word_pairs_sF1(vfair_ft,ts)]
    what.append([ts] + r1 + r2)

display(HTML('<b>Evaluation of similarity with various testsets using word2vec and FastText</b>'))
# 'unsafehtml' (available in recent tabulate releases) leaves the <b> tags in the header row unescaped
display(HTML(tabulate.tabulate(what, headers='firstrow', tablefmt='unsafehtml')))

Evaluation of similarity with various testsets using word2vec and FastText

| Testset | word2vec sF1 | word2vec Spearman | word2vec recall | FastText sF1 | FastText Spearman | FastText recall |
|---|---|---|---|---|---|---|
| ws353.txt | 0.41177 | -0.04729 | 0.36261 | 0.43301 | 0.07469 | 0.36261 |
| ws353_similarity.txt | 0.42673 | 0.01002 | 0.36946 | 0.43539 | 0.05995 | 0.36946 |
| ws353_relatedness.txt | 0.41134 | -0.08313 | 0.37302 | 0.43734 | 0.05694 | 0.37302 |
| bruni_men.txt | 0.49454 | 0.20521 | 0.41933 | 0.49299 | 0.19609 | 0.41933 |

The differences between word2vec and FastText here are not that striking, which is unsurprising since both models were trained with the same parameters. However, when we compare one set of parameters to another, the differences can be stronger.

Here we compare two word2vec models: our original, and a second one trained with different parameters.

In [9]:
display(HTML('<b>Parameters for two word2vec models</b>'))
display(HTML('<table><tr><th>model</th><th>min_count</th><th>window</th><th>dimensions</th></tr><tr><td>original</td><td>2</td><td>5</td><td>20</td></tr><tr><td>model2</td><td>10</td><td>10</td><td>100</td></tr></table>'))

Parameters for two word2vec models

| model | min_count | window | dimensions |
|---|---|---|---|
| original | 2 | 5 | 20 |
| model2 | 10 | 10 | 100 |

In [10]:
vfair_w2v2 = models.KeyedVectors.load('vanity_fair_pg599.txt-sents-clean.txt-word2vec-win10-dim100-thresh10.vecs')

what = [["<b>Testset</b>","<b>sF1</b>","<b>Spearman</b>","<b>recall</b>","<b>sF1</b>","<b>Spearman</b>","<b>recall</b>"]]
for ts in ["ws353.txt", "ws353_similarity.txt", "ws353_relatedness.txt", "bruni_men.txt"]:
    r1 = [round(val,5) for val in eval_word_pairs_sF1(vfair_w2v,ts)]
    r2 = [round(val,5) for val in eval_word_pairs_sF1(vfair_w2v2,ts)]
    what.append([ts] + r1 + r2)

display(HTML('<b>Evaluation of similarity with various testsets with two different word2vec models</b>'))
# 'unsafehtml' (available in recent tabulate releases) leaves the <b> tags in the header row unescaped
display(HTML(tabulate.tabulate(what, headers='firstrow', tablefmt='unsafehtml')))

Evaluation of similarity with various testsets with two different word2vec models

| Testset | original sF1 | original Spearman | original recall | model2 sF1 | model2 Spearman | model2 recall |
|---|---|---|---|---|---|---|
| ws353.txt | 0.41177 | -0.04729 | 0.36261 | 0.26883 | 0.14546 | 0.17564 |
| ws353_similarity.txt | 0.42673 | 0.01002 | 0.36946 | 0.234 | 0.12348 | 0.14778 |
| ws353_relatedness.txt | 0.41134 | -0.08313 | 0.37302 | 0.2856 | 0.14097 | 0.19048 |
| bruni_men.txt | 0.49454 | 0.20521 | 0.41933 | 0.28953 | 0.45708 | 0.18067 |

The testset bruni_men clearly shows the importance of the sF1 measure. Model2 has a much higher Spearman correlation than the original model, but a much lower recall. As a consequence, the sF1 score for model2 is much lower than the sF1 score for the original model.
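
One way to see why recall dominates here: sF1 is a harmonic mean and grows with $\rho'$, so even a perfect correlation ($\rho' = 1$) cannot lift it above $2 \cdot recall / (1 + recall)$. A small sketch of that bound, using the bruni_men figures from the table above; the function name `sf1_upper_bound` is hypothetical.

```python
def sf1_upper_bound(recall):
    """Largest sF1 reachable at a given recall, i.e. sF1 with rho' = 1."""
    return 2 * recall / (1 + recall)


model2_recall = 0.18067      # bruni_men recall for model2
original_sf1 = 0.49454       # bruni_men sF1 for the original model

bound = sf1_upper_bound(model2_recall)
print(round(bound, 3), bound < original_sf1)
```

With recall at about 0.18, model2 could not exceed an sF1 of roughly 0.31 on this testset whatever its correlation, which is still below the 0.49 of the original model.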

The other thing to note is that the performance of each model varies widely across the testsets. This type of variation is also seen using large corpora. However, with small corpora, the issue of low recall is more important, so the use of the sF1 score lets us take recall into account directly.

## Discussion and Conclusion

The literature and tools for evaluating word vectors on similarity (and analogies and others) are based on the assumption that the word vectors will contain nearly all of the words being tested, so the rank correlation alone can reasonably be relied on for comparison. However, with small corpora this assumption does not hold, and so we need another measure, like sF1, to make more useful comparisons.


## References

The testsets are included with the hyperwords package [7], while the base evaluation is done using gensim [6]. sF1 is my own calculation.

[1] Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225.

[2] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131.

[3] Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Using wiktionary for computing semantic relatedness. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, AAAI’08, 861–866. AAAI Press.

[4] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27, Boulder, Colorado, June. Association for Computational Linguistics.

[5] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145, Jeju Island, Korea, July. Association for Computational Linguistics.

[6] Gensim: https://radimrehurek.com/gensim/, published as: Software Framework for Topic Modelling with Large Corpora. Radim Řehůřek and Petr Sojka, Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45-50, 22 May 2010.

[7] Hyperwords: https://bitbucket.org/omerlevy/hyperwords, published as [1]