---
title: "Word vectors with small corpora:<br>Finding useful parameters and methods"
output: html_notebook
fig_caption: yes
---

## © 2018 Chris Culy, March 2018
### [chrisculy.net](http://chrisculy.net/)

## Overview
This is one of a [series of posts](wvecs_intro.html) on using word vectors with small corpora. In this post I discuss one approach to finding useful parameters and methods, by evaluating word vector models against standard testbeds. 

The surprising results are that including all words is more effective than eliminating low frequency words, and that the choice of window size and the number of dimensions is not significant for these small corpora.

## Background

In the previous posts ([Stabilizing randomness](wvecs_random_fix.html) and [A new measure for evaluation](wvecs_evaluation_measure.html)), we discussed three different word vector models: positive point-wise mutual information + SVD (ppmi_svd), word2vec, and FastText. There are various parameters to set when constructing a word vector model. Therefore, we would like to know which models and which settings will be the most useful.

As mentioned in the [evaluation post](wvecs_evaluation_measure.html), one way of evaluating word vectors is to apply them to tasks that people do, and compare those results to people's results. Two common tasks are similarity (or relatedness) and analogies. For similarities, the task is to judge how similar (or related) two words are. For analogies, the task is to fill in the missing term in a series "A is to B as C is to —". In this post, I will focus on similarity, and following [[1]](#ref1), I will use four standard testsets: _ws353_ [[2]](#ref2), _ws353_similarity_ [[3]](#ref3), _ws353_relatedness_ [[4]](#ref4), and _bruni_men_ [[5]](#ref5). 

## Methodology
### Approach
My general approach is similar to that in [[1]](#ref1), namely try all combinations of some set of parameter values (i.e. a grid search), evaluating each one with each model. However, since the centroid models are significantly slower to calculate than the ppmi_svd models, I will focus on the ppmi_svd models. As an example of the relative times, here are the times to do the 100 combinations of parameters discussed below applied to the novel _Vanity Fair_, using a 2.9 GHz Intel Core i7 laptop with 8GB of RAM:

* ppmi_svd: ~ 2 hours
* word2vec: ~ 5 hours
* FastText: ~ 14.75 hours

While there are multiple parameters to set when setting up a word vector model, I will focus here on the parameters of window size (_win_), number of dimensions (_dim_), and the minimal count for items to be included (_min_count_).

### Texts
For the corpora, I chose to use 19th century novels written in English. I arbitrarily used this list of "best" novels written in English:

[https://www.theguardian.com/books/2015/aug/17/the-100-best-novels-written-in-english-the-full-list](https://www.theguardian.com/books/2015/aug/17/the-100-best-novels-written-in-english-the-full-list)

From that list, I excluded children's books (_Alice's Adventures in Wonderland_, _Little Women_, and _Huckleberry Finn_), which leaves the following texts:

* 7. Emma by Jane Austen (1816)
* 8. Frankenstein by Mary Shelley (1818)
* 9. Nightmare Abbey by Thomas Love Peacock (1818)
* 10. The Narrative of Arthur Gordon Pym of Nantucket by Edgar Allan Poe (1838)
* 11. Sybil by Benjamin Disraeli (1845)
* 12. Jane Eyre by Charlotte Brontë (1847)
* 13. Wuthering Heights by Emily Brontë (1847)
* 14. Vanity Fair by William Thackeray (1848)
* 15. David Copperfield by Charles Dickens (1850)
* 16. The Scarlet Letter by Nathaniel Hawthorne (1850)
* 17. Moby-Dick by Herman Melville (1851)
* 19. The Moonstone by Wilkie Collins (1868)
* 21. Middlemarch by George Eliot (1871-2)
* 22. The Way We Live Now by Anthony Trollope (1875)
* 24. Kidnapped by Robert Louis Stevenson (1886)
* 25. Three Men in a Boat by Jerome K Jerome (1889)
* 26. The Sign of Four by Arthur Conan Doyle (1890)
* 27. The Picture of Dorian Gray by Oscar Wilde (1891)
* 28. New Grub Street by George Gissing (1891)
* 29. Jude the Obscure by Thomas Hardy (1895)
* 30. The Red Badge of Courage by Stephen Crane (1895)
* 31. Dracula by Bram Stoker (1897)
* 32. Heart of Darkness by Joseph Conrad (1899)

I used the most recent versions of the texts from [Project Gutenberg](https://www.gutenberg.org), with the following manual processing:

* Convert to UTF-8
* LF line endings
* Remove Project Gutenberg transcriber material
* Keep the title and TOC, but not editor notes
* Remove editorial, but not authorial, footnotes
* Remove transcriber annotations for illustrations, but keep original captions

Of those 23 books, I used 20 for testing, and held out 3 for potential evaluation: the third shortest (_The Sign of Four_), the median (_Jude the Obscure_), and the third longest (_Middlemarch_). That makes an 87%-13% test/evaluation split. In this post, I discuss only the 20 testing texts.

### Vectors and evaluation

To create the ppmi_svd models, I used the hyperwords [[6]](#ref6) package, which I ported to Python 3. To create the other models I used gensim [[7]](#ref7). All the evaluations were done using gensim, skipping unknown words (`dummy4unknown=False`), with additional custom code for calculating the sF1 score (see below):

`vecs.evaluate_word_pairs(pairs, restrict_vocab=len(vecs.vocab), case_insensitive=True, dummy4unknown=False)`

In creating the ppmi_svd vectors, there are two other hyperparameters to set. Following [[1]](#ref1), I set Context Distribution Smoothing (_cds_) to 0.75. As well, to prevent subsampling (which introduces an element of randomness), I set _sub_ to 0. In addition, in order to use gensim to evaluate the ppmi_svd vectors created by hyperwords, they first had to be converted to the word2vec format.
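
Since the word2vec text format is just a header line (vocabulary size and number of dimensions) followed by one word and its vector per line, the conversion is mechanical. Here is a minimal sketch, not the actual conversion script; `words` and `mat` are hypothetical names for the hyperwords vocabulary and its SVD-reduced matrix:

```python
# Assumed sketch: write vectors in word2vec's plain-text format so that
# gensim can load them. `words` is the vocabulary list, `mat` the matrix
# whose rows are the corresponding vectors (both hypothetical names).
def save_word2vec_text(path, words, mat):
    with open(path, 'w', encoding='utf-8') as f:
        f.write('%d %d\n' % (len(words), mat.shape[1]))  # header: vocab size, dimensions
        for word, row in zip(words, mat):
            f.write(word + ' ' + ' '.join('%.6f' % x for x in row) + '\n')

# The saved file can then be loaded with gensim for evaluation, e.g.:
# vecs = KeyedVectors.load_word2vec_format('vfair-ppmi_svd.txt', binary=False)
```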

As discussed in the [evaluation measure](wvecs_evaluation_measure.html) post, small corpora do not contain all the words of the evaluation sets, which greatly affects recall and which makes the Spearman $\rho$ measure of correlation less useful on its own. In order to have a single measure which combines recall and Spearman $\rho$, we use the analogue of the F1 measure, scaling $\rho$ ($\rho'$) to be in the range [0,1] and using that in place of precision. The result is the _sF1_ score:

  $$sF1 = \frac{2 * \rho' * recall}{\rho' + recall}$$
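
As a concrete illustration, here is a minimal sketch of the sF1 calculation (an assumed reconstruction of the custom code, taking the linear rescaling $\rho' = (\rho+1)/2$ and deriving recall from the out-of-vocabulary ratio that gensim reports):

```python
def sf1_score(vecs, pairs_file):
    # evaluate_word_pairs returns (pearson, spearman, oov_ratio); pearson and
    # spearman are (correlation, p-value) tuples, oov_ratio is a percentage
    pearson, spearman, oov_ratio = vecs.evaluate_word_pairs(
        pairs_file, restrict_vocab=len(vecs.vocab),
        case_insensitive=True, dummy4unknown=False)
    rho_prime = (spearman[0] + 1) / 2  # rescale rho from [-1,1] to [0,1]
    recall = 1 - oov_ratio / 100       # fraction of test pairs actually evaluated
    if rho_prime + recall == 0:        # degenerate case: sF1 undefined
        return float('nan')
    return 2 * rho_prime * recall / (rho_prime + recall)
```

For example, $\rho = 0.3$ with recall 0.4 gives $\rho' = 0.65$ and $sF1 = 2 \cdot 0.65 \cdot 0.4 / (0.65 + 0.4) \approx 0.50$.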

Based on the literature and on a small pilot test, I tested these parameter settings (a sketch of the search loop follows the list):

* min_count: 1, 3, 5, 10, 20
* win: 2, 5, 10, 20
* dim: 25, 50, 100, 200, 400
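
The search itself is then a straightforward triple loop over those values, run for each text and each method. A hypothetical driver (sketch only: `build_vectors`, `load_testset`, and `text` are placeholders for the hyperwords/gensim pipelines and file handling described above):

```python
from itertools import product

MIN_COUNTS = [1, 3, 5, 10, 20]
WINS = [2, 5, 10, 20]
DIMS = [25, 50, 100, 200, 400]
TESTSETS = ['ws353', 'ws353_similarity', 'ws353_relatedness', 'bruni_men']

results = []
# 5 * 4 * 5 = 100 parameter combinations per text and method
for min_count, win, dim in product(MIN_COUNTS, WINS, DIMS):
    vecs = build_vectors(text, win=win, dim=dim, min_count=min_count)  # placeholder
    for testset in TESTSETS:
        # sf1_score as sketched above
        results.append((min_count, win, dim, testset,
                        sf1_score(vecs, load_testset(testset))))
```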

## The testing

```{r message=FALSE}
# Load libraries
library(readr)
library(tidyverse)
```

There are a few cases where zero or only one of the test pairs is in the vocabulary, and hence Spearman's rho, and by extension sF1, are not defined. We'll remove those cases from further consideration.

```{r message=FALSE}

eval_texts <- list('sign4','jude','midmarch')
test_names <- list("ws353","ws353_similarity","ws353_relatedness","bruni_men")

#Load data
types_tokens <- read_delim("types_tokens.csv", "\t", escape_double = FALSE, trim_ws = TRUE) %>%
  mutate(ttr = types/tokens)

SVD_sim_tests <- read_delim("testsets_parameters/SVD-sim_tests.csv", "\t", escape_double = FALSE, trim_ws = TRUE) %>%
  mutate(method="ppmi_svd")

word2vec_sim_tests <- read_delim("testsets_parameters/word2vec-sim_tests.csv", "\t", escape_double = FALSE, trim_ws = TRUE) %>%
  mutate(method="word2vec")

FastText_sim_tests <- read_delim("testsets_parameters/FastText-sim_tests.csv", "\t", escape_double = FALSE, trim_ws = TRUE) %>%
  mutate(method="FastText")

sim_tests <- rbind(SVD_sim_tests,word2vec_sim_tests,FastText_sim_tests) %>%
  inner_join(types_tokens) %>%
  rename(min_count=thresh) %>% 
  filter(! is.nan(sF1))


sim_evals <- sim_tests %>%
  filter(text %in% eval_texts)

sim_tests <- sim_tests %>%
  filter(!(text %in% eval_texts))

rm(SVD_sim_tests,word2vec_sim_tests,FastText_sim_tests)

```

We'll look at the minimum count first.

```{r fig.height=8, fig.width=12}
sim_tests %>%
  ggplot() + theme_classic() + labs(title=paste("min_count vs sF1")) +
  geom_point(aes(min_count,sF1,color=text)) +
  geom_line(aes(min_count,sF1,color=text)) +
  facet_wrap(testset ~ method, ncol=3)
```

The best sF1 scores are always with a min_count of 1, i.e. including all words. This is in contrast to the standard practice with large corpora of eliminating low frequency words (a min_count of 100 is common). One possible reason for the success of including all the words is that it gives us more information to work with.

From here on, we'll work only with the subset of parameter settings where min_count is 1.

```{r}
sim_tests_1 <- sim_tests %>% filter(min_count==1)
```


Next we can look at the window and dimension sizes, restricting our attention to the ppmi_svd method. The best scores are distributed across win + dim combinations. Exactly the same distribution holds for all the testsets.

```{r}
sim_tests_1 %>% filter(method=='ppmi_svd') %>%
  group_by(testset,text) %>%
  filter(sF1==max(sF1)) %>%
  ungroup() %>%
  group_by(testset,win,dim) %>%
  summarise(counts=n()) %>%
  spread(testset,counts) %>%
  arrange(-ws353,win,dim)
```


The amount of variation in sF1 score for a given text is fairly small, both across testsets and across parameter settings. The greatest variation in sF1 scores is across different texts.

```{r fig.height=6, fig.width=8}
ranges <- sim_tests_1 %>% 
  group_by(testset,text) %>%
  mutate(best=max(sF1), worst=min(sF1), range=best-worst) %>% 
  select(testset,text,best,worst,range) %>%
  unique()

bm <- filter(ranges, testset=='bruni_men') %>% arrange(range)
ranges$text <- factor(ranges$text, levels = bm$text[order(-bm$range)])

ranges %>%
  ggplot() + theme_classic() + labs(title="Ranges of sF1 scores by testset") +
  theme(axis.text.x=element_text(angle=45,hjust=1,vjust=1)) +
  geom_linerange(aes(x=text,ymin=worst,ymax=best,color=text), show.legend = FALSE) +
  facet_wrap(~testset)
```

```{r}
ranges %>% 
  ungroup() %>%
  group_by(text) %>%
  mutate(best_sF1=max(best),worst_sF1=min(worst),sF1_range=best_sF1-worst_sF1) %>%
  select(text,best_sF1,worst_sF1,sF1_range) %>% 
  unique() %>%
  arrange(-sF1_range)
```


We can use linear regression for another take on the issue of parameters as predictors of the sF1 score. When we do that, we see that win and dim are not significant predictors of the sF1 score, and neither is the testset. Only the inherent properties of the text (types, tokens, and their ratio) are significant predictors.


```{r}
m <- lm(sF1 ~ win*dim + tokens + types + ttr + testset,
        data=filter(sim_tests_1,method=='ppmi_svd'))

print(anova(m))
```

```{r message=FALSE, eval=FALSE}
pts <- data.frame(predicted=fitted(m), residuals=residuals(m))

ggplot(pts) + theme_classic() +
  labs(title="Predicted vs residuals") + 
  geom_point(aes(predicted,residuals))

```

Next we look at how tokens, types, and their ratio are related to sF1. For simplicity, we'll limit ourselves to bruni_men.

```{r fig.height=6}

sim_tests_1 %>% filter(testset=='bruni_men') %>%
  ggplot() + theme_classic() + labs(title=paste("tokens vs sF1 for bruni_men")) +
  scale_x_continuous(labels = scales::comma) +
  geom_point(aes(tokens,sF1,color=text)) +
  geom_smooth(aes(tokens,sF1), color='orange', method='loess')
```

```{r fig.height=6}
sim_tests_1 %>% 
  filter(testset=='bruni_men') %>%
  ggplot() + theme_classic() + labs(title=paste("types vs sF1 for bruni_men")) +
  scale_x_continuous(labels = scales::comma) +
  geom_point(aes(types,sF1,color=text)) +
  geom_smooth(aes(types,sF1), color='orange', method='loess')
```

The connection between types (vocabulary size) and sF1 score is not surprising, since the vocabulary size affects recall, and recall is the dominant aspect of the sF1 score for good scores (remember that we're only considering settings with a min_count of 1, since those produced the best scores).

```{r fig.height=6}
sim_tests_1 %>% 
  filter(testset=='bruni_men') %>%
  ggplot() + theme_classic() + labs(title=paste("recall vs sF1 for bruni_men")) +
  geom_point(aes(recall,sF1,color=text,shape=testset)) +
  geom_smooth(aes(recall,sF1), color='orange', method='loess')
```

In fact, for small corpora, the Spearman $\rho$ _alone_ is not a great indicator of quality.

```{r fig.height=6}
sim_tests_1 %>% 
  filter(testset=='bruni_men') %>%
  ggplot() + theme_classic() + labs(title=paste("spearman vs sF1 for bruni_men")) +
  geom_point(aes(spearman,sF1,color=text,shape=testset))
```

Finally we can look at the type/token ratio versus the sF1 score. We see that the trend has a distinct rise until a ttr of about 0.06, and then an almost linear decline.

```{r fig.height=6}
sim_tests_1 %>% 
  filter(testset=='bruni_men') %>%
  ggplot() + theme_classic() + labs(title=paste("type/token ratio vs sF1 for bruni_men")) +
  geom_point(aes(ttr,sF1,color=text)) +
  geom_smooth(aes(ttr,sF1), color='orange', method='loess')
```

Of those three inherent properties, the number of types (i.e. vocabulary size) shows the most straightforward connection with sF1.

Given that win and dim _are_ important for large corpora, we should wonder at what point they _do_ become significant. We can divide our test texts into 3 rough groups based on their vocabulary size, using 8000 and 12000 types as the dividing lines:

```{r}
texts_grouped <- types_tokens %>% 
  filter(!(text %in% eval_texts)) %>%
  mutate(size_group=if_else(types<8000,"small",
                            if_else(types<12000,"medium","large")))

texts_grouped %>%
  select(size_group,text,types) %>%
  arrange(types,text)
```


When we do the regression models, it turns out that neither win nor dim is significant for the small texts; win is _slightly_ (p<0.05) significant for the medium texts; and both win and dim are highly significant for the large texts.

```{r}
#see if win/dim are significant with <8000 types
m8k <- lm(sF1 ~ win*dim + tokens + types + ttr + testset,
        data=filter(sim_tests_1,method=='ppmi_svd',types<8000))

print(anova(m8k))
```

```{r}
#see if win/dim are significant with [8000,12000) types
m8_12k <- lm(sF1 ~ win*dim + tokens + types + ttr + testset,
        data=filter(sim_tests_1,method=='ppmi_svd',types>=8000 & types<12000))

print(anova(m8_12k))
```

```{r}
#see if win/dim are significant with >=12000 types

sim_tests_12k <- filter(sim_tests_1,method=='ppmi_svd',types>=12000)

m12k <- lm(sF1 ~ win*dim + tokens + types + ttr + testset,
        data=sim_tests_12k)

print(anova(m12k))
```

### The best parameters for the four largest corpora
```{r}
sim_tests_12k %>%
  group_by(text) %>%
  filter(sF1==max(sF1)) %>%
  select(text,tokens,sF1,spearman,recall,win,dim) %>%
  arrange(-tokens)
```
Even though win and dim are significant, there is no obvious pattern. There are only 4 "large" texts; even so, the boundary at which win and dim become significant, as well as any patterns in their values, is worth exploring further.

## Comparing methods

As I mentioned at the beginning, I haven't done the full complement of parameters across all the texts for word2vec and FastText. However, we can compare _Vanity Fair_ (a long text) and _Heart of Darkness_ (a short text) across all 3 methods. We already saw above that min_count=1 is always the best across the methods, so again, we'll limit our attention to that subset.

```{r}
sims_vfhd <- sim_tests %>% filter(text=='vfair' | text=='heartd', min_count==1)
```

Fitting a simplified regression model, we see that _method_ is significant, at least with these two texts, as is _dim_.


```{r}
m_vfhd <- lm(sF1 ~ win*dim +  tokens + testset + method,
        data=sims_vfhd)

print(anova(m_vfhd))
```

Taking just the best sF1 scores and using the bruni_men testset as an example, we have the following table. 

### Comparing methods for _Vanity Fair_ and _Heart of Darkness_
```{r fig.height=3}
sims_vfhd %>% filter(testset=="bruni_men") %>%
  group_by(method,text) %>%
  filter(sF1==max(sF1)) %>%
  select(text,sF1,method,dim) %>%
  arrange(-sF1,method,text,dim)
```
With respect to method, ppmi_svd has its highest sF1 score for the small _Heart of Darkness_, and its lowest for _Vanity Fair_, while FastText is the opposite. 

For dimensions, FastText has its highest sF1 score with a small number of dimensions; ppmi_svd has its highest sF1 score with a larger number of dimensions; and word2vec is mixed.

Of course, it would be good to have a full complement of sF1 scores for different corpus sizes, but at the very least we can tell already that there is no commonality in the parameter settings across the methods.

## Discussion and Conclusion

In the spirit of [[1]](#ref1), here are a few "lessons learned":

* Including all the words (min_count=1) is the best strategy for all small corpora
* For the smallest and relatively small corpora (< 12000 _types_), the choices of window size and number of dimensions are not significant
* The choice of which method to use will depend partly on corpus size, but more on computational constraints if the centroid technique is used. In addition, testing on the specific corpus of interest is relevant.

Now that we've gone through these evaluations with the word similarity testsets, it is worth questioning how useful these testsets are for evaluating small corpora. First, we saw above that the vocabulary size is very strongly associated with the sF1 score. Second, we can ask whether these testsets are even relevant, especially in the context of inquiries into authors' use of words. 

The next post, [Exploring similarities](wvecs_exploring_similarities.nb.html), will explore a different perspective on parameter settings.

[Back to the introduction](wvecs_intro.html)

## Other posts

* [Stabilizing randomness](wvecs_random_fix.html)
* [A new measure for evaluation](wvecs_evaluation_measure.html)
* [Exploring similarities](wvecs_exploring_similarities.nb.html)
* [Visualizing word vectors](wvecs_visualization.html)

## References

<span id="ref1">[1]</span> Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225.

<span id="ref2">[2]</span> Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131.

<span id="ref3">[3]</span> Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Using wiktionary for computing semantic relatedness. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, AAAI’08, 861–866. AAAI Press.

<span id="ref4">[4]</span> Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27, Boulder, Colorado, June. Association for Computational Linguistics.

<span id="ref5">[5]</span> Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145, Jeju Island, Korea, July. Association for Computational Linguistics.

<span id="ref6">[6]</span> Hyperwords: https://bitbucket.org/omerlevy/hyperwords, published as [1](#ref1)

<span id="ref7">[7]</span> Gensim: https://radimrehurek.com/gensim/, published as: Software Framework for Topic Modelling with Large Corpora. Radim Řehůřek and Petr Sojka, Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45-50, 22 May 2010.

