© 2018 Chris Culy, March 2018

Overview

This is one of a series of posts on using word vectors with small corpora. In this post I discuss one approach to finding useful parameters and methods, by evaluating word vector models against standard testbeds.

The surprising results are that including all words is more effective than eliminating low-frequency words, and that the choice of window size and the number of dimensions does not have a significant effect for these small corpora.

Background

In the previous posts (Stabilizing randomness and A new measure for evaluation), I discussed three different word vector models: positive point-wise mutual information + SVD (ppmi_svd), word2vec, and FastText. There are various parameters to set when constructing a word vector model, so we would like to know which models and which settings will be the most useful.

As mentioned in the evaluation post, one way of evaluating word vectors is to apply them to tasks that people do, and compare those results to people’s results. Two common tasks are similarity (or relatedness) and analogies. For similarities, the task is to judge how similar (or related) two words are. For analogies, the task is to fill in the missing term in a series “A is to B as C is to —”. In this post, I will focus on similarity, and following [1], I will use four standard testsets: ws353 [2], ws353_similarity [3], ws353_relatedness [4], and bruni_men [5].

Methodology

Approach

My general approach is similar to that in [1]: try all combinations of some set of parameter values (i.e. a grid search), evaluating each combination with each model. However, since the centroid models (word2vec and FastText) are significantly slower to calculate than the ppmi_svd models, I will focus on the ppmi_svd models. As an example of the relative times, here are the times to run the 100 parameter combinations discussed below on the novel Vanity Fair, using a 2.9 GHz Intel Core i7 laptop with 8GB of RAM:

  • ppmi_svd: ~ 2 hours
  • word2vec: ~ 5 hours
  • FastText: ~ 14.75 hours

While there are multiple parameters to set when constructing a word vector model, I will focus here on three: the window size (win), the number of dimensions (dim), and the minimum count for a word to be included (min_count).

Texts

For the corpora, I chose to use 19th century novels written in English. I arbitrarily used this list of “best” novels written in English:

https://www.theguardian.com/books/2015/aug/17/the-100-best-novels-written-in-english-the-full-list

From that list, I excluded children’s books (Alice’s Adventures in Wonderland, Little Women, and Huckleberry Finn), which leaves the following texts:

    1. Emma by Jane Austen (1816)
    2. Frankenstein by Mary Shelley (1818)
    3. Nightmare Abbey by Thomas Love Peacock (1818)
    4. The Narrative of Arthur Gordon Pym of Nantucket by Edgar Allan Poe (1838)
    5. Sybil by Benjamin Disraeli (1845)
    6. Jane Eyre by Charlotte Brontë (1847)
    7. Wuthering Heights by Emily Brontë (1847)
    8. Vanity Fair by William Thackeray (1848)
    9. David Copperfield by Charles Dickens (1850)
    10. The Scarlet Letter by Nathaniel Hawthorne (1850)
    11. Moby-Dick by Herman Melville (1851)
    12. The Moonstone by Wilkie Collins (1868)
    13. Middlemarch by George Eliot (1871-2)
    14. The Way We Live Now by Anthony Trollope (1875)
    15. Kidnapped by Robert Louis Stevenson (1886)
    16. Three Men in a Boat by Jerome K Jerome (1889)
    17. The Sign of Four by Arthur Conan Doyle (1890)
    18. The Picture of Dorian Gray by Oscar Wilde (1891)
    19. New Grub Street by George Gissing (1891)
    20. Jude the Obscure by Thomas Hardy (1895)
    21. The Red Badge of Courage by Stephen Crane (1895)
    22. Dracula by Bram Stoker (1897)
    23. Heart of Darkness by Joseph Conrad (1899)

I used the most recent versions of the texts from Project Gutenberg, with the following manual processing:

  • Convert to UTF-8
  • Use LF line endings
  • Remove Project Gutenberg transcriber material
  • Keep the title and table of contents, but not editor notes
  • Remove editorial footnotes, but keep authorial footnotes
  • Remove transcriber annotations for illustrations, but keep original captions

Of those 23 books, I used 20 for testing, and held out 3 for potential evaluation: the third shortest (The Sign of the Four), the median (Jude the Obscure), and the third longest (Middlemarch). That makes an 87%-13% test/evaluation split. In this post, I discuss only the 20 testing texts.

Vectors and evaluation

To create the ppmi_svd models, I used the hyperwords [6] package, which I ported to python3. To create the other models I used gensim [7]. All the evaluations were done using gensim, skipping unknown words (dummy4unknown=False), with additional custom code for calculating the sF1 score (see below):

vecs.evaluate_word_pairs(pairs, restrict_vocab=len(vecs.vocab), case_insensitive=True, dummy4unknown=False)

In creating the ppmi_svd vectors, there are two other hyperparameters to set. Following [1], I set Context Distribution Smoothing (cds) to 0.75, and to avoid the element of randomness introduced by subsampling, I set sub to 0. Finally, in order to use gensim to evaluate the ppmi_svd vectors created by hyperwords, they first had to be converted to the word2vec format.
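As a rough illustration, that conversion amounts to writing the vectors out in the plain-text word2vec format that gensim can read. The sketch below is a minimal Python version under my own assumptions about the in-memory representation (a vocabulary list plus a NumPy matrix of vectors) and the file name; it is not the exact code used for this post.

from gensim.models import KeyedVectors

def save_as_word2vec_text(words, matrix, path):
    # Header line: "<number of words> <number of dimensions>",
    # then one line per word: "word v1 v2 ... vd"
    with open(path, "w", encoding="utf-8") as out:
        out.write(f"{len(words)} {matrix.shape[1]}\n")
        for word, row in zip(words, matrix):
            out.write(word + " " + " ".join(f"{x:.6f}" for x in row) + "\n")

# Hypothetical usage, with vocab a list of words and vectors the matching matrix:
# save_as_word2vec_text(vocab, vectors, "vanityfair_ppmi_svd.txt")
# vecs = KeyedVectors.load_word2vec_format("vanityfair_ppmi_svd.txt", binary=False)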

As discussed in the evaluation measure post, small corpora do not contain all the words of the evaluation sets, which greatly affects recall and which makes the Spearman \(\rho\) measure of correlation less useful on its own. In order to have a single measure which combines recall and Spearman \(\rho\), we use the analogue of the F1 measure, scaling \(\rho\) (\(\rho'\)) to be in the range [0,1] and using that in place of precision. The result is the sF1 score:

\[sF1 = \frac{2 \cdot \rho' \cdot \mathit{recall}}{\rho' + \mathit{recall}}\]
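For concreteness, here is a small Python sketch of the score. The linear rescaling \(\rho' = (\rho + 1)/2\) is my assumption about how \(\rho\) is mapped into [0,1], and the numbers in the example are made up.

def sf1(rho, recall):
    # Harmonic mean of the rescaled Spearman correlation and recall
    rho_prime = (rho + 1) / 2.0   # assumed rescaling of rho from [-1,1] to [0,1]
    if rho_prime + recall == 0:
        return float("nan")
    return 2 * rho_prime * recall / (rho_prime + recall)

# E.g. rho = 0.4 on the pairs that were found, with 60% of the pairs found:
# sf1(0.4, 0.6)  # about 0.646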

Based on the literature and on a small pilot test, I tested these parameter settings (a sketch of the resulting grid search follows the list):

  • min_count: 1, 3, 5, 10, 20
  • win: 2, 5, 10, 20
  • dim: 25, 50, 100, 200, 400
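The sketch below shows what such a grid search might look like in Python, using gensim's Word2Vec (4.x parameter names) purely as an illustration; the ppmi_svd models in this post are actually built with the hyperwords pipeline, and the training settings shown here are assumptions rather than the post's exact configuration.

from itertools import product
from gensim.models import Word2Vec

MIN_COUNTS = [1, 3, 5, 10, 20]
WINDOWS = [2, 5, 10, 20]
DIMS = [25, 50, 100, 200, 400]

def grid_search(sentences, pairs_path):
    # Train one model per parameter combination (5 * 4 * 5 = 100) and record its sF1
    results = []
    for mc, win, dim in product(MIN_COUNTS, WINDOWS, DIMS):
        model = Word2Vec(sentences, min_count=mc, window=win, vector_size=dim)
        # evaluate_word_pairs returns (pearson, spearman, oov_ratio); oov_ratio is a percentage
        _, spearman, oov = model.wv.evaluate_word_pairs(pairs_path, dummy4unknown=False)
        rho_prime = (spearman[0] + 1) / 2   # rescale rho to [0,1], as above
        recall = 1 - oov / 100.0            # fraction of test pairs found in the vocabulary
        # Cases where no pairs are found would need to be guarded or filtered out, as below
        sF1 = 2 * rho_prime * recall / (rho_prime + recall)
        results.append({"min_count": mc, "win": win, "dim": dim, "sF1": sF1})
    return results

For the ppmi_svd models, the analogous loop invokes the hyperwords scripts for each parameter combination instead of training a gensim model.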

The testing

# Load libraries
library(readr)
library(tidyverse)

There are a few cases where the recall is 0 or only one pair is found, and hence Spearman’s rho, and by extension sF1, are not defined. We’ll remove those cases from further consideration.

# Held-out evaluation texts and the four similarity testsets
eval_texts <- list('sign4','jude','midmarch')
test_names <- list("ws353","ws353_similarity","ws353_relatedness","bruni_men")

# Load data
types_tokens <- read_delim("types_tokens.csv", "\t", escape_double = FALSE, trim_ws = TRUE) %>%
  mutate(ttr = types/tokens)   # type/token ratio
SVD_sim_tests <- read_delim("testsets_parameters/SVD-sim_tests.csv", "\t", escape_double = FALSE, trim_ws = TRUE) %>%
  mutate(method="ppmi_svd")
word2vec_sim_tests <- read_delim("testsets_parameters/word2vec-sim_tests.csv", "\t", escape_double = FALSE, trim_ws = TRUE) %>%
  mutate(method="word2vec")
FastText_sim_tests <- read_delim("testsets_parameters/FastText-sim_tests.csv", "\t", escape_double = FALSE, trim_ws = TRUE) %>%
  mutate(method="FastText")

# Combine the three methods, add the type/token information,
# and drop the cases where sF1 is undefined
sim_tests <- rbind(SVD_sim_tests,word2vec_sim_tests,FastText_sim_tests) %>%
  inner_join(types_tokens) %>%
  rename(min_count=thresh) %>%
  filter(! is.nan(sF1))

# Set aside the held-out evaluation texts
sim_evals <- sim_tests %>%
  filter(text %in% eval_texts)
sim_tests <- sim_tests %>%
  filter(!(text %in% eval_texts))

rm(SVD_sim_tests,word2vec_sim_tests,FastText_sim_tests)

We’ll look at the minimum count first.

sim_tests %>%
  ggplot() + theme_classic() + labs(title="min_count vs sF1") +
  geom_point(aes(min_count,sF1,color=text)) +
  geom_line(aes(min_count,sF1,color=text)) +
  facet_wrap(testset ~ method, ncol=3)