This is one of a series of posts. In this post I examine three aspects of word vectors to look for frequency effects: the vectors, the dimensions, and the nearest neighbors. The first and third aspects have been explored for other methods with a large corpus by [1]. The results here differ somewhat from those in [1], but the relation between vectors and frequency holds up.

- Vectors encode varying amounts of information about frequency (cf. [1])
  - ppmi encodes the most, followed by sgns; ft and glove vary more
  - the amount of information encoded varies by corpus
  - low frequency words encode very little frequency information, across methods
- Individual dimensions encode relatively little information about frequency
- The "power law" for nearest neighbors from [1] is mostly *not* reproduced with these methods and these smaller corpora


In [1]:

```
# imports
import numpy as np  # used below for the dimension tables
from dfewe import *
# for tables in Jupyter
from IPython.display import HTML, display
import tabulate
```

In [2]:

```
# some utilities
def show_title(t):
    display(HTML('<b>%s</b>' % t))

def show_table(data, headers, title):
    show_title(title)
    display(HTML(tabulate.tabulate(data, tablefmt='html', headers=headers)))
```

In [3]:

```
# set up standard corpora + vectors
vfair_all = Setup.make_standard_sampler_and_vecs('vfair', 5, 100, 1)    # window=5, dims=100, min_count=1
heartd_all = Setup.make_standard_sampler_and_vecs('heartd', 5, 100, 1)  # window=5, dims=100, min_count=1
what = [['Vanity Fair (vfair)'], ['Heart of Darkness (heartd)']]
for i, c in enumerate([vfair_all, heartd_all]):
    sampler = c['sampler']
    what[i].extend([sum(sampler.counts.values()), len(sampler.counts)])
show_table(what, headers=['Corpus', 'Tokens', 'Types'], title='Corpora sizes')
```

So far we have been looking at similarities and ranks, which involve two vectors at a time. Given that we have found frequency-related effects, from shifted similarity means to stratification of similarities and ranks, we should also ask whether there are frequency effects at the level of individual vectors, and even at the level of individual dimensions. [1] found two types of vector-level frequency effects, which we will expand on here. We'll also look at whether there are frequency effects for dimensions. The results here differ somewhat from those in [1], as I'll discuss below.

To see whether word vectors are related to the frequency of their associated words, [1] constructed a logistic prediction task "to put words either in a frequent or rare category" based on the vectors, and they found that a variety of methods (Glove is the only one that overlaps with the methods here) all gave positive results. In other words, they found that word frequency is related to (or encoded in, in their terms) the word vectors.
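
[1]'s classifier isn't reproduced in this post, but as a minimal sketch of that kind of test (assuming a `vectors` matrix with one row per word and a parallel `counts` array; neither is part of dfewe):

```
# A hedged sketch of the frequent-vs-rare prediction task in the spirit
# of [1]. `vectors` is an (n_words, n_dims) array and `counts` a parallel
# array of corpus counts; both are assumed inputs, not dfewe API.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def frequency_classification_accuracy(vectors, counts):
    """Cross-validated accuracy of predicting frequent (1) vs rare (0)."""
    counts = np.asarray(counts)
    labels = (counts > np.median(counts)).astype(int)  # median split
    model = LogisticRegression(max_iter=1000)
    # accuracy near 0.5 would mean the vectors carry no frequency signal
    return cross_val_score(model, vectors, labels, cv=5).mean()
```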

The approach here is somewhat simpler: we fit a linear model from the vectors to the frequency percentiles of the corresponding words. When we do that for *Vanity Fair* (*vfair*) we get the following.
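
The fitting itself happens inside `VectorCalcs.vectors_vs_percentile`, whose internals aren't shown in this post; the following is a minimal sketch of what such a fit can look like, with the same assumed `vectors` matrix and a `percentiles` array as inputs:

```
# A hedged sketch of regressing frequency percentiles on whole vectors.
# `vectors`: (n_words, n_dims) array; `percentiles`: frequency percentile
# of each word. Assumed inputs, not the dfewe API.
from sklearn.linear_model import LinearRegression

def vectors_to_percentile_r2(vectors, percentiles):
    """In-sample R^2 of a linear regression from vectors to percentiles."""
    model = LinearRegression().fit(vectors, percentiles)
    return model.score(vectors, percentiles)
```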

In [4]:

```
def show_vectors_vs_percentiles(combo, name, lower_percentile=0, upper_percentile=100):
    """
    Show results in a table
    """
    what, nwords = VectorCalcs.vectors_vs_percentile(combo, lower_percentile=lower_percentile, upper_percentile=upper_percentile)
    show_table(what, ['Method', 'R<sup>2</sup>'],
               'Linear regression for %d vectors in percentiles %d to %d in %s'
               % (nwords, lower_percentile, upper_percentile, name))

def compare_vectors_vs_percentiles(combo, name, ranges=[(0, 100)]):
    """
    Show a table with the R^2 values for the different ranges for each method in combo.
    Rows: range, nwords, methods...
    """
    what = []
    for lowerp, upperp in ranges:
        results, nwords = VectorCalcs.vectors_vs_percentile(combo, lower_percentile=lowerp, upper_percentile=upperp)
        what.append(['%d - %d' % (lowerp, upperp), nwords] + [r[1] for r in results])
    headers = ['Percentile range', 'Vocab', 'sgns', 'ft', 'glove', 'ppmi']
    show_table(what, headers, 'Linear regression R<sup>2</sup> for %s' % name)
```

In [5]:

```
compare_vectors_vs_percentiles(vfair_all,'vfair')
```

What we see is a bit different from the findings in [1], which are based on a much larger Wikipedia corpus with a vocabulary size of 103,000+ as opposed to the 15,000+ of vfair. In particular, sgns and ppmi do suggest a relation between a vector and the relative frequency of its word, but ft and especially glove show a much weaker one.

Since we have seen that very low frequency words have a disproportionate effect on similarities and ranks, we can probe a bit further by removing low frequency words and fitting the model to the remainder. Similarly, we can remove very high frequency words, or both.
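
For concreteness, here is one way the trimming could work; this is a sketch with an assumed `{word: count}` dict, not dfewe's implementation:

```
# A hedged sketch of restricting a vocabulary to a frequency-percentile
# range. `counts` is an assumed {word: count} dict, not dfewe API.
import numpy as np

def words_in_percentile_range(counts, lower=0, upper=100):
    """Keep words whose frequency percentile lies in [lower, upper]."""
    words = sorted(counts, key=counts.get)  # rarest first; ties arbitrary
    pct = 100.0 * np.arange(1, len(words) + 1) / len(words)
    return [w for w, p in zip(words, pct) if lower <= p <= upper]
```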

In [6]:

```
compare_vectors_vs_percentiles(vfair_all,'vfair',[(0,100),(1,100),(5,100),(0,99),(0,95),(1,99),(5,95)])
```

We find that removing the very low frequency words increases the $R^{2}$ across the board, quite dramatically for glove, such that all four methods seem to have a relation between vectors and relative frequency. In other words, most of the vocabulary (the low frequency items) does **not** encode (very much) information about frequency, but the remaining portion does, to varying degrees by method.

It isn't clear from [1] whether very low frequency words were omitted from their tests; if they were, that could explain the difference in results for glove. We can note that very high frequency words (<1% of the vocabulary) have a negligible effect on the results.

Repeating the tests for *Heart of Darkness* (*heartd*), we find that the results are *much weaker, except for ppmi*, which has an even higher $R^{2}$ than for vfair. This suggests that the size of the corpus is a contributing factor in how much frequency information is encoded in the vectors, whether more (for ppmi) or less (for the others). On the other hand, we do see the same pattern as in vfair: the very low frequency words do not encode very much frequency information.

In [7]:

```
compare_vectors_vs_percentiles(heartd_all,'heartd',[(0,100),(1,100),(5,100),(0,99),(0,95),(1,99),(5,95)])
```

Turning from whole vectors to individual dimensions, we can run the same kind of regression from each single dimension to the percentiles. Most dimensions turn out to encode very little frequency information, so the tables below show only those dimensions where at least *one* of the methods has an $R^{2}$ above a certain threshold.
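
The per-dimension fits are computed by `VectorCalcs.compare_dimensions_vs_percentiles`; the idea, sketched below with the same assumed inputs as above, is simply to regress the percentiles on each single column of the vector matrix:

```
# A hedged sketch of per-dimension fits: one-variable regression of the
# percentiles on each dimension separately. Assumed inputs as above.
import numpy as np
from sklearn.linear_model import LinearRegression

def per_dimension_r2(vectors, percentiles):
    """In-sample R^2 for each dimension on its own."""
    scores = []
    for j in range(vectors.shape[1]):
        col = vectors[:, j].reshape(-1, 1)  # a single dimension
        scores.append(LinearRegression().fit(col, percentiles).score(col, percentiles))
    return np.array(scores)
```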

In [8]:

```
def show_dimensions_vs_percentiles(combo, name, lower_percentile=0, upper_percentile=100, thresh=0.25):
    """
    Show a table of R^2 for each dimension and each method, for the words in the percentile range.
    Show only rows where at least one R^2 is above the threshold.
    """
    d, nwords = VectorCalcs.compare_dimensions_vs_percentiles(combo, lower_percentile=lower_percentile, upper_percentile=upper_percentile)
    npd = np.array(d)
    mask = (npd[:, 1] > thresh) | (npd[:, 2] > thresh) | (npd[:, 3] > thresh) | (npd[:, 4] > thresh)
    d = list(npd[mask])
    headers = ['Dimension', 'sgns', 'ft', 'glove', 'ppmi']
    title = ('Linear regression R<sup>2</sup> for each dimension in %s, %d words in range %d - %d. '
             'Threshold for one dimension is %0.3f' % (name, nwords, lower_percentile, upper_percentile, thresh))
    show_table(d, headers, title)
```

In [9]:

```
show_dimensions_vs_percentiles(vfair_all,'vfair', thresh=0.3)
show_dimensions_vs_percentiles(vfair_all,'vfair',lower_percentile=5,upper_percentile=100,thresh=0.3)
show_dimensions_vs_percentiles(heartd_all,'heartd',thresh=0.1)
show_dimensions_vs_percentiles(heartd_all,'heartd',lower_percentile=5,upper_percentile=100,thresh=0.1)
```
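
The last of the frequency effects to check involves nearest neighbors: [1] reported a "power law" relating a word's frequency to the ranks of its nearest neighbors. The next two cells run the corresponding comparison for our two corpora. The details live in `VectorCalcs.compare_ave_nn_ranks`; as a rough sketch of the kind of statistic involved (an assumption about the computation, not dfewe's actual method), we can average the frequency ranks of each word's k nearest cosine neighbors, to be set against the word's own frequency rank; a power law would then appear as a roughly linear relation on log-log axes:

```
# A hedged sketch (an assumption, not dfewe's actual method): for each
# word, average the frequency ranks of its k nearest neighbors by cosine
# similarity. `vectors` and per-word `freq_ranks` are assumed inputs.
import numpy as np

def ave_nn_freq_ranks(vectors, freq_ranks, k=10):
    """Mean frequency rank of each word's k nearest cosine neighbors."""
    unit = vectors / np.maximum(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-12)
    sims = unit @ unit.T                   # dense: fine for small vocabularies
    np.fill_diagonal(sims, -np.inf)        # exclude each word itself
    nn = np.argsort(-sims, axis=1)[:, :k]  # indices of the k nearest neighbors
    return np.asarray(freq_ranks)[nn].mean(axis=1)
```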

In [10]:

```
VectorCalcs.compare_ave_nn_ranks(vfair_all,'vfair')
```

In [11]:

```
VectorCalcs.compare_ave_nn_ranks(heartd_all,'heartd')
```

We can try removing the very low frequency words, as we've done before; the results are in the two cells below. vfair shows (roughly) linear relations for sgns and glove, and that's about it; heartd also shows a roughly linear relation for sgns, but nothing else.

In short, the nearest neighbor results from [1] are mostly not reproduced here, with much smaller corpora and (mostly) different methods. While it might be worth exploring this approach further, it does not seem to show an inherent property of either language or the word embedding methods.

In [12]:

```
VectorCalcs.compare_ave_nn_ranks(vfair_all,'vfair', lower_percentile=5, upper_percentile=100)
```

In [13]:

```
VectorCalcs.compare_ave_nn_ranks(heartd_all,'heartd', lower_percentile=5, upper_percentile=100)
```