Distributional and frequency effects in word embeddings: Large corpora

© 2018 Chris Culy, August 2018

chrisculy.net

Overview

This is one of a series of posts extending the earlier posts on frequency effects to more embedding techniques and more corpora. In this post I look at some embeddings based on large corpora, embeddings which have been widely used as sample testbeds. In particular, I will be examining the Google News embeddings, the GLoVe embeddings, and the FastText English embeddings.

Results and contributions

  • new Frequency encoding is stronger for the larger corpora than for Vanity Fair
  • new Frequency stratification tends to be stronger for Vanity Fair than for the larger corpora (except in the case of GLoVe)
  • new Frequency stratification tends to be direct for the larger corpora and indirect for Vanity Fair (except in the case of GLoVe)
  • new Powerlaw for nearest neighbors is stronger for the larger corpora than for Vanity Fair
  • Similarity skewness is moderate

Download as Jupyter notebook

Download supplemental Python code

Download summary test Python code


In [1]:
%load_ext autoreload
%autoreload 2

#imports
from dfewe import *
from dfewe_nb.nb_utils import *
from dfewe_nb.freq_tests import run_tests as testfs
In [2]:
#to free up memory
def delete_gn():
    if 'gn_vecs' in globals():
        global gn_vecs, gn_sampler
        del gn_vecs
        del gn_sampler

def delete_glove():
    if 'glove_vecs' in globals():
        global glove_vecs, glove_sampler
        del glove_vecs
        del glove_sampler

def delete_ft():
    if 'ft_vecs' in globals():
        global ft_vecs, ft_sampler
        del ft_vecs
        del ft_sampler
In [3]:
#set up standard corpora + vectors
vfair_all = Setup.make_standard_sampler_and_vecs('vfair',5,100,1) #window=5, dims=100, min_count=1
heartd_all = Setup.make_standard_sampler_and_vecs('heartd',5,100,1) #window=5, dims=100, min_count=1

what = [['Vanity Fair (vfair)'],['Heart of Darkness (heartd)']]
for i,c in enumerate([vfair_all,heartd_all]):
    sampler = c['sampler']
    what[i].extend([sum(sampler.counts.values()), len(sampler.counts)])

show_table(what, headers=['Corpus','Tokens','Types'], title="Corpora sizes")
Corpora sizes
Corpus Tokens Types
Vanity Fair (vfair) 310722 15803
Heart of Darkness (heartd) 38897 5420

Estimating frequencies

Unfortunately for us, pretrained word embeddings do not typically provide detailed information about the frequencies of the items. (In fact, I have not come across any that do.) At best, published information describes the corpus and a bit about how the embeddings were created.

However, we can make a crude estimate of the word frequencies by using a simplified version of Zipf's law, which says that the frequency of the i-th ranked word is roughly proportional to the inverse of the rank.

$$freq(i) ≅ \frac{k}{i}$$

While there are lots of issues with Zipf's law, it will have to do for our purposes. The trick is how to calculate k, since it varies from one corpus to another. For the rank, the words in the pretrained embeddings appear to be ordered by their frequency, so we can get the rank directly from the embedding. We also need a frequency estimate for one word in order to calculate k. Given the difference between the actual distribution of words and Zipf's law, a medium-to-high ranked word would work best, but since this approach is so crude, it doesn't matter much.

For Google News, the description says that the minimum frequency of words included is 5, so we can calculate k from a low ranked word, e.g. RAFFAELE (rank = 2,999,996):

$$k_{GN} = i * freq(i) ⇒ k_{GN} = 2,999,996 * 5 ⇒ k_{GN} = 14,999,980$$

For GLoVe and FastText, we'll proceed a bit differently, since we do not have any frequency information at all about the words included. From the Google ngram viewer, we can get the relative frequency of dog in the year 2000, which is 0.0040587344%. Since the description of GLoVe says it was trained on approximately 6,000,000,000 tokens, the frequency of dog (rank = 2927) in the GLoVe vectors is roughly 0.0040587344% * 6,000,000,000 = 243,524. So we have:

$$k_G = i * freq(i) ⇒ k_G = 2927 * 243,524 ⇒ k_G = 712,794,748$$

For FastText, with a corpus of 16,000,000,000 tokens, we get 0.0040587344% * 16,000,000,000 = 649,398, and so, with rank(dog) = 2370:

$$k_{FT} = i * freq(i) ⇒ k_{FT} = 2370 * 649,398 ⇒ k_{FT} = 1,539,073,260$$

Alternatively, since the bulk of the FastText corpus is from Wikipedia, we can use the first 1B words of Wikipedia (helpful instructions here) to estimate the frequency of dog. When we do that, we get 800,246. So k' (which is the value used here) is:

$$k_{FT}' = i * freq(i) ⇒ k_{FT}' = 2370 * 800,246 ⇒ k_{FT}' = 1,896,583,020$$

The discrepancy between the estimates shows just how crude they are.
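
These calculations are easy to package into a small helper. Here is a minimal sketch (my own, not part of the supplemental code), assuming gensim-style KeyedVectors where vecs.vocab[w].index gives the 0-based frequency rank of w; the anchor ranks and counts are the ones worked out above.

# Minimal sketch of the Zipf-based frequency estimate described above.
# Assumes gensim 3.x KeyedVectors, where vecs.vocab[w].index is the 0-based
# frequency rank of w (the pretrained vocabularies appear to be sorted by frequency).

def zipf_k(anchor_rank, anchor_freq):
    """k = i * freq(i), computed from one word whose frequency we can estimate."""
    return anchor_rank * anchor_freq

def estimate_freq(vecs, word, k):
    """Crude Zipf estimate: freq(i) ≈ k / i, with i the 1-based rank of word."""
    rank = vecs.vocab[word].index + 1
    return k / rank

# The anchors used in the text (all of them rough estimates):
k_gn = zipf_k(2999996, 5)        # Google News: RAFFAELE, min_count = 5
k_glove = zipf_k(2927, 243524)   # GLoVe: dog, via the Google ngram viewer
k_ft = zipf_k(2370, 800246)      # FastText: dog, via the first 1B words of Wikipedia

# e.g. estimate_freq(glove_vecs, 'cat', k_glove)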

In [4]:
def show_first(vecs, name, n=30):
    show_table([[i,x] for i,x in enumerate(vecs.index2entity[:n])],[],'First %d words from %s' % (n,name))
    
def test_common_words(vecs,name):
    testwds = ['a','an','the','about','from','in','of','to','out','up','very']
    d = []
    for w in testwds:
        if w in vecs.vocab:
            win = 'True'
        else:
            win = '<b>False</b>'            
        d.append([w, win])
        
        w = w.capitalize()
        if w in vecs.vocab:
            win = 'True'
        else:
            win = '<b>False</b>'
        d.append([w, win])

    show_table(d,['Word','In %s' % name],'Test of some English words in %s' % name)


def test_non_english_words(vecs,name,topn=5,lowercase=False):
    testwds = ['English','butterfly','cat','dog','the',
               'français','papillon','chat','chien','le',
               'Deutsch','Schmetterling','Katze','Hund','der',
               'italiano','farfalla','gatto','cane','il',
               'español','mariposa','gato','perro','el']
    
    if lowercase:
        testwds = [w.lower() for w in testwds]

    d = []
    for w in testwds:
        if w in vecs.vocab:
            sims = [w, ', '.join([x[0] for x in vecs.similar_by_word(w,topn=topn)])]

        else:
            sims = [w,'N.A.']
        d.append(sims)

    show_table(d,['Word','%d Most similar' % topn],
               'Test of possible non-English words in %s' % name)

Other issues with pretrained embeddings

There are a variety of other issues with many pretrained embeddings. One issue is that some of them (e.g. Google News and GLoVe) include phrases in addition to words. I have filtered those out here.
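
The filtering itself is simple. Here is a minimal sketch (my own illustration, not the filtering code actually used here), assuming that phrases are marked by joining their words with an underscore, as in the Google News vectors:

# Minimal sketch of dropping phrase entries (e.g. 'New_York') from a vocabulary.
# Assumes phrases are joined with '_', as in the Google News vectors.
def word_items(vecs):
    """Return the vocabulary items that look like single words (no '_')."""
    return [w for w in vecs.index2entity if '_' not in w]

# e.g. len(word_items(gn_vecs)) compared with len(gn_vecs.vocab)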

Another issue is that there are often some non-word items included, such as punctuation. Here are the top 30 items for FastText. (Information for the other two corpora is in the appendix.)

In [5]:
ft_vecs, ft_sampler = Setup.setup_FT_English()
In [6]:
print("Vocabulary: %d\tDimensions: %d" % (len(ft_vecs.vocab), ft_vecs.vector_size))
Vocabulary: 999994	Dimensions: 300
In [7]:
show_first(ft_vecs,'FastText English',30)
First 30 words from FastText English
0   ,
1   the
2   .
3   and
4   of
5   to
6   in
7   a
8   "
9   :
10  )
11  that
12  (
13  is
14  for
15  on
16  *
17  with
18  as
19  it
20  The
21  or
22  was
23  '
24  's
25  by
26  from
27  at
28  I
29  this

The embeddings also differ in terms of whether the items are case sensitive, and even in which words are included: Google News does not include of, though it does include Of (see the appendix). The following table shows that FastText is case sensitive.

In [8]:
test_common_words(ft_vecs,'FastText English')
Test of some English words in FastText English
Word In FastText English
a True
A True
an True
An True
the True
The True
about True
About True
from True
From True
in True
In True
of True
Of True
to True
To True
out True
Out True
up True
Up True
very True
Very True

Yet another issue to be aware of is the presence of non-English words. Here we have some results for FastText English, testing translation equivalents in English, French, German, Italian, and Spanish. There are quite a number of non-English words. (Again, information for the other two corpora is in the appendix.)

In [9]:
test_non_english_words(ft_vecs, 'FastText English')
Test of possible non-English words in FastText English
Word 5 Most similar
English French, Engish, Spanish, Enlgish, english
butterfly butterflies, Butterfly, dragonfly, nymphalid, caterpillar
cat cats, feline, kitten, Cat, felines
dog dogs, puppy, Dog, canine, Mixed-breed
the of, a, on, in, to
français Français, francais, Parlez-vous, française, populaire
papillon bichon, pinscher, papillons, chien, shih-tzu
chat chats, chatting, Chat, chatroom, chatters
chien lapin, papillon, oiseau, coq, voleur
le du, Le, la, les, au
Deutsch Hönigsberg, Evern, Deutch, sprechen, Schneider
Schmetterling blüht, Mondnacht, fliegende, Kiebitz, Ankunft
Katze Ploegh, Furcht, fliegt, tanzt, Schafe
Hund Dogge, Hunden, Hunde, hund, Wehe
der Der, und, von, zur, zum
italiano italiani, linguaggio, inglese, enciclopedico, progetto
farfalla N.A.
gatto papà, miele, prete, ragazzo, tocca
cane canes, sugarcane, sugar, sugar-cane, Cane
il Il, miglior, faut, sorpasso, mostro
español castellano, Español, inglés, espanol, española
mariposa silverspot, dorada, monardella, araña, mariposas
gato perro, conejo, perra, ratón, león
perro ratón, perra, conejo, león, pequeño
el El, del, chapo, campeador, al

The properties

Continuing with FastText, we can look for frequency and distributional effects using the summary tests from the previous post. First, the test results in the table. Unfortunately, the tests for stratification of rank and of reciprocity take an extremely long time for large vocabularies, so I will omit them here.

In [10]:
smplr = ft_sampler
vs = ft_vecs
name = 'FastText'
tests = ['vfreq','sksim','stfreq'] #,'strank','strecip']
testfs(name,smplr,vs,tests=tests)
Testing Vectors ∝ freqs
Testing Vectors ∝ non-v. low freqs
Testing Vectors ∝ non-low freqs
Testing Skewed sims
Testing Stratification of freq
Summary of possible frequency effects for FastText
Aspect Result Details
Vectors ∝ freqs strong percentiles 0-100, R2 = 0.7724
Vectors ∝ non-v. low freqs strong percentiles 1-100, R2 = 0.7836
Vectors ∝ non-low freqs strong percentiles 5-100, R2 = 0.7924
Skewed sims moderate mean = 0.2706, variance = 0.0103
Stratification of freq moderate, direct R2 = 0.3538
Regression coefficient: c = 0.0003


We see strong results for the encoding of frequency in the vectors, but only moderate skewing of similarities and stratification of frequencies. When we compare these results with the results of FastText with Vanity Fair, we see that the relative strengths are reversed, as is the direction of stratification:

In [11]:
smplr = vfair_all['sampler']
vs = vfair_all['ft']
name = 'Vanity Fair with FastText'
tests = ['vfreq','sksim','stfreq'] #,'strank','strecip']
testfs(name,smplr,vs,tests=tests)
Testing Vectors ∝ freqs
Testing Vectors ∝ non-v. low freqs
Testing Vectors ∝ non-low freqs
Testing Skewed sims
Testing Stratification of freq
Summary of possible frequency effects for Vanity Fair with FastText
Aspect Result Details
Vectors ∝ freqs moderate percentiles 0-100, R2 = 0.5253
Vectors ∝ non-v. low freqs moderate percentiles 1-100, R2 = 0.6722
Vectors ∝ non-low freqs moderate percentiles 5-100, R2 = 0.7319
Skewed sims strong mean = 0.9307, variance = 0.0020
Stratification of freq strong, inverse R2 = 0.9124
Regression coefficient: c = -0.0014


Next, the visualization-based tests. Again, the large vocabulary poses challenges, this time for the power law test, so we'll use a stratified (by percentile) sample of the vocabulary rather than the whole vocabulary.
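
To make the idea concrete, here is a minimal sketch of percentile-stratified sampling (my own, not the sampling code used by the test suite), assuming the vocabulary is ordered by decreasing frequency so that each percentile is a contiguous slice of the rank-ordered word list:

import random

def stratified_sample(vecs, per_stratum=100, n_strata=100, seed=42):
    # vecs.index2entity is assumed to be ordered by decreasing frequency
    rng = random.Random(seed)
    words = vecs.index2entity
    stratum_size = len(words) // n_strata
    sample = []
    for i in range(n_strata):
        stratum = words[i * stratum_size:(i + 1) * stratum_size]
        sample.extend(rng.sample(stratum, min(per_stratum, len(stratum))))
    return sample

# e.g. sample = stratified_sample(ft_vecs)  # up to 100 words from each percentile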

In [12]:
smplr = ft_sampler
vs = ft_vecs
name = 'FastText'
tests = ['vpowers','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
Summary of possible frequency effects for FastText

Nearest neighbor frequency power law (stratified sampling)
Dot products of sims with mean by frequency
Dimension values

For the large-scale FastText English vectors, we see a solid powerlaw relation for the k-nearest neighbors, and the dot product trend is the same as that observed in [4]. The most striking result is the dimension values: they are all tightly clustered around 0, unlike any of the other vectors, which show much greater dispersion.
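
That clustering is easy to check directly. A minimal sketch (my own, assuming vecs.vectors holds the raw embedding matrix, as in recent gensim KeyedVectors):

import numpy as np

def dimension_spread(vecs):
    # Summarize how dispersed the individual dimension values are
    mat = np.asarray(vecs.vectors)
    return {'mean': float(mat.mean()), 'std': float(mat.std()),
            'min': float(mat.min()), 'max': float(mat.max())}

# e.g. dimension_spread(ft_vecs)  # FastText values sit in a narrow band around 0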

For comparison, we have FastText used with Vanity Fair, where the power law is not particularly evident, and it's an inverse relation, unlike with other vectors.

In [13]:
smplr = vfair_all['sampler']
vs = vfair_all['ft']
name = 'Vanity Fair with FastText'
tests = ['vpower','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
Summary of possible frequency effects for Vanity Fair with FastText

Nearest neighbor frequency power law
Dot products of sims with mean by frequency
Dimension values

The hub tests also take a long time, so I will omit them as well.

In [14]:
#smplr = ft_sampler
#vs = ft_vecs
#name = 'FastText'
#tests = ['hubs','hubp']
#testfs(name,smplr,vs,tests=tests)
In [15]:
delete_ft()

Next up is Google News, which uses a cbow approach. Here the frequency encoding is weak to moderate, as is the stratification of frequencies.

In [16]:
gn_vecs, gn_sampler = Setup.setup_GoogleNews()
In [17]:
print("Vocabulary: %d\tDimensions: %d" % (len(gn_vecs.vocab), gn_vecs.vector_size))
Vocabulary: 3000000	Dimensions: 300
In [18]:
smplr = gn_sampler
vs = gn_vecs
name = 'Google News'
tests = ['vfreq','sksim','stfreq'] #,'strank','strecip']
testfs(name,smplr,vs,tests=tests)
Testing Vectors ∝ freqs
Testing Vectors ∝ non-v. low freqs
Testing Vectors ∝ non-low freqs
Testing Skewed sims
Testing Stratification of freq
Summary of possible frequency effects for Google News
Aspect Result Details
Vectors ∝ freqs weak percentiles 0-100, R2 = 0.1173
Vectors ∝ non-v. low freqs moderate percentiles 1-100, R2 = 0.4729
Vectors ∝ non-low freqs moderate percentiles 5-100, R2 = 0.5626
Skewed sims moderate mean = 0.1047, variance = 0.0100
Stratification of freq moderate, direct R2 = 0.4672
Regression coefficient: c = 0.0002


Since the Google News vectors were created with the cbow version of word2vec, we can compare them to the cbow vectors for Vanity Fair. The results are similar, except for the direction of the frequency stratification, which is direct for Google News but inverse for Vanity Fair.

In [19]:
vfair_all['cbow'] = Setup.make_vecs('cbow', vfair_all['sampler'].sents, 1,5,100,init_sims=True) #window=5, dims=100, min_count=1
smplr = vfair_all['sampler']
vs = vfair_all['cbow']
name = 'Vanity Fair with cbow'
tests = ['vfreq','sksim','stfreq'] #,'strank','strecip']
testfs(name,smplr,vs,tests=tests)
Testing Vectors ∝ freqs
Testing Vectors ∝ non-v. low freqs
Testing Vectors ∝ non-low freqs
Testing Skewed sims
Testing Stratification of freq
Summary of possible frequency effects for Vanity Fair with cbow
Aspect Result Details
Vectors ∝ freqs weak percentiles 0-100, R2 = 0.0337
Vectors ∝ non-v. low freqs weak percentiles 1-100, R2 = 0.1166
Vectors ∝ non-low freqs moderate percentiles 5-100, R2 = 0.5234
Skewed sims strong mean = 0.7506, variance = 0.0693
Stratification of freq moderate, inverse R2 = 0.3965
Regression coefficient: c = -0.0009


We can now turn to the visual results, where we see a fairly good powerlaw relation. The dot product trend is also similar to what we saw with sgns.

In [20]:
smplr = gn_sampler
vs = gn_vecs
name = 'Google News'
tests = ['vpowers','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
Summary of possible frequency effects for Google News

Nearest neighbor frequency power law (stratified sampling)
Dot products of sims with mean by frequency
Dimension values

When we compare Google News with Vanity Fair, we see that Vanity Fair does not have a great powerlaw relationship, and the dot product trend is not as clear as it is with Google News.

In [21]:
smplr = vfair_all['sampler']
vs = vfair_all['cbow']
name = 'Vanity Fair with cbow'
tests = ['vpower','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
Summary of possible frequency effects for Vanity Fair with cbow

Nearest neighbor frequency power law
Dot products of sims with mean by frequency
Dimension values

In [22]:
#smplr = gn_sampler
#vs = gn_vecs
#name = 'Google News'
#tests = ['hubs','hubp']
#testfs(name,smplr,vs,tests=tests)
In [23]:
delete_gn()

Finally, we turn to GLoVe. It shows a strong encoding of frequency, but only a moderate skewing of similarities, and a weak, direct, frequency stratification. A similar pattern is seen with GLoVe vectors for Vanity Fair below, though the encoding of frequency is more moderate.

In [24]:
glove_vecs,glove_sampler = Setup.setup_Glove_pre(100)
In [25]:
print("Vocabulary: %d\tDimensions: %d" % (len(glove_vecs.vocab), glove_vecs.vector_size))
Vocabulary: 400000	Dimensions: 100
In [26]:
smplr = glove_sampler
vs = glove_vecs
name = 'GLoVe'
tests = ['vfreq','sksim','stfreq'] #,'strank','strecip']
testfs(name,smplr,vs,tests=tests)
Testing Vectors ∝ freqs
Testing Vectors ∝ non-v. low freqs
Testing Vectors ∝ non-low freqs
Testing Skewed sims
Testing Stratification of freq
Summary of possible frequency effects for GLoVe
Aspect Result Details
Vectors ∝ freqs strong percentiles 0-100, R2 = 0.8071
Vectors ∝ non-v. low freqs strong percentiles 1-100, R2 = 0.8346
Vectors ∝ non-low freqs strong percentiles 5-100, R2 = 0.8891
Skewed sims moderate mean = 0.1318, variance = 0.0299
Stratification of freq weak, direct R2 = 0.0594
Regression coefficient: c = 0.0004


In [27]:
smplr = vfair_all['sampler']
vs = vfair_all['glove']
name = 'Vanity Fair with glove'
tests = ['vfreq','sksim','stfreq'] #,'strank','strecip']
testfs(name,smplr,vs,tests=tests)
Testing Vectors ∝ freqs
Testing Vectors ∝ non-v. low freqs
Testing Vectors ∝ non-low freqs
Testing Skewed sims
Testing Stratification of freq
Summary of possible frequency effects for Vanity Fair with glove
Aspect Result Details
Vectors ∝ freqs moderate percentiles 0-100, R2 = 0.2537
Vectors ∝ non-v. low freqs moderate percentiles 1-100, R2 = 0.4524
Vectors ∝ non-low freqs strong percentiles 5-100, R2 = 0.7749
Skewed sims weak mean = 0.0694, variance = 0.0489
Stratification of freq moderate, direct R2 = 0.6367
Regression coefficient: c = 0.0026


In the last comparison, we have the visual results. The powerlaw is fairly good, and the dot product trend confirms the result in [4]. However, when we look at Vanity Fair, the dot product trend is more like what we see with sgns and cbow, not what we see with the large corpus GLoVe vectors.

In [28]:
smplr = glove_sampler
vs = glove_vecs
name = 'GLoVe'
tests = ['vpowers','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
Summary of possible frequency effects for GLoVe

Nearest neighbor frequency power law (stratified sampling)
Dot products of sims with mean by frequency
Dimension values

In [29]:
smplr = vfair_all['sampler']
vs = vfair_all['glove']
name = 'Vanity Fair with glove'
tests = ['vpower','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
Summary of possible frequency effects for Vanity Fair with glove

Nearest neighbor frequency power law
Dot products of sims with mean by frequency
Dimension values

In [30]:
#smplr = glove_sampler
#vs = glove_vecs
#name = 'GLoVe'
#tests = ['hubs','hubp']
#testfs(name,smplr,vs,tests=tests)
In [31]:
delete_glove()

To sum up, here we have a summary of the summaries:

                | FastText                           | cbow                                 | GLoVe
                | large            | vfair           | large            | vfair             | large                 | vfair
freq encoded    | strong           | moderate        | moderate         | weak+             | strong                | moderate+
skewed sims     | moderate         | strong          | moderate         | moderate          | moderate              | weak
freq stratified | moderate, direct | strong, inverse | moderate, direct | moderate, inverse | weak, direct          | moderate, direct
powerlaw        | good             | inverse         | good             | so-so             | good                  | so-so
dot product     | decreasing pos   | decreasing pos  | decreasing pos   | mixed             | decreasing pos to neg | decreasing pos to neg

In addition, we saw that FastText English had an unusual distribution of dimension values, clustered tightly around 0.

Some patterns among the summaries:

  • Frequency encoding is stronger for the larger corpora than for Vanity Fair
  • Frequency stratification tends to be stronger for Vanity Fair than for the larger corpora (except in the case of GLoVe)
  • Frequency stratification tends to be direct for the larger corpora and indirect for Vanity Fair (except in the case of GLoVe)
  • Powerlaw for nearest neighbors is stronger for the larger corpora than for Vanity Fair

Finally, we can note that overall there is only moderate skewing of similarities, even though that skewing is what prompted this investigation in the first place.


References

[1] Google News: https://code.google.com/archive/p/word2vec/, published as Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States. 3111–3119.

[2] GLoVe: https://nlp.stanford.edu/projects/glove/, published as Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2014). 1532–1543.

[3] FastText (English): https://fasttext.cc/docs/en/english-vectors.html, published as Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

[4] David Mimno and Laure Thompson. 2017. The strange geometry of skip-gram with negative sampling. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2873–2878.

 

Appendix

Information about items in Google News embeddings

In [32]:
gn_vecs, gn_sampler = Setup.setup_GoogleNews()
In [33]:
show_first(gn_vecs,'GoogleNews',30)
First 30 words from GoogleNews
0
1   in
2   for
3   that
4   is
5   on
6   ##
7   The
8   with
9   said
10  was
11  the
12  at
13  not
14  as
15  it
16  be
17  from
18  by
19  are
20  I
21  have
22  he
23  will
24  has
25  ####
26  his
27  an
28  this
29  or
In [34]:
test_common_words(gn_vecs,'GoogleNews')
Test of some English words in GoogleNews
Word In GoogleNews
a False
A True
an True
An True
the True
The True
about True
About True
from True
From True
in True
In True
of False
Of True
to False
To True
out True
Out True
up True
Up True
very True
Very True
In [35]:
test_non_english_words(gn_vecs,'GoogleNews')
Test of possible non-English words in GoogleNews
Word 5 Most similar
English english, Engish, Funeral_Home_Oakmont, Malaya_Gruzinskaya_M._Barrikadnaya, language
butterfly backstroke, endangered_Karner_blue, Diana_Fritillary, yard_backstroke, yard_freestyle
cat cats, dog, kitten, feline, beagle
dog dogs, puppy, pit_bull, pooch, cat
the this, in, that, ofthe, another
français française, canadien, francais, canadienne, n'est_pas
papillon sheltie, schipperke, bichon, Miniature_Pinscher, standard_poodle
chat chats, chatting, Chat, chatted, chit_chat
chien chu, tien, 颜, Chen_Chih, 吴
le à, l', les, du, qui
Deutsch Weiss, Rosen, Stein, Siegel, Klein
Schmetterling Hanski, Fryda, Kausch, Bonapace, Gerald_Mayr
Katze Margolskee, Velicer, Thummel, Varki, Kubanek
Hund Wettstein, Ihrke, Hoeschen, Holschbach, Knoke
der und, ein, zum, zu, eine
italiano que_ha, cómo, del_mundo, gioco, completa
farfalla N.A.
gatto N.A.
cane canes, Radoslovich_juggles, fireplace_tongs, sugarcane, walker
il Il, su, sul, di, nel
español hablar, hablan, palabras, idioma, ¿_Qué
mariposa palo, azul, arriba, gente, niña
gato perro, trabajo, buena, ojos, arriba
perro gato, es_muy, ¿_Qué, quiero, mujer
el El, al, fuera_de_las, se_debe, trabajo
In [36]:
delete_gn()

Information about items in GLoVe embeddings

In [38]:
glove_vecs,glove_sampler = Setup.setup_Glove_pre(100)
In [39]:
show_first(glove_vecs,'glove',30)
First 30 words from glove
0   the
1   ,
2   .
3   of
4   to
5   and
6   in
7   a
8   "
9   's
10  for
11  -
12  that
13  on
14  is
15  was
16  said
17  with
18  he
19  as
20  it
21  by
22  at
23  (
24  )
25  from
26  his
27  ''
28  ``
29  an
In [40]:
test_common_words(glove_vecs,'Glove')
Test of some English words in Glove
Word In Glove
a True
A False
an True
An False
the True
The False
about True
About False
from True
From False
in True
In False
of True
Of False
to True
To False
out True
Out False
up True
Up False
very True
Very False
In [41]:
test_non_english_words(glove_vecs,'Glove', lowercase=True)
Test of possible non-English words in Glove
Word 5 Most similar
english welsh, language, irish, scottish, british
butterfly 200m, medley, 200-meter, breaststroke, backstroke
cat dog, rabbit, cats, monkey, pet
dog cat, dogs, pet, puppy, horse
the this, part, one, of, same
français francais, collège, stade, théâtre, artistes
papillon dernier, sauvage, hommes, mystère, chevalier
chat chats, forums, messaging, web, chatting
chien chao, chih, shih, kuo, huang
le du, petit, monde, la, mans
deutsch litt, heller, fleischman, vogel, rizzo
schmetterling N.A.
katze tyska, fette, gewalt, hetu, moderados
hund corbeau, slan, holdstock, crazylegs, paschen
der und, van, deutschen, den, das
italiano audax, mobiliare, commerciale, colo, credito
farfalla N.A.
gatto boye, caro, recio, flavus, cuore
cane sugarcane, bamboo, sugar, coconut, banana
il jong, kim, nam, yong, giornale
español espanol, obrero, atlético, tigre, españa
mariposa tuolumne, comal, mendocino, bayou, camas
gato lukamba, destino, cerebro, hombre, perro
perro árbol, cuerpo, traje, fantasma, hijo
el del, en, la, al, de
In [42]:
delete_glove()