This is one of a series of posts extending the earlier posts on frequency effects to more embedding techniques and more corpora. In this post I look at some embeddings trained on large corpora, embeddings which have been widely used as test beds. In particular, I will be examining the Google News embeddings, the GLoVe embeddings, and the FastText English embeddings.
%load_ext autoreload
%autoreload 2
#imports
from dfewe import *
from dfewe_nb.nb_utils import *
from dfewe_nb.freq_tests import run_tests as testfs
#to free up memory
def delete_gn():
    if 'gn_vecs' in globals():
        global gn_vecs, gn_sampler
        del gn_vecs
        del gn_sampler
def delete_glove():
    if 'glove_vecs' in globals():
        global glove_vecs, glove_sampler
        del glove_vecs
        del glove_sampler
def delete_ft():
    if 'ft_vecs' in globals():
        global ft_vecs, ft_sampler
        del ft_vecs
        del ft_sampler
#set up standard corpora + vectors
vfair_all = Setup.make_standard_sampler_and_vecs('vfair',5,100,1) #window=5, dims=100, min_count=1
heartd_all = Setup.make_standard_sampler_and_vecs('heartd',5,100,1) #window=5, dims=100, min_count=1
what = [['Vanity Fair (vfair)'],['Heart of Darkness (heartd)']]
for i,c in enumerate([vfair_all,heartd_all]):
    sampler = c['sampler']
    what[i].extend([sum(sampler.counts.values()), len(sampler.counts)])
show_table(what, headers=['Corpus','Tokens','Types'], title="Corpora sizes")
Unfortunately for us, pretrained word embeddings do not typically provide detailed information about the frequencies of the items. (In fact, I have not come across any that do.) At best, published information describes the corpus and a bit about how the embeddings were created.
However, we can make a crude estimate of the word frequencies by using a simplified version of Zipf's law, which says that the frequency of the i-th ranked word is roughly proportional to the inverse of the rank.
$$freq(i) \approx \frac{k}{i}$$

While there are lots of issues with Zipf's law, it will have to do for our purposes. The trick is how to calculate k, since it varies from one corpus to another. As for the rank, the words in these embeddings appear to be ordered by frequency, so we can read the rank directly off the embedding. We then need a frequency estimate for one word in order to calculate k. Given how far the actual distribution of words departs from Zipf's law, a medium-high ranked word would work best, but since this approach is so crude, it doesn't matter much.
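Concretely, a word's rank can be read straight off the embedding's frequency-ordered vocabulary. The one-liner below is only an illustration (the word_rank name is mine), using the same pre-4.0 gensim KeyedVectors API as the helper functions later in this post:
#1-based rank of a word, read off the embedding's frequency-ordered vocabulary
#(illustrative helper, not part of the dfewe package)
def word_rank(vecs, word):
    return vecs.vocab[word].index + 1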
For Google News, the description says that the minimum frequency of included words is 5, so we can calculate k from a low-ranked word, e.g. RAFFAELE (rank = 2,999,996):

$$k_{GN} = i \times freq(i) \Rightarrow k_{GN} = 2,999,996 \times 5 \Rightarrow k_{GN} = 14,999,980$$

For GLoVe and FastText, we'll proceed a bit differently, since we have no frequency information at all about the included words. From the Google Ngram Viewer, we can get the relative frequency of dog in the year 2000, which is 0.0040587344%. Since the description of GLoVe says the corpus had approximately 6,000,000,000 tokens, the frequency of dog (rank = 2927) in the GLoVe vectors is roughly 0.0040587344% × 6,000,000,000 = 243,524. So we have:

$$k_{G} = i \times freq(i) \Rightarrow k_{G} = 2927 \times 243,524 \Rightarrow k_{G} = 712,794,748$$

For FastText, with a corpus of 16,000,000,000 tokens, we get 0.0040587344% × 16,000,000,000 = 649,398, and so, with rank(dog) = 2370:

$$k_{FT} = i \times freq(i) \Rightarrow k_{FT} = 2370 \times 649,398 \Rightarrow k_{FT} = 1,539,073,260$$

Alternatively, since the bulk of the FastText corpus is from Wikipedia, we can use the first 1B words of Wikipedia (helpful instructions here) to estimate the frequency of dog. When we do that, we get 800,246. So k' (which is what is used here) is:

$$k_{FT}' = i \times freq(i) \Rightarrow k_{FT}' = 2370 \times 800,246 \Rightarrow k_{FT}' = 1,896,583,020$$

The discrepancy between the two estimates shows just how crude they are.
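To make the arithmetic concrete, here is a minimal sketch of these estimates as code. The zipf_k and zipf_freq helpers and the hard-coded anchor points are purely illustrative restatements of the numbers above; they are not part of the dfewe package.
#crude Zipf-based constants, restating the arithmetic above (illustrative only)
def zipf_k(rank, freq):
    #estimate k from one (rank, frequency) anchor point
    return rank * freq
def zipf_freq(rank, k):
    #estimate the frequency of the word at a given rank, assuming freq(i) = k / i
    return k / rank
ngram_rel_freq = 0.0040587344 / 100                    #relative frequency of 'dog' in 2000
k_gn = zipf_k(2999996, 5)                              #Google News: 14,999,980
k_glove = zipf_k(2927, round(ngram_rel_freq * 6e9))    #GLoVe: 712,794,748
k_ft = zipf_k(2370, round(ngram_rel_freq * 16e9))      #FastText via ngrams: 1,539,073,260
k_ft2 = zipf_k(2370, 800246)                           #FastText via Wikipedia 1B: 1,896,583,020
#e.g., the estimated count for the word at rank 10,000 in each vocabulary
for name, k in [('Google News', k_gn), ('GLoVe', k_glove), ('FastText', k_ft2)]:
    print('%s: ~%d' % (name, zipf_freq(10000, k)))
Nothing below depends on these helpers; they just make the constants reproducible.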
#show the first n items of an embedding's vocabulary
def show_first(vecs, name, n=30):
    show_table([[i,x] for i,x in enumerate(vecs.index2entity[:n])],[],'First %d words from %s' % (n,name))
#check whether some common English words, and their capitalized forms, are in the vocabulary
def test_common_words(vecs,name):
    testwds = ['a','an','the','about','from','in','of','to','out','up','very']
    d = []
    for w in testwds:
        if w in vecs.vocab:
            win = 'True'
        else:
            win = '<b>False</b>'
        d.append([w, win])
        w = w.capitalize()
        if w in vecs.vocab:
            win = 'True'
        else:
            win = '<b>False</b>'
        d.append([w, win])
    show_table(d,['Word','In %s' % name],'Test of some English words in %s' % name)
#show the most similar words for some English words and their French, German, Italian, and Spanish translations
def test_non_english_words(vecs,name,topn=5,lowercase=False):
    testwds = ['English','butterfly','cat','dog','the',
               'français','papillon','chat','chien','le',
               'Deutsch','Schmetterling','Katze','Hund','der',
               'italiano','farfalla','gatto','cane','il',
               'español','mariposa','gato','perro','el']
    if lowercase:
        testwds = [w.lower() for w in testwds]
    d = []
    for w in testwds:
        if w in vecs.vocab:
            sims = [w, ', '.join([x[0] for x in vecs.similar_by_word(w,topn=topn)])]
        else:
            sims = [w,'N.A.']
        d.append(sims)
    show_table(d,['Word','%d Most similar' % topn],
               'Test of possible non-English words in %s' % name)
There are a variety of other issues with many pretrained embeddings. One issue is that some of them (e.g. Google News and GLoVe) include phrases in addition to words. I have filtered those out here.
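In the Google News vectors, for instance, phrases are joined with underscores (e.g. New_York), so a filter along the lines of the sketch below screens them out. This single_word_items helper is only an illustration of the idea, not the actual filtering used for the results in this post.
#illustrative phrase filter (not the actual filter used here):
#Google News joins multi-word phrases with underscores, e.g. 'New_York',
#so keeping only items without '_' (or spaces) leaves the single-word entries
def single_word_items(vecs):
    return [w for w in vecs.index2entity if '_' not in w and ' ' not in w]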
Another issue is that there are often some non-word items included, such as punctuation. Here are the top 30 items for FastText. (Information for the other two corpora is in the appendix.)
ft_vecs, ft_sampler = Setup.setup_FT_English()
print("Vocabulary: %d\tDimensions: %d" % (len(ft_vecs.vocab), ft_vecs.vector_size))
show_first(ft_vecs,'FastText English',30)
The embeddings also differ in terms of whether the items are case sensitive, and even in which words are included: Google News does not include of, though it does include Of (see the appendix). The table below shows that FastText is case sensitive.
test_common_words(ft_vecs,'FastText English')
Yet another issue to be aware of is the presence of non-English words. Here we have some results for FastText English, testing a few English words and their French, German, Italian, and Spanish translations. There are quite a number of non-English words. (Again, the information for the other two corpora is in the appendix.)
test_non_english_words(ft_vecs, 'FastText English')
Continuing with FastText, we can look for frequency and distributional effects using the summary tests from the previous post. First, the test results in table form. Unfortunately, the tests for stratification of rank and of reciprocity take an extremely long time for large vocabularies, so I will omit them here.
smplr = ft_sampler
vs = ft_vecs
name = 'FastText'
tests = ['vfreq','sksim','stfreq'] #,'strank','strecip']
testfs(name,smplr,vs,tests=tests)
We see strong results for the encoding of frequency in the vectors, but only moderate skewing of similarities and moderate stratification of frequencies. When we compare these results with those for FastText trained on Vanity Fair, we see that the relative strengths are reversed, as is the direction of stratification:
smplr = vfair_all['sampler']
vs = vfair_all['ft']
name = 'Vanity Fair with FastText'
tests = ['vfreq','sksim','stfreq'] #,'strank','strecip']
testfs(name,smplr,vs,tests=tests)
Next come the visualization-based tests. Again, the large vocabulary poses challenges, this time for the power-law test, so we'll use a sample of the vocabulary stratified by percentile rather than the whole vocabulary.
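For the record, here is a rough sketch of what a percentile-stratified sample of a frequency-ordered vocabulary could look like. The stratified_vocab_sample helper is hypothetical (it is not the dfewe sampler), and it assumes, as elsewhere in this post, that index2entity lists words in descending frequency order.
import numpy as np
#illustrative percentile-stratified sample of a frequency-ordered vocabulary (not the dfewe sampler)
def stratified_vocab_sample(vecs, per_stratum=100, n_strata=100, seed=0):
    rng = np.random.default_rng(seed)
    words = vecs.index2entity
    sample = []
    for band in np.array_split(np.arange(len(words)), n_strata):   #one band of ranks per percentile
        if len(band) == 0:
            continue
        take = min(per_stratum, len(band))
        sample.extend(words[i] for i in rng.choice(band, size=take, replace=False))
    return sample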
smplr = ft_sampler
vs = ft_vecs
name = 'FastText'
tests = ['vpowers','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
For the large-scale FastText English vectors, we see a solid power-law relation for the k-nearest neighbors, and the dot product trend is the same as that observed in [4]. The most striking result is the dimension values: they are all tightly clustered around 0, unlike any of the other vectors, which show much greater dispersion.
For comparison, we have FastText trained on Vanity Fair, where the power law is not particularly evident, and it is an inverse relation, unlike with the other vectors.
smplr = vfair_all['sampler']
vs = vfair_all['ft']
name = 'Vanity Fair with FastText'
tests = ['vpower','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
The hub tests take a long time as well, so I will also omit them.
#smplr = ft_sampler
#vs = ft_vecs
#name = 'FastText'
#tests = ['hubs','hubp']
#testfs(name,smplr,vs,tests=tests)
delete_ft()
Next up are the Google News embeddings, which use a cbow approach. Here the frequency encoding is weak to moderate, as is the stratification of frequencies.
gn_vecs, gn_sampler = Setup.setup_GoogleNews()
print("Vocabulary: %d\tDimensions: %d" % (len(gn_vecs.vocab), gn_vecs.vector_size))
smplr = gn_sampler
vs = gn_vecs
name = 'Google News'
tests = ['vfreq','sksim','stfreq'] #,'strank','strecip']
testfs(name,smplr,vs,tests=tests)
Since the Google News vectors were created with the cbow version of word2vec, we can compare them to the cbow vectors for Vanity Fair. The results are similar, except for the direction of the frequency stratification, which is direct for Google News but inverse for Vanity Fair.
vfair_all['cbow'] = Setup.make_vecs('cbow', vfair_all['sampler'].sents, 1,5,100,init_sims=True) #window=5, dims=100, min_count=1
smplr = vfair_all['sampler']
vs = vfair_all['cbow']
name = 'Vanity Fair with cbow'
tests = ['vfreq','sksim','stfreq'] #,'strank','strecip']
testfs(name,smplr,vs,tests=tests)
We can now turn to the visual results, where we see a fairly good power-law relation. The dot product trend is also similar to what we saw with sgns.
smplr = gn_sampler
vs = gn_vecs
name = 'Google News'
tests = ['vpowers','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
When we compare Google News with Vanity Fair, we see that Vanity Fair does not have a great power-law relationship, and its dot product trend is not as clear as it is with Google News.
smplr = vfair_all['sampler']
vs = vfair_all['cbow']
name = 'Vanity Fair with cbow'
tests = ['vpower','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
#smplr = gn_sampler
#vs = gn_vecs
#name = 'Google News'
#tests = ['hubs','hubp']
#testfs(name,smplr,vs,tests=tests)
delete_gn()
Finally, we turn to GLoVe. It shows a strong encoding of frequency, but only a moderate skewing of similarities and a weak, direct frequency stratification. A similar pattern is seen with the GLoVe vectors for Vanity Fair below, though the encoding of frequency is more moderate.
glove_vecs,glove_sampler = Setup.setup_Glove_pre(100)
print("Vocabulary: %d\tDimensions: %d" % (len(glove_vecs.vocab), glove_vecs.vector_size))
smplr = glove_sampler
vs = glove_vecs
name = 'GLoVe'
tests = ['vfreq','sksim','stfreq'] #,'strank','strecip']
testfs(name,smplr,vs,tests=tests)
smplr = vfair_all['sampler']
vs = vfair_all['glove']
name = 'Vanity Fair with glove'
tests = ['vfreq','sksim','stfreq'] #,'strank','strecip']
testfs(name,smplr,vs,tests=tests)
In the last comparison, we have the visual results. The power law is fairly good, and the dot product trend confirms the result in [4]. However, when we look at Vanity Fair, the dot product trend is more like what we see with sgns and cbow, not what we see with the large-corpus GLoVe vectors.
smplr = glove_sampler
vs = glove_vecs
name = 'GLoVe'
tests = ['vpowers','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
smplr = vfair_all['sampler']
vs = vfair_all['glove']
name = 'Vanity Fair with glove'
tests = ['vpower','dpmean','dims']
testfs(name,smplr,vs,tests=tests)
#smplr = glove_sampler
#vs = glove_vecs
#name = 'GLoVe'
#tests = ['hubs','hubp']
#testfs(name,smplr,vs,tests=tests)
delete_glove()
To sum up, here we have a summary of the summaries:
| | FastText (large) | FastText (vfair) | cbow (large) | cbow (vfair) | GLoVe (large) | GLoVe (vfair) |
|---|---|---|---|---|---|---|
| freq encoded | strong | moderate | moderate | weak+ | strong | moderate+ |
| skewed sims | moderate | strong | moderate | moderate | moderate | weak |
| freq stratified | moderate direct | strong inverse | moderate direct | moderate inverse | weak direct | moderate direct |
| power law | good | inverse | good | so-so | good | so-so |
| dot product | decreasing pos | decreasing pos | decreasing pos | mixed | decreasing pos to neg | decreasing pos to neg |
In addition, we saw that FastText English had an unusual distribution of dimension values, clustered tightly around 0.
As for patterns among the summaries, the main one to note is that overall there is only moderate skewing of similarities, even though that skewing is what prompted this investigation in the first place.
[1] Google News: https://code.google.com/archive/p/word2vec/, published as Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013), Lake Tahoe, Nevada, United States, 3111–3119.
[2] GLoVe: https://nlp.stanford.edu/projects/glove/, published as Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
[3] FastText (English): https://fasttext.cc/docs/en/english-vectors.html, published as Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
[4] David Mimno and Laure Thompson. 2017. The strange geometry of skip-gram with negative sampling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2873–2878.
gn_vecs, gn_sampler = Setup.setup_GoogleNews()
show_first(gn_vecs,'GoogleNews',30)
test_common_words(gn_vecs,'GoogleNews')
test_non_english_words(gn_vecs,'GoogleNews')
delete_gn()
glove_vecs,glove_sampler = Setup.setup_Glove_pre(100)
show_first(glove_vecs,'glove',30)
test_common_words(glove_vecs,'Glove')
test_non_english_words(glove_vecs,'Glove', lowercase=True)
delete_glove()