This is one of a series of posts. In this post I examine the issue of "hubs": words/vectors that are very similar to many other words/vectors. While the existence of hubs is a mathematical property of (certain) vector spaces, I will explore whether the hubs that arise in word embeddings show patterns that are due either to language or to the methods used to create the embeddings. One particular result is that hubs show frequency effects; another is that ppmi behaves very differently from the other 3 methods on the very small corpus, heartd.
#imports
from dfewe import *
#for tables in Jupyter
from IPython.display import HTML, display
import tabulate
# some utilities
def show_title(t):
    display(HTML('<b>%s</b>' % t))
def show_table(data,headers,title):
    show_title(title)
    display(HTML(tabulate.tabulate(data,tablefmt='html', headers=headers)))
#for dynamic links
links = ('<a href="#link%d">Skip down</a>' % i for i in range(100))
anchors = ('<span id="link%d"></span>' % i for i in range(100))
def make_link():
    display(HTML(next(links)))
def make_anchor():
    display(HTML(next(anchors)))
#set up standard corpora + vectors
vfair_all = Setup.make_standard_sampler_and_vecs('vfair',5,100,1) #window=5, dims=100, min_count=1
heartd_all = Setup.make_standard_sampler_and_vecs('heartd',5,100,1) #window=5, dims=100, min_count=1
what = [['Vanity Fair (vfair)'],['Heart of Darkness (heartd)']]
for i,c in enumerate([vfair_all,heartd_all]):
    sampler = c['sampler']
    what[i].extend([sum(sampler.counts.values()), len(sampler.counts)])
show_table(what, headers=['Corpus','Tokens','Types'], title="Corpora sizes")
The notion of hubs comes from work on k-nearest neighbor classification, and in fact "Hubness is a phenomenon related specifically to nearest-neighbor methods." [1] Since nearest neighbors are an important aspect of the use of word vectors (for example in word similarity and analogy evaluations), it's reasonable to consider the properties of hubs in word embeddings. [2] discuss hubs as a potential problem for evaluation and propose a mitigation strategy, which I will return to in a subsequent post (TBD).
The fundamental observation is that some vectors are very similar to many more vectors than most vectors are. These vectors are the hubs.
A key notion is that of $NN_{k}(x)$, which is the number of times vector x is one of the k-nearest neighbors of the other data points. I'll extend that notion to make the word vectors explicit, so that we can explore the possibility of frequency effects.
$NN_{k}(x,Y)$, where x is a word vector and Y is a set of word vectors, is the number of times x is one of the k-nearest neighbors of a word vector in Y.
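To make the definition concrete, here is a minimal sketch (not the dfewe code; the helper names here are my own) of how $NN_{k}(x,Y)$ could be computed from cosine similarity over a toy vocabulary:
import numpy as np
def nearest_neighbors(word, vectors, k):
    """the k words (other than word itself) with the highest cosine similarity to word"""
    v = vectors[word]
    sims = {other: np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u))
            for other, u in vectors.items() if other != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]
def toy_nn_k(x, Y, vectors, k):
    """NN_k(x,Y): how many words y in Y have x among their k nearest neighbors"""
    return sum(1 for y in Y if x in nearest_neighbors(y, vectors, k))
#toy example: random vectors standing in for real word embeddings
rng = np.random.default_rng(0)
toy_vocab = ['carriage','cart','lady','lord','man','woman']
toy_vectors = {w: rng.normal(size=10) for w in toy_vocab}
print(toy_nn_k('cart', toy_vocab, toy_vectors, k=2))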
Let's see what $NN_{k}$ looks like for some words. (Throughout this post I'll use k=1000; k=1 is small, k=100 is reasonable, and k>1000 doesn't really add anything.)
def nn_k_for_all(k,words,vecs):
    """
    find nn_k for each of words, using the whole vocabulary of vecs as the comparison set and the similarity of vecs
    """
    others = vecs.vocab.keys()
    return Hubs.nn_k(vecs.similar_by_word,k,words,others)
def example1():
    vecs = vfair_all['sgns']
    words = ['carriage','cart','lady','lord','man','woman']
    headers = ['Word','NN<sub>1000</sub>']
    d = sorted(nn_k_for_all(1000,words,vecs), key=lambda x: x[0])
    show_table(d, headers, '')
example1()
We can see that cart is one of the 1000-nearest neighbors of over 1000 other words, while man is one of the 1000-nearest neighbors of fewer than 10 words.
Hubs, then, are vectors with a higher than typical $NN_{k}$, like cart in this example.
Since "higher than typical" is a bit vague, it's worth looking at the distribution of the values of $NN_{k}$, and we'll do so by sampling words in percentile bands, and compare them to a sample from all of the vocabulary. (I'm doing the sampling rather than all the comparisons to save time.) Here's the results for the sgns vectors for vfair.
sampler = vfair_all['sampler']
vecs = vfair_all['sgns']
name = 'vfair with sgns'
Hubs.nn_k_by_percentile(sampler,vecs,name,k=1000,max_words=1000,steps=5,words_per_step=2)
The first thing to notice is that there are some words that clearly qualify as hubs. The second thing to notice is the extreme skewing of the distribution by percentile.
Here's a comparison of all 4 methods for vfair. Notice that the sampling makes a difference in the $NN_k$ values for vfair compared to the above, but only a slight difference for the hubs, and the overall trend is the same.
However, the main observation is that, once again, glove and ppmi show very different trends from sgns and ft, with more variation and less clearly defined hubs.
combo = vfair_all
name='vfair'
Hubs.compare_nn_k_by_percentile(combo,name,k=1000,max_words=1000,steps=5,words_per_step=2)
Now let's look at heartd. Unfortunately, it's hard to say much other than that all 4 methods show different trends than with vfair. Sampling may well be the culprit here, so we'll look more carefully below.
combo = heartd_all
name='heartd'
Hubs.compare_nn_k_by_percentile(combo,name,k=1000,max_words=1000,steps=5,words_per_step=2)
We still need to operationalize the notion of hub. One thing that is immediately obvious from all of the graphs above is that, whatever the distribution of $NN_k$ is, it is not normal. We can still use the standard deviation as a heuristic for finding hubs, though: we choose a threshold number of standard deviations above the mean, and any word whose $NN_k$ falls below that threshold does not count as a hub. We can then examine the words above the threshold and either take them all or just the highest scoring ones.
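Here is a minimal sketch of that heuristic, assuming the $NN_k$ values have already been computed (the actual selection is done by Hubs.find_hubs_with_all from dfewe; the names below are mine):
import numpy as np
def hubs_by_std(nn_k_values, threshold=4):
    """given a dict word -> NN_k, return (mean, std, hubs), where hubs are the words
    whose NN_k is at least threshold standard deviations above the mean,
    sorted by NN_k descending"""
    values = np.array(list(nn_k_values.values()), dtype=float)
    mean, std = values.mean(), values.std()
    cutoff = mean + threshold * std
    hubs = sorted((w for w, v in nn_k_values.items() if v >= cutoff),
                  key=nn_k_values.get, reverse=True)
    return mean, std, hubs
#toy usage: most words have small NN_k values, a couple have very large ones
toy = {'cart': 1200, 'wax': 950, 'lady': 30, 'lord': 12, 'man': 8, 'woman': 5}
print(hubs_by_std(toy, threshold=1))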
Here are the results for vfair, using 4 standard deviations as the threshold, and showing just the top 10 words.
def show_hubs(sampler,vecs,name,k=1000,threshold=4,topn=10):
    """
    show the topn highest-scoring hubs, i.e. words whose NN_k is at least threshold standard deviations above the mean
    """
    (m,std,df) = Hubs.find_hubs_with_all(sampler,vecs,k=k,thresh=threshold)
    title = "Hubs for %s with k=%d and threshold of %d standard deviations" % (name,k,threshold)
    show_title(title)
    stitle = "Overall mean: %0.4f Overall std: %0.4f" % (m,std)
    show_title(stitle)
    display(HTML(df[:topn].to_html(index=False)))
def compare_hubs(combo,name,k=1000,threshold=4,topn=10):
    """
    do show_hubs for each of the methods in combo
    """
    Utils.compare_methods(combo,name,show_hubs,k=k,threshold=threshold,topn=topn)
k = 1000
threshold = 4
topn = 10
compare_hubs(vfair_all,'vfair',k=k,threshold=threshold,topn=topn)
Various researchers, including [2], [3], and my own earlier post, have found that word similarities vary across runs of algorithms like sgns because of their random aspects. Not surprisingly, the same is true for hubs, since they are based on similarities.
As an example, here's a listing of the top 20 words by $NN_{1000}$ for vfair from three runs of sgns. While there are some differences, they are slight.
def compare_runs(topn=20):
    vfair_all2 = Setup.make_standard_sampler_and_vecs('vfair',5,100,1) #window=5, dims=100, min_count=1
    vfair_all3 = Setup.make_standard_sampler_and_vecs('vfair',5,100,1) #window=5, dims=100, min_count=1
    (m1,std1,df1) = Hubs.find_hubs_with_all(vfair_all['sampler'],vfair_all['sgns'])
    (m2,std2,df2) = Hubs.find_hubs_with_all(vfair_all2['sampler'],vfair_all2['sgns'])
    (m3,std3,df3) = Hubs.find_hubs_with_all(vfair_all3['sampler'],vfair_all3['sgns'])
    title = 'Comparison of hubs across runs of sgns with vfair'
    what = '<table><tr><th>run 1</th><th>run 2</th><th>run 3</th></tr>'
    what += '<tr><td>' + df1[:topn].to_html(index=False) + '</td>'
    what += '<td>' + df2[:topn].to_html(index=False) + '</td>'
    what += '<td>' + df3[:topn].to_html(index=False) + '</td></tr></table>'
    show_title(title)
    display(HTML(what))
compare_runs(20)
We can also compare the hubs across the different methods for vfair. What we find is that each method leads to different hubs.
def compare_hub_words(combo,name,k=1000,thresh=4,topn=20):
    """
    compare which words are hubs across the methods, among the topn for each
    """
    sampler = combo['sampler']
    vs = ['sgns','ft','glove','ppmi']
    vhubs = dict()
    for v in vs:
        vecs = combo[v]
        (m,std,df) = Hubs.find_hubs_with_all(sampler,vecs,k=k,thresh=thresh)
        if v == 'glove':
            df = df.replace('<unk>', '&lt;unk&gt;') #escape <unk> so it appears in the html
        vhubs[v] = set(list(df[:topn]['word']))
    show_title("Comparison of top %d hubs for %s, k=%d, threshold=%d stds" % (topn,name,k,thresh))
    maxn = 0
    in_all = None
    in_some = set()
    for h in vhubs.values():
        in_some |= h
        in_all = h if in_all is None else in_all & h
        maxn = max(maxn,len(h))
    combon = len(in_all)
    show_title('Distinct hubs: %d Combined overlap: %d' % (len(in_some), combon))
    if combon == 0:
        display(HTML('<em>There are no hubs in common among the top %d hubs for each method</em>' % maxn))
    else:
        if combon == maxn:
            display(HTML('<em>There is complete overlap among the top %d hubs for each method</em>' % maxn))
        display(HTML('<p>' + '<br>'.join(sorted(list(in_all))) + '</p>'))
    if combon < maxn:
        #pairwise comparison
        for v1 in vhubs:
            for v2 in vhubs:
                if v1==v2:
                    break
                what = []
                overlap = vhubs[v1] & vhubs[v2]
                first_not_second = vhubs[v1] - vhubs[v2]
                second_not_first = vhubs[v2] - vhubs[v1]
                what += [[', '.join(sorted(list(overlap))),
                          ', '.join(sorted(list(first_not_second))),
                          ', '.join(sorted(list(second_not_first)))]]
                headers = ["%s and %s overlap: %d" % (v1,v2, len(overlap)),
                           "%s but not %s: %d" % (v1,v2, len(first_not_second)),
                           "%s but not %s: %d" % (v2,v1, len(second_not_first))]
                show_table(what,headers,'')
k = 1000
threshold = 4
topn = 50
make_link()
compare_hub_words(vfair_all,'vfair', k=k,thresh=threshold,topn=topn)
make_anchor()
Repeating the same comparison with heartd (but using 3 standard deviations as our threshold), we again find very different hubs across the methods.
k = 1000
threshold = 3 #NB 3 instead of 4
topn = 50
make_link()
compare_hub_words(heartd_all,'heartd', k=k,thresh=threshold,topn=topn)
make_anchor()
We can also examine the role of word frequency for hubs. One question is whether different frequency bands have different hubs. In fact they do, although some hubs overlap across the different frequency bands. Here's what that looks like for vfair, starting with a quick sketch of the idea.
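The sketch (a hypothetical helper, not the dfewe code): candidate hubs come from the whole vocabulary, but $NN_{k}(x,Y)$ is computed with Y restricted to the words in one frequency band, so a hub for a band need not itself belong to that band.
import numpy as np
def hubs_for_band(nn_k, vocab, band, threshold=2):
    """hubs *for* a band: score every word x in vocab by nn_k(x, band), where band
    is the set of words in one frequency band, and keep the words scoring at least
    threshold standard deviations above the mean"""
    scores = {x: nn_k(x, band) for x in vocab}
    values = np.array(list(scores.values()), dtype=float)
    cutoff = values.mean() + threshold * values.std()
    return sorted((x for x, v in scores.items() if v >= cutoff),
                  key=scores.get, reverse=True)
#toy usage with a fake nn_k that just looks scores up in a table
fake_scores = {'cart': 50, 'wax': 45, 'lady': 3, 'lord': 1, 'man': 2}
fake_nn_k = lambda x, Y: fake_scores[x]
print(hubs_for_band(fake_nn_k, list(fake_scores), band=['lady','lord','man'], threshold=1))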
#Counter, np and plt may already be provided via dfewe; import explicitly to be safe
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
def show_hubs_for_band(sampler,vecs,name,k=1000,threshold=2,step=5,topn=20,show_full=True):
    """
    use each band of width step of the vocabulary as the others; use the whole vocab as potential hubs,
    i.e. this looks for hubs that are _for_ the bands
    """
    what = Hubs.find_hubs_for_band(sampler,vecs,k=k,thresh=threshold,step=step)
    counts = Counter()
    same_bands = 0
    num_hubs = 0
    for i,(m,std,df_) in enumerate(what):
        df = df_[:topn].copy()
        counts.update(list(df['word']))
        df['same band'] = (df['percentile'] >= i) & (df['percentile'] <= i+step)
        same_bands += len(df[df['same band']])
        num_hubs += len(df)
        if show_full:
            lname = "%s, hubs for percentile range %d to %d" % (name,i,i+step)
            title = "Hubs for %s with k=%d and threshold of %d standard deviations" % (lname,k,threshold)
            show_title(title)
            stitle = "Overall mean: %0.4f Overall std: %0.4f" % (m,std)
            show_title(stitle)
            if len(df[df['word'] == '<unk>']) > 0:
                df = df.replace('<unk>', '&lt;unk&gt;') #escape <unk> so it appears in the html (for glove)
            display(HTML(df.to_html(index=False)))
    if num_hubs > 0:
        d = [[w,sampler.get_percentile(w),c] for (w,c) in counts.most_common()]
        show_table(d,['Hub','Percentile','Number of bands'], 'Hubs and the number of bands they occurred in for %s' % name)
        show_title('Distribution of hubs by percentile for %s' % name)
        pcounts = Counter()
        for (_,p,c) in d:
            pcounts.update({p:c})
        pdata = [pcounts[i] if i in pcounts else 0 for i in range(0,101)]
        fig, ax = plt.subplots(figsize=(10, 2))
        ax.bar(np.arange(0,101), pdata, color='orange')
        ax.set_xticks(np.arange(0,101,10))
        ax.set_xlabel('percentile')
        ax.set_ylabel('count')
        ax.spines["top"].set_visible(False)
        ax.spines["right"].set_visible(False)
        plt.show()
        show_title('Across the bands, %d of the %d hubs (= %0.2f) were in the target band (taking the top %d in each band)' %
                   (same_bands, num_hubs, same_bands/num_hubs, topn))
    else:
        show_title('There are no hubs in any of the bands in %s at threshold = %0.2f' % (name, threshold))
def compare_hubs_for_band(combo,name,k=1000,threshold=2,step=5,topn=20,show_full=False):
    """
    do show_hubs_for_band for each method in combo
    """
    Utils.compare_methods(combo,name,show_hubs_for_band,k=k,threshold=threshold,step=step,topn=topn,show_full=show_full)
name = 'vfair, with sgns'
sampler = vfair_all['sampler']
vecs = vfair_all['sgns']
k = 1000
thresh = 4
step = 5
topn = 10
make_link()
show_hubs_for_band(sampler,vecs,name,k=k,threshold=thresh,step=step,topn=topn, show_full=False)
make_anchor()
The obvious thing to notice is that the hubs are mostly relatively low frequency words (10th percentile or lower). A related consequence is that only about 1/3 of the hubs were in the band being compared.
We can compare sgns with the other methods for vfair.
name = 'vfair'
combo = vfair_all
k = 1000
threshold = 4
step = 5
topn = 10
make_link()
compare_hubs_for_band(combo,name,k=k,threshold=threshold,step=step,topn=topn, show_full=False)
make_anchor()
There are a couple of striking differences across the methods. The first is that sgns and ft show similar patterns, with the hubs being primarily relatively low frequency words. glove and ppmi have spikes at the lowest frequency words, but the rest of the hubs for glove are spread out among the higher frequency words, while for ppmi the hubs occur more or less across the spectrum.
The other difference is the proportion of hubs occurring in the band being compared, which drops from 0.31 for sgns to 0.12 for ft, and down to 0.06 and 0.05 for glove and ppmi respectively.
We can check whether the same patterns hold for heartd. They don't: instead there's a massive breakdown, except for ppmi. There are no hubs at all for sgns and ft, and only 1 for glove.
name = 'heartd'
combo = heartd_all
k = 1000
threshold = 3
step = 5
topn = 10
make_link()
compare_hubs_for_band(combo,name,k=k,threshold=threshold,step=step,topn=topn, show_full=False)
make_anchor()
It seems like there should be a connection between the frequency effects with hubs and other frequency effects we've seen. However, the connection isn't straightforward, especially for heartd.
We saw in the stratification post that similarities are stratified by word frequency, and that helps to explain some of these patterns. Starting with sgns and ft: two low frequency words are more similar than two high frequency words are. Since there are lots of low frequency words, that might explain why we get hubs among the lower frequency words. glove works the opposite way: two high frequency words are more similar than two low frequency words, and we get high frequency hubs. The spike in hubs at the lowest frequencies corresponds to the anomalous cell in the stratification, where the lowest frequency words are more similar to each other than even slightly more frequent words are.
Nice so far. Unfortunately, ppmi doesn't follow the pattern. It shows similar stratification to glove, but its hubs, as we saw, are across the board.
Heartd also only partially follows the pattern. sgns and ft are both extremely stratified, which might explain why there are no hubs: everything is close to everything else. However, glove and ppmi are more like vfair for stratification, but glove has only 1 hub in heartd, while ppmi has many.
Thus, even though the stratification of similarities may be relevant for understanding hubs, it is clearly not sufficient.
[1] Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, and Mirjana Ivanović. 2011. A Probabilistic Approach to Nearest-Neighbor Classification: Naive Hubness Bayesian kNN. In Proceedings of the International Conference on Information and Knowledge Management (CIKM 2011), pp. 2173–2176.
[2] Johannes Hellrich and Udo Hahn. 2016. Bad Company—Neighborhoods in Neural Embedding Spaces Considered Harmful. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2785–2796, Osaka, Japan, December 11-17 2016.
[3] Maria Antoniak and David Mimno. 2018. Evaluating the Stability of Embedding-based Word Similarities. In Transactions of the Association for Computational Linguistics, vol. 6, pp. 107–119.