Word vectors with small corpora:

Visualizing word vectors

© 2018 Chris Culy, April 2018



This is one of a series of posts on using word vectors with small corpora. This post goes beyond small corpora, however, to discuss approaches to visualizing word vectors for any size corpus.


In [2]:
import numpy as np
from gensim import models

#for tables in Jupyter
from IPython.display import HTML, display
import tabulate

#for visualization
import math
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA as sPCA
from sklearn import manifold #MDS, t-SNE


Visualizing data often helps us understand it better, whether as an overview or at a more detailed level. Word vector models pose several significant problems for visualization, however.

The first problem has to do with the size of the vocabulary. Even with these small corpora, the size of the vocabulary makes it difficult to visualize it all at once. (The smallest text in the previous experiment, Nightmare Abbey, has just under 5,000 types, while the largest text, Moby Dick, has almost 18,000 types.)

The second problem has to do with the number of dimensions of the vector spaces. Our test parameters range from 25 to 400. Given that we visualize information in two or three dimensions (perhaps with time as a fourth), we have many more dimensions of information than we have available for the visualization. Even standard visualization encoding techniques (color, shape, etc.) give us just a few more dimensions to use, and their expressibility is relatively limited.

The third problem has to do with the nature of the axes of the vector spaces. For most vector space models, including the three discussed in this series of posts (ppmi_svd, word2vec, and FastText), the axes do not have any inherent meaning. This lack of meaning makes it difficult to use those axes in a visualization.

To address the problem of the size of the vocabulary, we can note that for most of the tasks involving word vectors, we are not interested in the whole vocabulary at once. For example, we might be interested in the closest items to a relatively small set of words. I will thus set aside the issue of vocabulary size, and focus on the number of dimensions and the interpretability issues.

In the examples here, I will use a centroid word2vec model for Vanity Fair that we created in the post on stabilizing randomness. The parameters are win=10, dim=100, min_count=10. There is nothing special about these parameters, nor about the use of word2vec — other parameters and other techniques would work just as well.

In [3]:
vf = 'vanity_fair_pg599.txt-sents-clean.txt-word2vec-win10-dim100-thresh10.vecs'
vecs = models.KeyedVectors.load(vf)

Dimensions and similarity

Similarity lines

Probably the most common use of visualization of word vectors is to get a sense of how similar words are. In the post on exploring similarities, we used one (uncommon) technique for visualizing similarities, namely plotting rank vs. similarity. (We then used the slope of those lines for further analysis, but that is not the point here.) Below we see the similarity line for the 10 words most similar to 'house'.

In [4]:
def show_closest_line(vecs,word,n):
    display(HTML("<b>%d words most similar to '%s'</b>" % (n,word)))
    tops = vecs.similar_by_word(word, topn=n, restrict_vocab=None)
    items = [item[0] for item in tops]
    sims = [item[1] for item in tops]
    fig = plt.figure(num=None, figsize=(10, 6), dpi=80, facecolor='w', edgecolor='k')
    ax = fig.add_subplot(111)

    plt.xticks(range(n), [i+1 for i in range(n)])

    ax.plot(sims, color="purple", alpha=0.5)
    for item, x, y in zip(items, range(n), sims):
        ax.annotate( item, xy=(x, y), xytext=(20, -7), textcoords='offset points', 
                     ha='right', va='bottom', color='orange', fontsize=14 )

In [5]:
10 words most similar to 'house'

A large advantage of similarity lines is that they are easily interpretable: rank and similarity are familiar ideas and they are encoded straightforwardly by position (horizontal and vertical, respectively).

The main disadvantage of similarity lines is that they show us only one aspect of similarity, namely similarity to a single given word ('house' in this example). We cannot tell anything about the similarity of these words to each other.

To see why this is so, consider a geographic example. The distance from London, England to the closest European capital, Paris, France, is 460km, while the distance from London to the second closest European capital, Dublin, Ireland, is 557km. However, the distance from Paris to Dublin is not 97km (557-460), but rather 1024km.

[Figure: Dublin, London, and Paris]

The problem is that in both the geographic example and word vectors, distance/similarity is one dimensional but the data is multi-dimensional (location involves orientation, not just distance), so some information is lost when representing the original data with a single dimension.
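The same effect is easy to reproduce with vectors. The following is a minimal sketch using made-up two-dimensional vectors (they are not taken from the Vanity Fair model, and the names `target`, `left` and `right` are invented for illustration): two items can be equally similar to a target and still be quite dissimilar to each other.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, invented for illustration only
target = np.array([1.0, 0.0])
left = np.array([np.cos(np.radians(40)), np.sin(np.radians(40))])
right = np.array([np.cos(np.radians(-40)), np.sin(np.radians(-40))])

sim_left = cosine(target, left)      # similarity of 'left' to the target
sim_right = cosine(target, right)    # the same value, by construction
sim_between = cosine(left, right)    # lower: the two are 80 degrees apart

print(round(sim_left, 3), round(sim_right, 3), round(sim_between, 3))
```

A similarity line for `target` would place `left` and `right` at the same height, and nothing in that chart would reveal how far apart they are from each other.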

Dimensionality reduction techniques

A common technique when dealing with high dimensional data is to reduce the number of dimensions by transforming the data into a lower number of new dimensions for visualization. The number of dimensions is usually two, sometimes three, given our two dimensional displays. There are many ways to do dimensionality reduction, each with its own goals and motivations, but I will focus on three common techniques, using just two dimensions for illustration.

  • Principal Components Analysis (PCA)
  • Multidimensional Scaling (MDS)
  • t-distributed Stochastic Neighbor Embedding (t-SNE)

It is important to note that these techniques are mathematical transformations and not visualization techniques: the results of the transformations are used for visualization. In fact, the function I wrote below to illustrate these techniques calculates the transformation in each case, and then uses exactly the same code for the visualizations.

An additional point is that the new dimensions of the transformed data do not have any inherent meaning relative to the original data, which is why there are no scales in the charts below.

In [25]:
def show_closest_2d(vecs,word,n,method):
    tops = vecs.similar_by_word(word, topn=n, restrict_vocab=None)
    display(HTML("<b>%d words most similar to '%s' (%s)</b>" % (n,word, method)))
    #display(HTML(tabulate.tabulate(tops, tablefmt='html', headers=[])))

    items = [word] + [x[0] for x in tops]

    wvecs = np.array([vecs.word_vec(wd, use_norm=True) for wd in items])

    if method == "PCA":
        spca = sPCA(n_components=2)
        coords = spca.fit_transform(wvecs)
        #print('Explained variation per principal component:', spca.explained_variance_ratio_, "Total:", sum(spca.explained_variance_ratio_))
    elif method == "tSNE":
        tsne = manifold.TSNE(n_components=2)
        coords = tsne.fit_transform(wvecs)
        #print("kl-divergence: %0.8f" % tsne.kl_divergence_)
    elif method == "tSNE-PCA":
        tsne = manifold.TSNE(n_components=2, init='pca')
        coords = tsne.fit_transform(wvecs)
        #print("kl-divergence: %0.8f" % tsne.kl_divergence_)
    elif method == "MDS":
        dists = np.zeros((len(items), len(items)))
        for i,item1 in enumerate(items):
            for j,item2 in enumerate(items):
                dists[i][j] = dists[j][i] = vecs.distance(item1,item2)
        mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9, random_state=0, dissimilarity="precomputed", n_jobs=1)
        coords = mds.fit(dists).embedding_
        #print("Stress is %0.8f" % mds.stress_)

    else:
        raise ValueError("Invalid method: %s" % method)

    plt.figure(num=None, figsize=(8, 8), dpi=80, facecolor='w', edgecolor='k')

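    # largest absolute coordinate (plus a small margin), used for symmetric axis limits
    lim = 1.1 * np.abs(coords).max()
    plt.xlim(-lim, lim)
    plt.ylim(-lim, lim)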
    plt.scatter(coords[2:,0], coords[2:,1])
    plt.scatter(coords[0:1,0], coords[0:1,1], color='black')
    plt.scatter(coords[1:2,0], coords[1:2,1], color='orange')
    for item, x, y in zip(items[2:], coords[2:,0], coords[2:,1]):
        plt.annotate( item, xy=(x, y), xytext=(-2, 2), textcoords='offset points', 
                     ha='right', va='bottom', color='purple', fontsize=14 )

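    # coordinates of the target word and of its most similar word
    x0, y0 = coords[0]
    x1, y1 = coords[1]
    plt.annotate( word , xy=(x0, y0), xytext=(-2, 2), textcoords='offset points', 
                 ha='right', va='bottom', color='black', fontsize=16 )
    plt.annotate( items[1] , xy=(x1, y1), xytext=(-2, 2), textcoords='offset points', 
                 ha='right', va='bottom', color='orange', fontsize=14 )

    # circle centered on the target word, passing through its most similar word
    ax = plt.gca()
    r = math.sqrt( (x1-x0)**2 + (y1-y0)**2 )
    circle = plt.Circle((x0, y0), r, color='orange', fill=False)
    ax.add_patch(circle)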


Principal Components Analysis (PCA)

The goal of PCA is to transform the original data into a representation using fewer, uncorrelated dimensions, chosen so that each successive dimension captures as much of the remaining variance in the data as possible. Here is an example showing the 10 words most similar to 'house' in this word2vec model.

In [26]:
10 words most similar to 'house' (PCA)

We saw in the similarity line example above that 'lane' is the word most similar to 'house' in this vector space. However, in the new PCA dimensions, 'lane' is not the closest to 'house', but 'admitted' is. In addition, a second word, 'hampshire' (Hampshire), is also closer to 'house' than 'lane' is. This discrepancy is due to the fact that PCA does not preserve distances (or even have that as a goal). This mismatch between similarity and the two-dimensional representations of the vector space is an inherent one: it is not possible in general to preserve all the distances from a higher dimensional space in a lower one.
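To make the PCA discussion concrete, here is a minimal NumPy-only sketch of PCA via the singular value decomposition, applied to random stand-in data rather than to the actual word vectors. The names `pca_2d` and `pairwise_distances` and the random data are assumptions for illustration; `sklearn.decomposition.PCA`, used in the function above, should give the same projection up to the sign of each axis. The sketch also suggests why neighbors can change: projecting onto two components can only shrink distances, and it shrinks them by different amounts for different pairs.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                       # center each dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:2].T                        # coordinates on the new axes
    explained = (S ** 2) / np.sum(S ** 2)         # variance ratio per component
    return coords, explained[:2]

def pairwise_distances(X):
    """Euclidean distance between every pair of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(11, 100))                    # stand-in for 11 word vectors

coords, explained = pca_2d(X)
d_high = pairwise_distances(X)
d_low = pairwise_distances(coords)

print("variance kept by 2 components:", explained.sum())
print("nearest to row 0 in 100 dims:", np.argsort(d_high[0])[1])
print("nearest to row 0 in 2 dims:  ", np.argsort(d_low[0])[1])
```

Whether the nearest neighbor of row 0 changes depends on the data, but whenever the two components keep only a modest share of the variance, changes like the 'lane' vs. 'admitted' one are to be expected.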

Multidimensional Scaling (MDS)

Unlike PCA, MDS does try to preserve distances. Here is what the 10 words most similar to 'house' look like under MDS. The word 'admitted' is still closer to 'house' than 'lane', but only barely, and it is the only word closer than 'lane'.

In [27]:
10 words most similar to 'house' (MDS)
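For readers who want to see the idea behind MDS in code, here is a minimal sketch of classical (Torgerson) MDS in NumPy. Note that this is a swapped-in, simpler variant: `manifold.MDS` in scikit-learn uses the iterative SMACOF algorithm instead, so the two will not in general produce identical layouts. The function name `classical_mds` and the toy points are assumptions for illustration.

```python
import numpy as np

def classical_mds(dists, n_components=2):
    """Classical MDS: embed points so Euclidean distances approximate dists."""
    n = dists.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (dists ** 2) @ J               # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0, None))  # guard against tiny negatives
    return eigvecs[:, order] * scale

def pairwise_distances(X):
    """Euclidean distance between every pair of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy check: points that really are two-dimensional are recovered
# up to rotation and reflection, so all pairwise distances are preserved.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0], [1.0, 1.0]])
dists = pairwise_distances(points)

coords = classical_mds(dists)
dists2 = pairwise_distances(coords)
print(np.allclose(dists, dists2))
```

With distances taken from a 100-dimensional model there is in general no exact two-dimensional layout, so some distortion remains, which is consistent with 'admitted' still ending up slightly closer to 'house' than 'lane' does.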

t-distributed Stochastic Neighbor Embedding (t-SNE)

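t-SNE also starts from distances, but with a different emphasis. Where MDS tries to preserve all pairwise distances, t-SNE concentrates on preserving local neighborhoods: words that are "close" in the original space should stay close, while "far" distances are represented less faithfully. Here is what the 10 words most similar to 'house' look like under t-SNE.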

In [28]:
10 words most similar to 'house' (tSNE)

Since t-SNE is stochastic, it can give different results every time it is used with exactly the same data and parameters. Here is another output.

In [29]:
10 words most similar to 'house' (tSNE)

However, it is also possible to combine t-SNE with PCA (by using PCA to initialize t-SNE, as with init='pca' in the function above) and have a (more) stable outcome. Here is an example of the same 10 words most similar to 'house'. Note that 'park' is roughly the same distance from 'house' as 'lane' even though 'lane' is the closest word to 'house' and 'park' is the 6th-closest.

In [30]:
10 words most similar to 'house' (tSNE-PCA)

Evaluating dimensionality reduction techniques

The main advantage of using dimensionality reduction is that the results can give us an idea of groupings of words in a way that the similarity lines cannot. However, given that it is in general impossible to preserve all of the distances in a high dimensional space when we reduce it down to a lower number of dimensions, the groupings we see in the reduced version must be taken with a grain of salt, as can be seen from the differences across the different techniques. The choice of dimensionality reduction technique thus depends on external factors, such as speed. It might also be prudent to try more than one, as we have done here.
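One hedged way to put a number on that grain of salt is to check how many of each word's nearest neighbors survive the reduction. The sketch below is not part of the analysis in this series: the function names `neighbor_overlap` and `pairwise_distances`, the choice of `k`, and the random stand-in data are all assumptions for illustration.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def neighbor_overlap(d_high, d_low, k=3):
    """Average fraction of each point's k nearest neighbors kept in the layout."""
    n = d_high.shape[0]
    total = 0.0
    for i in range(n):
        # k nearest neighbors of point i, excluding the point itself
        near_high = [j for j in np.argsort(d_high[i]) if j != i][:k]
        near_low = [j for j in np.argsort(d_low[i]) if j != i][:k]
        total += len(set(near_high) & set(near_low)) / k
    return total / n

rng = np.random.default_rng(3)
high = rng.normal(size=(20, 50))      # stand-in for 20 word vectors
low = high[:, :2]                     # crude stand-in for a 2-D layout

d_high = pairwise_distances(high)
d_low = pairwise_distances(low)

score = neighbor_overlap(d_high, d_low, k=3)
perfect = neighbor_overlap(d_high, d_high, k=3)
print(score, perfect)
```

In practice `low` would be the `coords` array produced by PCA, MDS or t-SNE, and comparing the scores would be one way to choose among the techniques for a given set of words.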

Incidentally, this issue with dimensionality reduction raises the issue of the appropriateness of the visualizations commonly used to show language change using word vectors (e.g. [1], [2], [3]), since they use dimensionality reduction to create the two-dimensional visualizations.

Visualizing the vectors

The previous section was concerned with visualizing words to see similarities. The actual values of the word vectors were not relevant. In this section, we will explore a couple of ways to visualize the values.

However, we need to be careful in how we use these visualizations, since dimensions do not have any inherent meaning. The most we can do is use these visualizations to compare vectors, to see which components have similar values and which don't.

For the purposes of these visualizations, I am using normalized vectors (so the length of each vector is 1). This is also true of the preceding visualizations, but it was not relevant, since we were dealing with similarities of whole vectors, not the components of the vectors.
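As a reminder of what normalization does to the components, here is a minimal sketch with random stand-in vectors (the function name `normalize_rows` and the data are assumptions for illustration, not part of the notebook). After scaling to unit length, every component lies between -1 and 1, which puts all the vectors on a common scale for color or vertical position, and cosine similarity reduces to a plain dot product.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row of X to unit (L2) length."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms

rng = np.random.default_rng(1)
raw = rng.normal(size=(4, 100))       # stand-in for 4 unnormalized word vectors
unit = normalize_rows(raw)

lengths = np.linalg.norm(unit, axis=1)
# cosine similarity computed the long way on the raw vectors...
cos_raw = raw[0] @ raw[1] / (np.linalg.norm(raw[0]) * np.linalg.norm(raw[1]))
# ...and as a simple dot product on the unit vectors
cos_unit = unit[0] @ unit[1]

print(lengths, cos_raw, cos_unit)
```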

Encoding with color

One simple way to visualize the components of vectors is to assign each value a color, and then show them as shapes along a horizontal axis, representing the dimensions. In the first example below, we'll use 'house' and some of its most similar words, from the example above.

In [12]:
def compare_words_with_color(vecs,wds):
    wdsr = wds[:]
    display(HTML('<b>Word vectors for: %s</b>' % ', '.join(wdsr)))
    vs = [vecs.get_vector(wd) for wd in wds]
    dim = len(vs[0])
    fig = plt.figure(num=None, figsize=(12, 2), dpi=80, facecolor='w', edgecolor='k')
    ax = fig.add_subplot(111)
    for i,v in enumerate(vs):
        ax.scatter(range(dim),[i]*dim, c=v, cmap='Spectral', s=16)
    #plt.xticks(range(n), [i+1 for i in range(n)])
    plt.yticks(range(len(wds)), wds)

In [13]:
words_house = ['house','lane','street','hulker']
Word vectors for: house, lane, street, hulker

Impressionistically, there is a lot of similarity in the values of the components, not surprisingly. On the other hand, when we look at words that are semantically related ('woman','girl','lady','man'), it's harder to see similarities in the values, though dimension 60 stands out.

In [14]:
words = ['woman','girl','lady','man']

In [15]:
Word vectors for: woman, girl, lady, man
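Judging agreement by eye is hard with 100 components. A rough numerical companion to the color chart is to rank the dimensions by how much the vectors differ in each one. The sketch below uses small invented vectors, and the helper name `rank_dimensions_by_spread` is hypothetical; it is only one of several reasonable ways to measure agreement.

```python
import numpy as np

def rank_dimensions_by_spread(vectors):
    """Return dimension indices ordered from most to least agreement.

    Agreement is measured by the range (max - min) of the values that the
    given vectors have in each dimension: a small range means similar values.
    """
    vectors = np.asarray(vectors)
    spread = vectors.max(axis=0) - vectors.min(axis=0)
    return np.argsort(spread), spread

# Toy vectors invented for illustration:
# dimension 2 agrees closely across the rows, dimension 0 does not
toy = np.array([
    [ 0.50, 0.10, 0.30, -0.20],
    [-0.40, 0.15, 0.31, -0.10],
    [ 0.10, 0.05, 0.29,  0.20],
])

order, spread = rank_dimensions_by_spread(toy)
print(order)
```

With the model loaded above, the same helper could be applied to `[vecs.get_vector(wd) for wd in words]`. Both ends of the ranking are of interest: the first entries are the dimensions where the words agree most, the last entries where they differ most.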

Encoding as polylines

In the color encoding technique, the dimensions are represented on the horizontal axis, while the values are represented by color. An alternative for the values is to represent them on the vertical axis, connecting the components of a single vector by line segments (a polyline). This technique borrows from the parallel coordinates visualization. Once again, the difference here is that unlike typical uses of parallel coordinates, our components/axes do not have any inherent meaning.

Here's a version in parallel coordinates style, with all the vectors represented in the same chart.

In [16]:
def compare_words_polyline(vecs,wds,combined=True):
    display(HTML('<b>Word vectors for: %s</b>' % ', '.join(wds)))
    vs = [vecs.get_vector(wd) for wd in wds]
    dim = len(vs[0])
    nseries = len(wds)

    colormap = plt.cm.tab20b
    colors = [colormap(i) for i in np.linspace(0, 1, nseries)]

    if combined:
        fig = plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
        ax = fig.add_subplot(111)

        for i,v in enumerate(vs):
            ax.plot(v, label=wds[i], c=colors[i])
        ax.legend()
    else:
        # one chart per word, stacked vertically with shared axes
        fig, axarr = plt.subplots(nseries+1, sharex=True, sharey=True, figsize=(12, 2+nseries), dpi=80, facecolor='w', edgecolor='k')

        for i,v in enumerate(vs):
            axarr[i+1].plot(v, label=wds[i], c=colors[i])

In [17]:
words2 = words[:]

In [18]:
Word vectors for: woman, girl, lady, man

However, I find these examples easier to understand if each vector has its own chart, as in the following:

In [19]:
Word vectors for: woman, girl, lady, man

In [20]:
words_house2 = words_house[:]
compare_words_polyline(vecs, words_house, combined=False)
Word vectors for: hulker, street, lane, house

One potential drawback to the polyline approach is that we tend to interpret the lines as representing an ordered sequence, while the dimensions have no inherent order (in addition to having no inherent meaning). As long as we keep this caveat in mind, the polylines are a reasonable way to compare vectors.
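The point about order can be checked directly. In this minimal sketch (random stand-in vectors, invented names), the dimensions of two vectors are shuffled in the same arbitrary way: the polylines would look completely different, but the similarity between the vectors is unchanged.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(2)
a = rng.normal(size=100)              # stand-ins for two word vectors
b = rng.normal(size=100)

perm = rng.permutation(100)           # one arbitrary reordering of the dimensions
sim_original = cosine(a, b)
sim_permuted = cosine(a[perm], b[perm])

print(sim_original, sim_permuted)
```

So any apparent "trend" along the horizontal axis of a polyline chart is an artifact of the order in which the dimensions happen to be stored.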

While the choice of technique for visualizing the vector component values is largely a personal preference, the polyline approach allows for a finer-grained comparison, while the color encoding approach perhaps gives a more holistic impression.

Discussion and Conclusion

While visualization can be a powerful tool for understanding data, we have seen that there is no ideal technique. Each one has its advantages and disadvantages, and we have to be careful to understand what each one represents. In the case of dimensionality reduction in particular, we have to keep in mind that the distances in the reduced space are not completely faithful to those in the original.

Finally, we can note that since we are not trying to consider all the vocabulary at once, or even large parts of it, these visualization techniques can be applied to any word vector model.



[1] Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2014. Statistically significant detection of linguistic change. In Proc. 24th WWW Conf., pp. 625–635. International World Wide Web Conferences Steering Committee.

[2] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal historical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[3] Terrence Szymanski. 2017. Temporal Word Analogies: Identifying Lexical Replacement with Diachronic Word Embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers), pp. 448–453.