Word embeddings and their visualizations

© 2018 Chris Culy

chrisculy.net

This site lets you create a variety of simple word embeddings and compare how they behave on small texts (a single book or so). The comparison is done through a series of visualizations, both of the word vectors themselves and of the results of the "most similar" and "analogy" comparisons. Visualizing these comparisons is not typical, but I find it useful for getting a better sense of how the embeddings behave. The "analogy" comparisons also let you choose different ways of making comparisons and what is included.

For both creating and visualizing word embeddings, several examples are provided so you can get going right away. The visualizations can also be used with word embeddings calculated in other programs, and in fact, one of the examples is from the Stanford GloVe embeddings.

Notes about processing

All the calculations are done in JavaScript on your computer; no files are uploaded. This means that creating word vectors should probably be limited to small to medium texts, such as one or a few books.

The SVD option is recommended, but it is fairly slow. For example, with a minimum word count of 5, the Wizard of Oz books took about 5 minutes to finish using Firefox on my laptop, Frankenstein took about 45 minutes, and the Three Musketeers (not included here) took about 4 hours. Note that Safari is 3 or more times as fast as Firefox in calculating SVD, while Chrome is somewhere in between.

To speed up the dimensionality reduction of SVD, this tool allows you to do random projection first, as an initial quick dimensionality reduction (say to 5-10 times the final number of dimensions, but less than the number of vocabulary items). Then SVD can be done in a second step to get the desired number of dimensions. Doing random projection before SVD gives similar, though not identical, results to doing SVD alone. The advantage is speed. For example, instead of 45 minutes, the Frankenstein example took 10 minutes with random projection (and only 3 minutes in Safari instead of Firefox). See the blog post (TBD) for more information.
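The random projection step can be sketched roughly as follows. This is a minimal Python illustration, not the site's JavaScript code: the function name `random_projection`, the choice of a Gaussian projection matrix, and the toy count matrix are all assumptions made for the example. The SVD step that would follow on the smaller matrix is omitted.

```python
import random


def random_projection(matrix, out_dims, seed=0):
    """Project each row of `matrix` (a list of equal-length lists) down to `out_dims` columns."""
    rng = random.Random(seed)
    in_dims = len(matrix[0])
    # Scaling by 1/sqrt(out_dims) keeps vector lengths roughly comparable on average.
    scale = 1.0 / out_dims ** 0.5
    # Hypothetical choice: a dense Gaussian projection matrix of shape in_dims x out_dims.
    proj = [[rng.gauss(0.0, 1.0) * scale for _ in range(out_dims)]
            for _ in range(in_dims)]
    reduced = []
    for row in matrix:
        new_row = [0.0] * out_dims
        for i, value in enumerate(row):
            if value == 0:
                continue  # co-occurrence matrices are mostly zeros, so skip them
            for j in range(out_dims):
                new_row[j] += value * proj[i][j]
        reduced.append(new_row)
    return reduced


# Toy matrix: 4 words (rows) by 6 context columns, with made-up counts.
counts = [
    [2, 0, 1, 0, 0, 3],
    [0, 1, 0, 4, 0, 0],
    [1, 0, 0, 0, 2, 1],
    [0, 3, 0, 1, 0, 0],
]
reduced = random_projection(counts, out_dims=3)
print(len(reduced), len(reduced[0]))
```

The projection itself is a single matrix multiplication, which is why it is so much cheaper than running SVD on the full matrix.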

Loading pre-calculated vectors is a better option: loading the included 100,000 GloVe vectors takes just a few seconds. However, memory will be an issue with large numbers of vectors and/or dimensions.
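For vectors calculated in other programs, the GloVe downloads use a plain text layout with one word per line followed by its space-separated values. A small Python sketch of reading that layout is below; the function name `parse_vectors`, the `max_words` cap, and the sample lines are made up for illustration, and whether this site expects exactly this layout is an assumption.

```python
def parse_vectors(lines, max_words=None):
    """Read 'word v1 v2 ...' lines into a dict mapping each word to a list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
        # Capping the number of words is one simple way to limit memory use.
        if max_words is not None and len(vectors) >= max_words:
            break
    return vectors


# Made-up sample lines, not real GloVe values.
sample = [
    "the 0.1 -0.2 0.3",
    "cat 0.5 0.4 -0.1",
    "dog 0.45 0.38 -0.05",
]
vecs = parse_vectors(sample, max_words=2)
print(sorted(vecs))
```

With a real file, the same function could be given an open file object instead of a list, since it only iterates over lines.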

Acknowledgments:

Texts are from Project Gutenberg
Stanford GloVe embeddings
Matrix calculations are done using numeric.js, ml-pca, and tsnejs
Visualizations are done using Vega-Lite
Busy animated gif is from http://www.andrewdavidson.com/articles/spinning-wait-icons/, used under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License

Settings

Text: Select one or more text files, OR select a sample text
Tokenization: Case-insensitive, Omit punctuation
Matrix type: Window, Minimum number of word occurrences, Pointwise Mutual Information, Non-negative, Smoothing
Dimensionality reduction: Random projection (Number of random projection dimensions), Use SVD (Number of SVD dimensions)
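To give a sense of what the tokenization and matrix settings control, here is a rough Python sketch of a windowed co-occurrence count followed by pointwise mutual information. It is not the site's code. The function names, the rule of dropping rare words before windowing, and the use of context-distribution smoothing with an exponent of 0.75 are all assumptions; the "Smoothing" option here may work differently.

```python
import math
import re
from collections import Counter


def tokenize(text, lowercase=True, omit_punctuation=True):
    """Split text into word tokens, optionally keeping punctuation marks as tokens."""
    if lowercase:
        text = text.lower()
    pattern = r"\w+" if omit_punctuation else r"\w+|[^\w\s]"
    return re.findall(pattern, text)


def cooccurrence(tokens, window=2, min_count=1):
    """Count (word, context) pairs within `window` tokens on either side."""
    freq = Counter(tokens)
    kept = [t for t in tokens if freq[t] >= min_count]  # assumed: rare words dropped first
    pairs = Counter()
    for i, word in enumerate(kept):
        for j in range(max(0, i - window), min(len(kept), i + window + 1)):
            if j != i:
                pairs[(word, kept[j])] += 1
    return pairs


def pmi(pairs, smoothing=0.75, non_negative=True):
    """Pointwise mutual information per pair; negatives clipped to 0 if non_negative."""
    word_totals, context_totals = Counter(), Counter()
    for (w, c), n in pairs.items():
        word_totals[w] += n
        context_totals[c] += n
    total = sum(pairs.values())
    # Assumed smoothing: raise context counts to a power below 1 before normalizing.
    smoothed = {c: n ** smoothing for c, n in context_totals.items()}
    smoothed_total = sum(smoothed.values())
    scores = {}
    for (w, c), n in pairs.items():
        p_wc = n / total
        p_w = word_totals[w] / total
        p_c = smoothed[c] / smoothed_total
        value = math.log2(p_wc / (p_w * p_c))
        scores[(w, c)] = max(0.0, value) if non_negative else value
    return scores


tokens = tokenize("The cat sat. The cat ran!")
pairs = cooccurrence(tokens, window=1)
scores = pmi(pairs)
print(pairs[("the", "cat")])
```

Each row of the resulting matrix (all the scores for one word) is that word's vector before any dimensionality reduction.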

Name:

Embeddings in progress

While the embedding is being created, click on the button to stop it.

Finished embeddings, available to visualize

After the embedding is finished, click on the button to save it to your computer.

Step 1: Load one or more embeddings, either your own or one of the samples.
Load embedding:
AND/OR: select one or more sample embeddings:

Step 2: Check the embeddings you actually want to visualize.

Loaded Embeddings:

Step 3: Choose one of the visualizations and fill in the forms, then click show.
For example, in the "items" field you might put asked,answered.

Scatterplot Most similar Analogy

Comma separated items: Closest:

Comma separated items: Closest: Visualization type:
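As a rough idea of what a "most similar" comparison computes, here is a small Python sketch that ranks words by cosine similarity to a chosen item. Cosine similarity is the usual measure for this, but whether this site uses exactly it is an assumption, and the function names and toy vectors below are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(vectors, item, closest=3):
    """Return the `closest` words nearest to `item`, best first, with their scores."""
    target = vectors[item]
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word != item]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:closest]]


# Toy 3-dimensional vectors with made-up values.
toy = {
    "asked": [1.0, 0.2, 0.0],
    "answered": [0.9, 0.3, 0.1],
    "replied": [0.8, 0.4, 0.0],
    "tree": [0.0, 0.1, 1.0],
}
for word, score in most_similar(toy, "asked", closest=2):
    print(word, round(score, 3))
```

Running the same query against several embeddings and plotting the scores side by side is the kind of comparison the visualization is meant to support.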

Analogy: A is to B as C is to X

For example, using the provided GloVe vectors: italy is to rome as france is to X.

___ is to ___ as ___ is to X
Method: Best: Exclude: Visualization type:
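One common way to answer "A is to B as C is to X" is the vector offset method: compute B - A + C and look for the words closest to that point. The Python sketch below illustrates it under some assumptions: the function names are hypothetical, the toy vectors are made up rather than taken from GloVe, and this may not match any of the methods offered in the form. The `exclude_inputs` flag shows why an "Exclude" choice matters, since the input words themselves often score highly.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c, best=1, exclude_inputs=True):
    """Rank candidates for X in 'a is to b as c is to X' by closeness to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    skip = {a, b, c} if exclude_inputs else set()
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word not in skip]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:best]]


# Toy vectors with made-up values (not real GloVe values).
toy = {
    "italy": [1.0, 0.0, 0.0],
    "rome": [1.0, 0.0, 1.0],
    "france": [0.0, 1.0, 0.0],
    "paris": [0.0, 1.0, 1.0],
    "pizza": [0.9, 0.0, 0.1],
}
print(analogy(toy, "italy", "rome", "france", best=1))
print(analogy(toy, "italy", "rome", "france", best=2, exclude_inputs=False))
```

In the second call the input word france appears among the best results, which is the situation the exclusion setting is there to handle.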