Distributional and frequency effects in word embeddings: different and larger embeddings

Overview

This is part of ongoing work on word embeddings. The previous series of posts showed a wide range of distributional and frequency effects in word embeddings. In this series, I use a set of summary tests to look at another embedding technique (continuous bag of words, cbow) as well as embeddings based on large corpora.
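As a reminder of how the two techniques differ, here is a minimal sketch of the training examples each one is built from. Skip-gram (the model behind sgns) predicts each context word from the center word, while continuous bag of words predicts the center word from its pooled context. The function names, the fixed symmetric window, and the toy sentence are illustrative assumptions, not code from these posts; real word2vec implementations also add negative sampling, subsampling of frequent words, and a randomly shrunk window, all omitted here.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return one (center, context) pair per context word, as in skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Return one (context words, center) example per position, as in cbow."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:  # skip positions that have no context at all
            examples.append((context, center))
    return examples


# Toy sentence (hypothetical, not taken from the corpus used in the posts).
tokens = "becky sharp left the academy".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

With the same window, skip-gram produces one training pair per context word, whereas cbow produces one example per position. A frequent word therefore enters the two objectives in different ways, which is one plausible reason to compare their frequency effects separately.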

TL;DR: Results and Contributions

High level only. More details are in the sections that follow.

• Summary tests

    • Summary tests for distributional and frequency effects
    • sgns and cbow show different frequency effects with Vanity Fair

• Large corpora

    • Frequency encoding is stronger for the larger corpora than for Vanity Fair
    • Frequency stratification tends to be stronger for Vanity Fair than for the larger corpora
    • Frequency stratification tends to be direct for the larger corpora and indirect for Vanity Fair
    • The power law for nearest neighbors is stronger for the larger corpora than for Vanity Fair
    • Similarity skewness is moderate