Distributional and frequency effects in word embeddings: different and larger embeddings

© 2018 Chris Culy, August 2018

chrisculy.net

Overview

This is part of ongoing work on word embeddings. In the previous series of posts we saw a wide range of distributional and frequency effects with respect to word embeddings. In this series of posts, I will use a set of summary tests to look at another embedding technique (continuous bag of words, cbow) as well as embeddings based on large corpora.

TL;DR: Results and Contributions

High level only. More details are in the individual sections.

  • Summary tests

    • Summary tests for distributional and frequency effects
    • sgns and cbow show different frequency effects with Vanity Fair
  • Large corpora

    • Frequency encoding is stronger for the larger corpora than for Vanity Fair
    • Frequency stratification tends to be stronger for Vanity Fair than for the larger corpora
    • Frequency stratification tends to be direct for the larger corpora and indirect for Vanity Fair
    • Powerlaw for nearest neighbors is stronger for the larger corpora than for Vanity Fair
    • Similarity skewness is moderate