
Frequency effects in word embeddings

I've just posted a series of web pages and accompanying Jupyter notebooks about frequency effects in word embeddings. The short story is that frequency effects are pervasive in word embeddings and furthermore they differ from one embedding method to another.

Over the past several months I've been working with word embeddings, with the eventual goal of using them to help study language use in individual (or perhaps several) texts. Since word embeddings have been developed with very large corpora in mind (billions of words is not unusual) while I'm interested in small corpora, I thought it was worthwhile to understand how word embeddings might work with these corpora. My first round of posts (with Jupyter and R notebooks) was about how we might use word embeddings with small corpora, such as individual texts.

During that first round, I noticed some apparent frequency effects: aspects of the word embeddings that are connected with frequency. This is troublesome, since word embeddings are intended as a representation of (one aspect of) meaning, and meaning is assumed to be independent of frequency: the meaning of dog does not depend on how often it is used (though its frequency might have other effects related to meaning, as in semantic change).

In this second round of exploration, I look at a range of frequency effects in word embeddings, some of which have been reported in the literature, but many of which are new. As noted above, one of the interesting aspects is that different word embedding methods have different frequency effects, though the methods generally divide into two groups for any particular phenomenon.

There are still lots of questions, even more than before, but these posts provide more examples and points to ponder.