As we saw in previous chapters, methods like Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) represent text based on word counts or frequencies. While useful for tasks where word occurrence is a strong signal, these approaches have a significant limitation: they fail to capture the meaning or semantic relationships between words. For a TF-IDF model, the words 'cat' and 'feline' are just distinct tokens with potentially different frequency scores; the model has no inherent understanding that they refer to very similar concepts. Similarly, 'run' and 'ran' are treated as completely separate entities, despite their strong grammatical relationship.
This is where distributional semantics comes into play. The core idea, often summarized by the linguist J.R. Firth's statement, "You shall know a word by the company it keeps," is that words that frequently appear in similar contexts tend to have similar meanings. Instead of just counting words in documents, we analyze the patterns of words that typically appear around a given word.
In this setting, the "context" of a word usually refers to the words surrounding it within a certain window size. For example, consider the sentence:
"The quick brown fox jumps over the lazy dog."
If we choose a context window of size 2 (meaning 2 words before and 2 words after), the context for the word 'fox' would be ('quick', 'brown', 'jumps', 'over').
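To make this concrete, here is a minimal sketch in plain Python that extracts the context words for a token using a symmetric window. The function name `get_context` and the hard-coded window size are illustrative choices, not part of any standard library.

```python
def get_context(tokens, index, window=2):
    """Return the words within `window` positions before and after tokens[index]."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

sentence = "The quick brown fox jumps over the lazy dog".lower().split()
fox_index = sentence.index("fox")
print(get_context(sentence, fox_index, window=2))
# ['quick', 'brown', 'jumps', 'over']
```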
Now, imagine processing a massive amount of text (a corpus). We can observe the contexts in which different words appear. The distributional hypothesis suggests that words like 'cat' and 'feline' will often appear surrounded by similar words (e.g., 'meow', 'purr', 'pet', 'claws'), while a word like 'car' will appear in different contexts (e.g., 'drive', 'engine', 'road', 'wheel').
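One way to see this pattern in miniature is to count how often each word co-occurs with the words in its window. The sketch below uses Python's `Counter` over a tiny made-up corpus; it is only an illustration of the counting idea, not a real co-occurrence pipeline.

```python
from collections import defaultdict, Counter

# Toy corpus; a real corpus would contain millions of sentences.
corpus = [
    "my pet cat will purr and meow softly".split(),
    "that feline will purr and meow loudly".split(),
    "you drive the car on the road".split(),
]

def cooccurrence_counts(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            context = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts[word].update(context)
    return counts

counts = cooccurrence_counts(corpus)
print(counts["cat"] & counts["feline"])  # shared context words, e.g. 'purr'
print(counts["cat"] & counts["car"])     # empty: no shared context in this toy corpus
```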
(Figure: words like 'banking' and 'finance' tend to share similar neighboring words (context) across large amounts of text, distinct from the typical neighbors of words like 'river' and 'stream'.)
The fundamental insight of distributional semantics is that we can use these contextual patterns to create numerical representations of words, often called word embeddings or word vectors. The goal is to learn a vector for each word, such that words appearing in similar contexts have vectors that are close to each other in the vector space (e.g., their cosine similarity is high).
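As a small illustration of what "close in the vector space" means, the sketch below computes cosine similarity between hand-made vectors. The numbers are invented purely for demonstration; they are not learned embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means they point the same way."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 4-dimensional "embeddings" (real ones typically have 50-300 dimensions).
cat    = np.array([0.8, 0.1, 0.6, 0.2])
feline = np.array([0.7, 0.2, 0.5, 0.3])
car    = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(cat, feline))  # high: similar contexts, nearby vectors
print(cosine_similarity(cat, car))     # much lower: different contexts
```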
Think about it: if 'cat' and 'feline' constantly share context words like 'pet', 'purr', and 'meow', an algorithm designed to capture these co-occurrence patterns should naturally place their learned vectors near each other. This contrasts sharply with frequency-based methods, where any such similarity is coincidental, arising from document overlap rather than from shared neighboring words.
Essentially, we are shifting from sparse, high-dimensional representations based on counts (like BoW or TF-IDF vectors, where most entries are zero) to dense, lower-dimensional vectors (e.g., 50 to 300 dimensions) where each dimension captures some latent aspect of the word's meaning derived from its usage.
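The contrast in shape and sparsity can be made explicit with a quick sketch; the vocabulary size and embedding dimension below are illustrative values, not tied to any particular model.

```python
import numpy as np

vocab_size = 50_000      # illustrative vocabulary size for a count-based model
embedding_dim = 100      # illustrative size for a dense word embedding

# Sparse count-based representation: one slot per vocabulary word, mostly zeros.
bow_vector = np.zeros(vocab_size)
bow_vector[[120, 4832, 17651]] = 1       # only the words present get a count

# Dense embedding: every dimension holds some (learned) value.
dense_vector = np.random.uniform(-1, 1, size=embedding_dim)

print(f"BoW: {vocab_size} dims, {np.count_nonzero(bow_vector)} non-zero")
print(f"Embedding: {embedding_dim} dims, {np.count_nonzero(dense_vector)} non-zero")
```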
This approach doesn't just group synonyms. It can also capture more complex relationships, like analogies. For instance, a well-trained set of word embeddings might exhibit the relationship vector('king') − vector('man') + vector('woman') ≈ vector('queen'), simply because the contextual relationship between 'king' and 'man' mirrors that between 'queen' and 'woman' across the training corpus.
The idea of using co-occurrence statistics isn't entirely new; methods based on building large word-context matrices have existed for a while. However, these matrices are often enormous and sparse. The significant development came with algorithms like Word2Vec and GloVe, which provide efficient ways to learn these dense, meaningful vector representations directly from text data. These algorithms form the basis for many modern NLP applications and are the focus of the upcoming sections in this chapter.