Having established the limitations of frequency-based methods like TF-IDF and the intuition behind distributional semantics, let's examine Word2Vec, a foundational technique for learning word embeddings. Developed by Tomas Mikolov and colleagues at Google in 2013, Word2Vec isn't a single algorithm but rather a family of model architectures and optimizations used to learn vector representations of words from large text corpora.
The central idea is elegantly simple: instead of just counting word occurrences, Word2Vec trains a shallow neural network on a proxy prediction task involving words and their contexts. The remarkable outcome is that the weights learned by the hidden layer of this network serve as the word embeddings themselves. These learned vectors capture surprisingly rich semantic relationships.
Word2Vec primarily comes in two architectural flavors: Continuous Bag-of-Words (CBOW) and Skip-gram. Let's look at each.
The CBOW architecture tries to predict the current target word based on its surrounding context words. Imagine you have the sentence "the quick brown fox jumps over". If your target word is "fox", and you define a context window of size 2 (meaning 2 words before and 2 words after), the context words would be "quick", "brown", "jumps", and "over". CBOW uses these context words as input to predict the target word "fox".
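To make the windowing concrete, here is a minimal sketch in plain Python that extracts the context words around a target position; the function and variable names are illustrative only:

```python
def context_window(tokens, target_index, window_size=2):
    """Return the words within window_size positions of tokens[target_index]."""
    start = max(0, target_index - window_size)
    before = tokens[start:target_index]
    after = tokens[target_index + 1:target_index + 1 + window_size]
    return before + after

tokens = "the quick brown fox jumps over".split()
print(context_window(tokens, tokens.index("fox")))
# ['quick', 'brown', 'jumps', 'over']
```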
How it works:

1. Each context word is mapped to its vector in the input embedding matrix (conceptually, a one-hot input selects one row of the weight matrix).
2. The context vectors are averaged (or summed) into a single hidden-layer representation.
3. This combined vector is projected through the output weights and a softmax to produce a probability distribution over the vocabulary, and the model is trained to assign high probability to the actual target word ("fox" in our example).
Conceptual flow of the CBOW architecture. Context word vectors are aggregated to predict the central target word.
CBOW tends to be computationally faster than Skip-gram and often provides slightly better representations for more frequent words. It essentially smooths over the distributional information from the context.
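The following NumPy sketch shows the shape of that computation for a toy vocabulary. The matrices, dimensions, and vocabulary here are made up purely for illustration; a real model would be trained, not initialized randomly:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "quick", "brown", "fox", "jumps", "over"]
word_to_id = {w: i for i, w in enumerate(vocab)}

vocab_size, embed_dim = len(vocab), 8             # toy sizes
W_in = rng.normal(size=(vocab_size, embed_dim))   # input embeddings (the vectors we ultimately keep)
W_out = rng.normal(size=(embed_dim, vocab_size))  # output weights used only during training

def cbow_forward(context_words):
    """Average the context vectors and score every vocabulary word."""
    ids = [word_to_id[w] for w in context_words]
    hidden = W_in[ids].mean(axis=0)               # aggregate the context embeddings
    scores = hidden @ W_out                       # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()                    # softmax over the vocabulary

probs = cbow_forward(["quick", "brown", "jumps", "over"])
print(vocab[int(np.argmax(probs))])               # the model's current (untrained) guess
```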
The Skip-gram architecture flips the prediction task: instead of predicting the target word from the context, it tries to predict the surrounding context words given the target word. Using our previous example, if the input target word is "fox", the model aims to predict words like "quick", "brown", "jumps", and "over".
How it works:

1. The target word is mapped to its vector in the input embedding matrix.
2. That single vector is projected through the output weights and a softmax to produce a probability distribution over the vocabulary.
3. The model is trained so that each of the actual context words receives high probability; every (target, context) pair in the window contributes its own training example.
Conceptual flow of the Skip-gram architecture. The central target word vector is used to predict surrounding context words.
Skip-gram generally takes longer to train than CBOW because predicting multiple context words from a single target word is a more complex task, requiring more updates. However, it often performs better with smaller amounts of training data and is considered particularly effective at learning good representations for rare words or phrases. Each (target, context) pair provides a learning signal, whereas CBOW averages the context first.
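To illustrate how each (target, context) pair becomes its own training example, here is a small sketch that enumerates Skip-gram pairs from a tokenized sentence; the function and variable names are our own:

```python
def skipgram_pairs(tokens, window_size=2):
    """Yield (target, context) training pairs for every position in the sentence."""
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        for j in range(start, min(len(tokens), i + window_size + 1)):
            if j != i:
                yield target, tokens[j]

tokens = "the quick brown fox jumps over".split()
print([pair for pair in skipgram_pairs(tokens) if pair[0] == "fox"])
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```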
Training these models efficiently on massive corpora requires optimization techniques. Calculating the softmax over the entire vocabulary (which can contain millions of words) for every prediction is computationally prohibitive. Two common strategies to address this are:

- Hierarchical softmax, which replaces the flat softmax with a binary tree over the vocabulary, so computing a word's probability requires on the order of log |V| operations instead of |V|.
- Negative sampling, which reframes the task: rather than normalizing over the whole vocabulary, the model learns to distinguish the observed (target, context) pair from a small number of randomly sampled "negative" words.
While understanding the details of these optimizations is useful, libraries like gensim in Python handle their implementation, allowing you to focus on applying Word2Vec.
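As a rough sketch of what that looks like in practice, assuming the gensim 4.x API and a tiny toy corpus that you would replace with real tokenized sentences:

```python
from gensim.models import Word2Vec

# Toy corpus: in practice this would be millions of tokenized sentences.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "quick", "brown", "fox", "is", "quick"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window size on each side of the target
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    negative=5,       # negative sampling with 5 noise words per example
    min_count=1,      # keep every word in this tiny corpus
)

vector = model.wv["fox"]             # the learned embedding for "fox"
print(model.wv.most_similar("fox"))  # nearest neighbours in the vector space
```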
In essence, both CBOW and Skip-gram leverage simple neural network architectures trained on cleverly designed predictive tasks. The true product isn't the network's predictive capability itself, but the internal representations (the word embeddings) learned along the way. These embeddings map words into a vector space where distance and direction correspond to semantic and syntactic relationships.
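For instance, with a model trained on a sufficiently large corpus, directions in this space can be queried for the classic analogy patterns. This sketch reuses the hypothetical model object from the gensim example above and assumes the relevant words actually appear in its vocabulary:

```python
# Which word relates to "woman" as "king" relates to "man"?
# On a well-trained model, "queen" typically ranks near the top.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Cosine similarity between two specific words.
print(model.wv.similarity("king", "queen"))
```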