Having established the limitations of frequency-based methods like TF-IDF and the intuition behind distributional semantics, let's examine Word2Vec, a foundational technique for learning word embeddings. Developed by Tomas Mikolov and colleagues at Google in 2013, Word2Vec isn't a single algorithm but rather a family of model architectures and optimizations used to learn vector representations of words from large text corpora.
The central idea is elegantly simple: instead of just counting word occurrences, Word2Vec trains a shallow neural network on a proxy prediction task involving words and their contexts. The remarkable outcome is that the weights learned by the hidden layer of this network serve as the word embeddings themselves. These learned vectors capture surprisingly rich semantic relationships.
Word2Vec primarily comes in two architectural flavors: Continuous Bag-of-Words (CBOW) and Skip-gram. Let's look at each.
The CBOW architecture tries to predict the current target word based on its surrounding context words. Imagine you have the sentence "the quick brown fox jumps over". If your target word is "fox", and you define a context window of size 2 (meaning 2 words before and 2 words after), the context words would be "quick", "brown", "jumps", and "over". CBOW uses these context words as input to predict the target word "fox".
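To make the windowing concrete, here is a minimal sketch in plain Python that extracts the context words around a target position; the function and variable names are illustrative only:

```python
def context_window(tokens, target_index, window_size=2):
    """Return the words within window_size positions of tokens[target_index]."""
    start = max(0, target_index - window_size)
    before = tokens[start:target_index]
    after = tokens[target_index + 1:target_index + 1 + window_size]
    return before + after

tokens = "the quick brown fox jumps over".split()
print(context_window(tokens, tokens.index("fox")))
# ['quick', 'brown', 'jumps', 'over']
```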
How it works:

1. Each context word is mapped to its vector in the input embedding matrix (conceptually, a one-hot input selects one row of the weight matrix).
2. The context vectors are averaged (or summed) into a single hidden-layer representation.
3. This combined vector is projected through the output weights and a softmax to produce a probability distribution over the vocabulary, and the model is trained to assign high probability to the actual target word ("fox" in our example).
Conceptual flow of the CBOW architecture. Context word vectors are aggregated to predict the central target word.
CBOW tends to be computationally faster than Skip-gram and often provides slightly better representations for more frequent words. It essentially smooths over the distributional information from the context.
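The following NumPy sketch shows the shape of that computation for a toy vocabulary. The matrices, dimensions, and vocabulary here are made up purely for illustration; a real model would be trained, not initialized randomly:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "quick", "brown", "fox", "jumps", "over"]
word_to_id = {w: i for i, w in enumerate(vocab)}

vocab_size, embed_dim = len(vocab), 8             # toy sizes
W_in = rng.normal(size=(vocab_size, embed_dim))   # input embeddings (the vectors we ultimately keep)
W_out = rng.normal(size=(embed_dim, vocab_size))  # output weights used only during training

def cbow_forward(context_words):
    """Average the context vectors and score every vocabulary word."""
    ids = [word_to_id[w] for w in context_words]
    hidden = W_in[ids].mean(axis=0)               # aggregate the context embeddings
    scores = hidden @ W_out                       # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()                    # softmax over the vocabulary

probs = cbow_forward(["quick", "brown", "jumps", "over"])
print(vocab[int(np.argmax(probs))])               # the model's current (untrained) guess
```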
The Skip-gram architecture flips the prediction task: instead of predicting the target word from the context, it tries to predict the surrounding context words given the target word. Using our previous example, if the input target word is "fox", the model aims to predict words like "quick", "brown", "jumps", and "over".
How it works:

1. The target word is mapped to its vector in the input embedding matrix.
2. That single vector is projected through the output weights and a softmax to produce a probability distribution over the vocabulary.
3. The model is trained so that each of the actual context words receives high probability; every (target, context) pair in the window contributes its own training example.
Conceptual flow of the Skip-gram architecture. The central target word vector is used to predict surrounding context words.
Skip-gram generally takes longer to train than CBOW because predicting multiple context words from a single target word is a more complex task, requiring more updates. However, it often performs better with smaller amounts of training data and is considered particularly effective at learning good representations for rare words or phrases. Each (target, context) pair provides a learning signal, whereas CBOW averages the context first.
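To illustrate how each (target, context) pair becomes its own training example, here is a small sketch that enumerates Skip-gram pairs from a tokenized sentence; the function and variable names are our own:

```python
def skipgram_pairs(tokens, window_size=2):
    """Yield (target, context) training pairs for every position in the sentence."""
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        for j in range(start, min(len(tokens), i + window_size + 1)):
            if j != i:
                yield target, tokens[j]

tokens = "the quick brown fox jumps over".split()
print([pair for pair in skipgram_pairs(tokens) if pair[0] == "fox"])
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```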
Training these models efficiently on massive corpora requires optimization techniques. Calculating the softmax over the entire vocabulary (which can contain millions of words) for every prediction is computationally prohibitive. Two common strategies to address this are:

- Hierarchical softmax, which replaces the flat softmax with a binary tree over the vocabulary, so computing a word's probability requires on the order of log |V| operations instead of |V|.
- Negative sampling, which reframes the task: rather than normalizing over the whole vocabulary, the model learns to distinguish the observed (target, context) pair from a small number of randomly sampled "negative" words.
While understanding the details of these optimizations is useful, libraries like gensim in Python handle their implementation, allowing you to focus on applying Word2Vec.
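As a rough sketch of what that looks like in practice, assuming the gensim 4.x API and a tiny toy corpus that you would replace with real tokenized sentences:

```python
from gensim.models import Word2Vec

# Toy corpus: in practice this would be millions of tokenized sentences.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "quick", "brown", "fox", "is", "quick"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window size on each side of the target
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    negative=5,       # negative sampling with 5 noise words per example
    min_count=1,      # keep every word in this tiny corpus
)

vector = model.wv["fox"]             # the learned embedding for "fox"
print(model.wv.most_similar("fox"))  # nearest neighbours in the vector space
```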
In essence, both CBOW and Skip-gram leverage simple neural network architectures trained on cleverly designed predictive tasks. The true product isn't the network's predictive capability itself, but the internal representations (the word embeddings) learned along the way. These embeddings map words into a vector space where distance and direction correspond to semantic and syntactic relationships.
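For instance, with a model trained on a sufficiently large corpus, directions in this space can be queried for the classic analogy patterns. This sketch reuses the hypothetical model object from the gensim example above and assumes the relevant words actually appear in its vocabulary:

```python
# Which word relates to "woman" as "king" relates to "man"?
# On a well-trained model, "queen" typically ranks near the top.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Cosine similarity between two specific words.
print(model.wv.similarity("king", "queen"))
```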