While Word2Vec models like Skip-gram and CBOW learn embeddings by predicting local context words within a sliding window, they don't directly leverage the vast amount of statistical information present across the entire corpus. Methods like Latent Semantic Analysis (LSA) do use global statistics (matrix factorization on term-document or term-term matrices), but they often perform relatively poorly on tasks like word analogy, which measure finer semantic relationships.
GloVe, standing for Global Vectors for Word Representation, developed by Jeffrey Pennington, Richard Socher, and Christopher D. Manning at Stanford, aims to bridge this gap. It's designed to capture the best of both worlds: the meaning-capturing capabilities demonstrated by local context prediction methods (like Word2Vec) and the statistical power derived from global matrix factorization techniques.
The central intuition behind GloVe is that ratios of word-word co-occurrence probabilities hold meaningful information. Consider the co-occurrence probabilities for words related to "ice" and "steam".
Let $P(k \mid w) = P_{wk}$ denote the probability that word $k$ appears in the context of word $w$. Now consider the ratio of these probabilities for different probe words $k$. For a probe word closely related to "ice" but not to "steam", such as "solid", the ratio $P(k \mid \text{ice}) / P(k \mid \text{steam})$ is large; for a word related to "steam" but not to "ice", such as "gas", it is small; and for words related to both ("water") or to neither ("fashion"), the ratio is close to 1.
GloVe hypothesizes that these ratios encode the relationships between words, and the goal is to learn word vectors that capture these ratios.
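To make this concrete, the sketch below computes such ratios from a small table of co-occurrence counts. The counts are entirely hypothetical, invented purely for illustration; only the qualitative pattern of the resulting ratios matters.

```python
# Entirely hypothetical co-occurrence counts, invented for illustration.
cooc = {
    "ice":   {"solid": 120, "gas": 4,   "water": 300, "fashion": 2},
    "steam": {"solid": 5,   "gas": 110, "water": 290, "fashion": 2},
}

def context_prob(word, k):
    """P(k | word): probability that probe word k appears in the context of word."""
    total = sum(cooc[word].values())
    return cooc[word][k] / total

for k in ["solid", "gas", "water", "fashion"]:
    ratio = context_prob("ice", k) / context_prob("steam", k)
    print(f"P({k}|ice) / P({k}|steam) = {ratio:.3f}")

# Expected pattern: a large ratio for "solid" (specific to ice), a small ratio
# for "gas" (specific to steam), and ratios near 1 for "water" and "fashion",
# which relate to both words or to neither.
```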
GloVe starts by constructing a large word-word co-occurrence matrix, denoted by $X$. An entry $X_{ij}$ in this matrix represents the number of times word $j$ (the context word) appears within a specific context window of word $i$ (the target word) across the entire corpus.
The definition of the "context window" is similar to that used in Word2Vec. Often, the contribution of a co-occurring word pair is weighted based on the distance between the words, with closer words contributing more. For instance, a weighting function like $1/d$ (where $d$ is the distance) might be used.
Let $X_i = \sum_k X_{ik}$ be the total number of times any word appears in the context of word $i$. Then, the probability of seeing word $j$ in the context of word $i$ is $P_{ij} = P(j \mid i) = X_{ij} / X_i$.
A conceptual representation of how word co-occurrences from a text corpus contribute to the entries in the co-occurrence matrix $X$.
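The following sketch shows one way such a matrix could be built in Python, using a symmetric context window and the $1/d$ distance weighting mentioned above. The function and variable names are illustrative; a real implementation for a large corpus would use a sparse matrix library and a fixed vocabulary.

```python
from collections import defaultdict

def build_cooccurrence(tokenized_docs, window_size=5):
    """Build a sparse word-word co-occurrence matrix X as a dict of (i, j) -> count.

    Each co-occurring pair contributes 1/d, where d is the distance between the
    target and context word, so nearer words contribute more.
    """
    vocab = {}               # word -> integer id
    X = defaultdict(float)   # (target id, context id) -> weighted count

    def word_id(w):
        if w not in vocab:
            vocab[w] = len(vocab)
        return vocab[w]

    for tokens in tokenized_docs:
        ids = [word_id(w) for w in tokens]
        for center, i in enumerate(ids):
            # Look only to the left and credit both directions, which yields
            # a symmetric context window without double counting.
            for offset in range(1, window_size + 1):
                if center - offset < 0:
                    break
                j = ids[center - offset]
                X[(i, j)] += 1.0 / offset
                X[(j, i)] += 1.0 / offset
    return X, vocab

# Usage on a tiny toy corpus:
docs = [["ice", "is", "a", "solid", "form", "of", "water"],
        ["steam", "is", "a", "gas", "form", "of", "water"]]
X, vocab = build_cooccurrence(docs, window_size=3)
```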
GloVe aims to learn vectors for each word such that their dot product relates directly to their probability of co-occurrence. The model starts with the general idea that the relationship between three words i, j, and k can be modeled using a function F acting on their word vectors:
$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$
Here, $w_i$ and $w_j$ are the vectors for the primary words, and $\tilde{w}_k$ is a separate context word vector for the probe word $k$. Using two sets of vectors ($w$ and $\tilde{w}$) makes the model more robust and easier to train, capturing the asymmetry in co-occurrence (e.g., $P(\text{solid} \mid \text{ice})$ is not necessarily the same as $P(\text{ice} \mid \text{solid})$).
Through mathematical derivation involving properties of vector differences and homomorphisms, the GloVe authors arrive at a specific form relating the dot product of word vectors to the logarithm of their co-occurrence count:
$$w_i^T \tilde{w}_j + b_i + \tilde{b}_j \approx \log(X_{ij})$$
Here, $b_i$ and $\tilde{b}_j$ are scalar bias terms for the target word $i$ and context word $j$, respectively. These biases help capture frequency effects independent of the vector interactions.
This relationship forms the basis of the GloVe objective function, which is a weighted least squares regression model:
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2$$
The weighting function $f(X_{ij})$ is important. It serves two main purposes:
- It gives zero weight to pairs that never co-occur ($f(0) = 0$), so the undefined $\log(0)$ term never enters the sum, and it down-weights rare co-occurrences, which tend to be noisy.
- It caps the weight of very frequent co-occurrences (for example, pairs involving common words such as "the"), so they do not dominate the objective.
A commonly used weighting function is:
$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$
where $x_{\max}$ is a threshold (e.g., 100) and $\alpha$ is typically set to $3/4$.
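Putting the last two formulas together, here is a minimal NumPy sketch of the weighting function and the objective $J$. It assumes the co-occurrence data is supplied as an array of $(i, j)$ index pairs with their corresponding $X_{ij}$ values; only pairs with $X_{ij} > 0$ need to be included, since $f(0) = 0$.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): rises as (x / x_max)^alpha, then is capped at 1."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_objective(W, W_tilde, b, b_tilde, pairs, counts):
    """Weighted least-squares objective J, summed over observed (i, j) pairs.

    W, W_tilde : (V, d) arrays of word and context vectors
    b, b_tilde : (V,) arrays of bias terms
    pairs      : (n, 2) integer array of (i, j) indices with X_ij > 0
    counts     : (n,) array holding the corresponding X_ij values
    """
    i, j = pairs[:, 0], pairs[:, 1]
    inner = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    diff = inner - np.log(counts)
    return np.sum(glove_weight(counts) * diff ** 2)
```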
The model is trained using techniques like AdaGrad to minimize the objective function $J$, learning the optimal values for the word vectors $w_i$, context vectors $\tilde{w}_j$, and biases $b_i, \tilde{b}_j$. Often, the final representation for a word $i$ is taken as the sum $w_i + \tilde{w}_i$.
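For illustration, the sketch below implements a bare-bones version of this training loop with per-parameter AdaGrad updates. The hyperparameters are placeholder values rather than the settings used for the published GloVe vectors, and the constant factor from differentiating the squared term is folded into the learning rate.

```python
import numpy as np

def train_glove(pairs, counts, vocab_size, dim=50, lr=0.05,
                x_max=100.0, alpha=0.75, epochs=10, seed=0):
    """Bare-bones GloVe training with AdaGrad (illustrative, not optimized).

    pairs  : (n, 2) integer array of (i, j) indices with X_ij > 0
    counts : (n,) float array holding the corresponding X_ij values
    """
    rng = np.random.default_rng(seed)
    W  = rng.uniform(-0.5, 0.5, (vocab_size, dim)) / dim
    Wt = rng.uniform(-0.5, 0.5, (vocab_size, dim)) / dim
    b, bt = np.zeros(vocab_size), np.zeros(vocab_size)
    # AdaGrad accumulators of squared gradients, initialized to 1 for stability.
    gW, gWt = np.ones_like(W), np.ones_like(Wt)
    gb, gbt = np.ones_like(b), np.ones_like(bt)

    log_x = np.log(counts)
    weights = np.where(counts < x_max, (counts / x_max) ** alpha, 1.0)

    for _ in range(epochs):
        for idx in rng.permutation(len(pairs)):
            i, j = pairs[idx]
            diff = W[i] @ Wt[j] + b[i] + bt[j] - log_x[idx]
            g = weights[idx] * diff          # shared factor in every gradient

            grad_wi, grad_wj = g * Wt[j], g * W[i]
            W[i]  -= lr * grad_wi / np.sqrt(gW[i])
            Wt[j] -= lr * grad_wj / np.sqrt(gWt[j])
            gW[i] += grad_wi ** 2
            gWt[j] += grad_wj ** 2

            b[i]  -= lr * g / np.sqrt(gb[i])
            bt[j] -= lr * g / np.sqrt(gbt[j])
            gb[i] += g ** 2
            gbt[j] += g ** 2

    # Common choice for the final embedding: sum of word and context vectors.
    return W + Wt
```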
Like Word2Vec, pre-trained GloVe vectors are widely available, trained on massive datasets like Wikipedia or Common Crawl. These pre-trained vectors often provide a strong starting point for various NLP tasks, saving significant computational effort compared to training from scratch. You can load these vectors and use them as input features for models handling tasks like text classification, sentiment analysis, or named entity recognition, similar to how you would use pre-trained Word2Vec embeddings.
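As a simple example, the pre-trained vectors distributed as plain-text files (one token per line, followed by its vector components) can be loaded with a few lines of Python. The file name below is just an example and assumes you have already downloaded it.

```python
import numpy as np

def load_glove(path):
    """Load pre-trained GloVe vectors from a plain-text file.

    Assumes the standard distribution format (e.g. glove.6B.100d.txt):
    one token per line, followed by its space-separated vector components.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Usage, assuming the file has already been downloaded:
# vectors = load_glove("glove.6B.100d.txt")
# print(vectors["ice"].shape)   # (100,) for the 100-dimensional vectors
```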
In the next sections, we will look at how to visualize these high-dimensional embeddings and discuss practical ways to load and use pre-trained models.