As we saw previously, methods like TF-IDF represent documents based on word frequencies. While useful, these representations treat words as independent units. They don't inherently understand that 'cat' and 'kitten' are related, or that 'running' and 'jogging' share semantic similarity. To a frequency-based model, every distinct word is simply a separate, unrelated feature: nothing in the representation connects two related words unless they happen to appear in very similar documents within the training set. This limits their effectiveness for tasks requiring a deeper understanding of language meaning.
To overcome this, we need a way to represent words that captures their semantic relationships. This leads us to the concept of word embeddings, which are built upon the idea of distributional semantics. The core principle, often summarized as "you shall know a word by the company it keeps," suggests that words appearing in similar contexts likely have related meanings. For instance, words like 'coffee' and 'tea' often appear in contexts like "cup of ____," "drank some ____," or "hot ____." By analyzing these co-occurrence patterns across vast amounts of text, we can learn representations that place 'coffee' and 'tea' closer together in a conceptual space than, say, 'coffee' and 'C++ programming language'.
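To make the idea concrete, here is a minimal sketch of context counting over a tiny invented corpus. The sentences, the window size of 2, and the `context_counts` helper are all ours, purely for illustration: words that keep similar company, like 'coffee' and 'tea', end up with overlapping context counts.

```python
from collections import Counter

# Tiny invented corpus, purely for illustration.
corpus = [
    "drank a hot cup of coffee this morning",
    "drank a hot cup of tea this evening",
    "wrote a parser in the C++ programming language",
]

def context_counts(target, sentences, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), i + window + 1
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

print(context_counts("coffee", corpus))  # Counter({'cup': 1, 'of': 1, 'this': 1, 'morning': 1})
print(context_counts("tea", corpus))     # shares 'cup', 'of', and 'this' with 'coffee'
```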
Instead of the high-dimensional, sparse vectors produced by techniques like Bag-of-Words or TF-IDF (where most entries are zero and the vector length equals the vocabulary size), word embeddings represent words as dense vectors in a lower-dimensional space. Typically, a word $w$ is mapped to a vector $v_w \in \mathbb{R}^d$, where $d$ is the embedding dimension (a hyperparameter, often between 50 and 300).
$$v_{\text{word}} = [x_1, x_2, \ldots, x_d]$$

Each dimension $x_i$ in this vector represents a latent feature of the word's meaning learned from the data. Unlike engineered features, these dimensions aren't usually interpretable in human terms (e.g., dimension 1 doesn't explicitly mean 'is an animal' or 'is edible'). Instead, they collectively capture complex nuances of meaning, usage, and relationships derived from the word's contextual patterns in the training corpus. The key is that the entire vector represents the word's meaning in relation to other words.
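In code, this mapping is just a lookup table: a matrix with one $d$-dimensional row per vocabulary word. The following sketch uses NumPy with a randomly initialized matrix and a tiny invented vocabulary; in a real model these rows are learned during training rather than drawn at random.

```python
import numpy as np

# Tiny invented vocabulary; real vocabularies contain tens of thousands of words.
vocab = ["cat", "kitten", "coffee", "tea", "car"]
word_to_index = {word: i for i, word in enumerate(vocab)}

d = 4  # embedding dimension; commonly 50-300 in practice
rng = np.random.default_rng(0)

# Placeholder values: a trained model would learn these rows from data.
embedding_matrix = rng.normal(size=(len(vocab), d))

def embed(word):
    """Look up the dense d-dimensional vector for a word."""
    return embedding_matrix[word_to_index[word]]

print(embed("cat"))        # a dense vector with d entries
print(embed("cat").shape)  # (4,)
```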
The magic of word embeddings lies in the geometric relationships between these vectors in the d-dimensional space.
Semantic Similarity: Words with similar meanings tend to have vectors that are close to each other. We can measure this closeness using distance metrics like Euclidean distance or, more commonly, cosine similarity. Cosine similarity measures the cosine of the angle between two vectors: it ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction, i.e., appearing in near-identical contexts), with 0 indicating orthogonality (no measurable relationship).
$$\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{d} A_i B_i}{\sqrt{\sum_{i=1}^{d} A_i^2} \, \sqrt{\sum_{i=1}^{d} B_i^2}}$$

Thus, the vectors for 'car' and 'automobile' should have a high cosine similarity, while 'car' and 'banana' should have a low similarity.
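The formula above translates directly into a few lines of NumPy. In this sketch the vectors for 'car', 'automobile', and 'banana' are hand-picked placeholders rather than trained embeddings, chosen only so the two comparisons behave as described.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-picked placeholder vectors; real values would come from a trained model.
v_car = np.array([0.80, 0.10, 0.60, 0.20])
v_automobile = np.array([0.75, 0.15, 0.55, 0.25])
v_banana = np.array([-0.30, 0.90, -0.20, 0.40])

print(cosine_similarity(v_car, v_automobile))  # close to 1: nearly the same direction
print(cosine_similarity(v_car, v_banana))      # much lower (around -0.18)
```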
Semantic Analogies: Remarkably, these vector spaces often capture relational similarities. The classic example is the analogy "man is to woman as king is to queen." In the vector space, this relationship might manifest as:
$$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$$

This suggests that the vector difference $v_{\text{king}} - v_{\text{man}}$ captures a concept like 'royalty minus maleness,' and adding $v_{\text{woman}}$ results in a vector very close to $v_{\text{queen}}$. Similar analogies like "France is to Paris as Germany is to Berlin" ($v_{\text{Paris}} - v_{\text{France}} + v_{\text{Germany}} \approx v_{\text{Berlin}}$) can also be observed.
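Solving such an analogy programmatically is just vector arithmetic followed by a nearest-neighbour search under cosine similarity. The sketch below is our own illustration, not a library API: the `most_similar` and `analogy` helpers are hypothetical names, and the toy 2D vectors are constructed so the analogy works by design.

```python
import numpy as np

def most_similar(query, embeddings, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    best_word, best_score = None, -np.inf
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        score = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

def analogy(a, b, c, embeddings):
    """Solve 'a is to b as c is to ?' by searching near v_c - v_a + v_b."""
    query = embeddings[c] - embeddings[a] + embeddings[b]
    return most_similar(query, embeddings, exclude=(a, b, c))

# Toy 2D vectors (gender axis, royalty axis), invented so the analogy holds by construction.
embeddings = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "apple": np.array([ 0.1, -0.9]),
}
print(analogy("man", "woman", "king", embeddings))  # 'queen', mirroring v_king - v_man + v_woman
```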
Consider a simplified 2D visualization of where certain words might lie relative to each other in an embedding space:
A simplified representation showing how related words (king/man, queen/woman, apple/orange) might cluster together in the embedding space, separate from unrelated words (car). Actual embedding spaces are much higher dimensional.
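To produce a plot like this yourself, a common recipe is to project the high-dimensional vectors down to 2D (for example with PCA) and scatter-plot them with labels. The sketch below uses random placeholder vectors, so the layout it produces is meaningless; substitute vectors from a trained model to see real clusters.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Random placeholder "embeddings"; replace with vectors from a trained model.
rng = np.random.default_rng(0)
words = ["king", "man", "queen", "woman", "apple", "orange", "car"]
vectors = rng.normal(size=(len(words), 50))  # pretend these are 50-dimensional embeddings

# Project from 50 dimensions down to 2 for plotting.
coords = PCA(n_components=2).fit_transform(vectors)

plt.figure(figsize=(5, 5))
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), textcoords="offset points", xytext=(4, 4))
plt.title("2D projection of word embeddings")
plt.show()
```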
These dense vector representations serve as powerful input features for various machine learning models applied to NLP tasks. Instead of feeding one-hot encoded vectors or TF-IDF scores into a classifier or sequence model, we can use the corresponding word embeddings. This often leads to significant improvements in performance because the model starts with features that already encode semantic information, reducing the burden on the model to learn these relationships from scratch.
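A minimal way to turn embeddings into model inputs is to average the vectors of the words in each document, producing one fixed-length feature vector per document. The sketch below uses a tiny random lookup table as a stand-in for pretrained embeddings; the resulting feature matrix could then be fed to any standard classifier.

```python
import numpy as np

d = 4  # embedding dimension, kept tiny here; typically 50-300
rng = np.random.default_rng(0)

# Placeholder lookup table; in practice, load pretrained Word2Vec or GloVe vectors instead.
embeddings = {w: rng.normal(size=d) for w in ["the", "cat", "drank", "hot", "tea"]}

def document_vector(tokens, table, dim=d):
    """Average the embeddings of the known words in a tokenized document."""
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        return np.zeros(dim)  # fallback for documents with no in-vocabulary words
    return np.mean(vectors, axis=0)

documents = ["the cat drank hot tea", "the dog slept"]
features = np.stack([document_vector(doc.split(), embeddings) for doc in documents])
print(features.shape)  # (2, 4): one dense feature vector per document, ready for a classifier
```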
In the following sections, we will investigate specific algorithms like Word2Vec and GloVe that learn these vector representations from large text corpora. We will also see how to use pre-computed embeddings, saving significant training time and leveraging knowledge learned from massive datasets.