Word embeddings, like those generated by Word2Vec or GloVe, represent words as dense vectors, often residing in spaces with hundreds of dimensions (e.g., w ∈ ℝ³⁰⁰). While these high-dimensional representations capture rich semantic information, they are impossible for humans to visualize directly. How can we gain an intuitive understanding of the relationships these models learn? How can we check whether words we expect to be similar actually end up close together in the embedding space?
This is where dimensionality reduction techniques become indispensable tools. By projecting the high-dimensional word vectors down into two or three dimensions, we can create scatter plots that help us visually inspect the learned relationships. Two widely used methods for this purpose are Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE).
PCA is a linear technique that identifies the directions (principal components) in the data that capture the maximum amount of variance. It transforms the original high-dimensional vectors into a lower-dimensional space while trying to preserve as much of the global structure and variance as possible.
Imagine trying to represent a 3D cloud of points on a 2D sheet of paper. PCA finds the best "angle" to view the cloud from, such that the projection onto the paper spreads the points out as much as possible, capturing the primary axes of variation.
While PCA is computationally efficient and deterministic (it always produces the same result for the same data), its focus on variance means it might not always be the best at preserving the local structure or the fine-grained similarities between nearby points in the original high-dimensional space.
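To make this concrete, here is a minimal sketch of reducing word vectors to two dimensions with scikit-learn's PCA. The `word_vectors` array is a random stand-in used purely for illustration; in practice you would substitute the vectors pulled from your trained Word2Vec or GloVe model.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for real embeddings: 10 words, 300 dimensions each.
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(10, 300))

# Keep the two directions that capture the most variance.
pca = PCA(n_components=2)
coords_2d = pca.fit_transform(word_vectors)

print(coords_2d.shape)                 # (10, 2)
print(pca.explained_variance_ratio_)   # fraction of variance captured by each component
```

Because PCA is deterministic, rerunning this on the same vectors always yields the same 2D coordinates, which makes it a convenient first check.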
t-SNE is a non-linear dimensionality reduction technique particularly well suited for visualizing high-dimensional data in low dimensions (typically 2D or 3D). Unlike PCA, t-SNE explicitly models the similarity between points: it tries to ensure that points that are close together in the high-dimensional space remain close together in the low-dimensional map, prioritizing these local neighborhoods over the faithful preservation of large distances.
t-SNE often produces more compelling visualizations that reveal clusters and local structures within the data. If Word2Vec learned meaningful relationships, t-SNE can often make these apparent, showing related words grouped together.
However, t-SNE has some considerations: it is stochastic, so different runs can produce different layouts unless a random seed is fixed; it is considerably slower than PCA, especially for large vocabularies; its output is sensitive to hyperparameters such as perplexity; and the sizes of clusters and the distances between them in the resulting plot do not reliably reflect distances in the original high-dimensional space.
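The sketch below applies scikit-learn's t-SNE to a stand-in array of word vectors. The `perplexity` and `random_state` arguments relate directly to the considerations just mentioned: fixing the seed makes runs reproducible, and perplexity is the main knob controlling how local structure is weighted.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for real embeddings: 30 words, 100 dimensions each.
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(30, 100))

tsne = TSNE(
    n_components=2,
    perplexity=5,      # must be smaller than the number of points; tune per dataset
    random_state=42,   # fix the seed so repeated runs give the same layout
    init="pca",        # PCA initialization tends to give more stable results
)
coords_2d = tsne.fit_transform(word_vectors)
print(coords_2d.shape)  # (30, 2)
```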
Regardless of the method used (PCA or t-SNE), the goal is to project the word vectors into a 2D or 3D space and plot them as points. We can then label these points with their corresponding words and examine the resulting scatter plot.
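Once you have 2D coordinates from either method, a labeled scatter plot takes only a few lines with matplotlib. The `words` list and `coords_2d` array below are hypothetical placeholders; pair each word with the row produced by PCA or t-SNE for its vector.

```python
import numpy as np
import matplotlib.pyplot as plt

words = ["king", "queen", "paris", "london", "france", "uk"]

# Stand-in for the reduced vectors; replace with the output of PCA or t-SNE.
rng = np.random.default_rng(1)
coords_2d = rng.normal(size=(len(words), 2))

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(coords_2d[:, 0], coords_2d[:, 1])

# Label each point with its word, offset slightly so the text is readable.
for word, (x, y) in zip(words, coords_2d):
    ax.annotate(word, (x, y), textcoords="offset points", xytext=(4, 2))

ax.set_title("Word embeddings projected to 2D")
plt.show()
```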
What might we expect to see?
Let's imagine we have trained embeddings and selected a small subset of words related to countries, capitals, and royalty. After applying t-SNE to reduce their vectors (e.g., from 100 dimensions down to 2), we might get a plot like the one below.
A hypothetical 2D t-SNE visualization showing semantic clustering. Royalty terms (purple), European capitals (blue), and European countries (teal) form distinct groups.
In this hypothetical plot, we observe that words related to royalty ('king', 'queen', 'prince') cluster together. Similarly, capitals ('paris', 'london', 'berlin') form another group, and countries ('france', 'uk', 'germany', 'spain') form a third. This visual separation confirms that the embeddings have captured some of the expected semantic similarities and differences present in the training data. Visualizations like this provide valuable qualitative feedback on the quality of word embeddings. They help build intuition about what the model has learned and can sometimes highlight unexpected relationships or problems in the embedding space.