Word embeddings, like those generated by Word2Vec or GloVe, represent words as dense vectors, often residing in spaces with hundreds of dimensions (e.g., w ∈ ℝ³⁰⁰). While these high-dimensional representations capture rich semantic information, they are impossible for humans to visualize directly. How can we gain an intuitive understanding of the relationships these models learn? How can we check whether words we expect to be similar actually end up close together in the embedding space?
This is where dimensionality reduction techniques become indispensable tools. By projecting the high-dimensional word vectors down into two or three dimensions, we can create scatter plots that help us visually inspect the learned relationships. Two widely used methods for this purpose are Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE).
PCA is a linear technique that identifies the directions (principal components) in the data that capture the maximum amount of variance. It transforms the original high-dimensional vectors into a lower-dimensional space while trying to preserve as much of the global structure and variance as possible.
Imagine trying to represent a 3D cloud of points on a 2D sheet of paper. PCA finds the best "angle" to view the cloud from, such that the projection onto the paper spreads the points out as much as possible, capturing the primary axes of variation.
While PCA is computationally efficient and deterministic (it always produces the same result for the same data), its focus on variance means it might not always be the best at preserving the local structure or the fine-grained similarities between nearby points in the original high-dimensional space.
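To make this concrete, here is a minimal sketch of reducing word vectors to two dimensions with scikit-learn's PCA. The `word_vectors` array is a random stand-in used purely for illustration; in practice you would substitute the vectors pulled from your trained Word2Vec or GloVe model.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for real embeddings: 10 words, 300 dimensions each.
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(10, 300))

# Keep the two directions that capture the most variance.
pca = PCA(n_components=2)
coords_2d = pca.fit_transform(word_vectors)

print(coords_2d.shape)                 # (10, 2)
print(pca.explained_variance_ratio_)   # fraction of variance captured by each component
```

Because PCA is deterministic, rerunning this on the same vectors always yields the same 2D coordinates, which makes it a convenient first check.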
t-SNE is a non-linear dimensionality reduction technique particularly well suited for visualizing high-dimensional data in low dimensions (typically 2D or 3D). Unlike PCA, t-SNE explicitly models the similarity between points: it tries to ensure that points that are close together in the high-dimensional space remain close together in the low-dimensional map, prioritizing these local neighborhoods over the faithful preservation of large distances.
t-SNE often produces more compelling visualizations that reveal clusters and local structures within the data. If Word2Vec learned meaningful relationships, t-SNE can often make these apparent, showing related words grouped together.
However, t-SNE has some considerations: it is stochastic, so different runs can produce different layouts unless a random seed is fixed; it is considerably slower than PCA, especially for large vocabularies; its output is sensitive to hyperparameters such as perplexity; and the sizes of clusters and the distances between them in the resulting plot do not reliably reflect distances in the original high-dimensional space.
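The sketch below applies scikit-learn's t-SNE to a stand-in array of word vectors. The `perplexity` and `random_state` arguments relate directly to the considerations just mentioned: fixing the seed makes runs reproducible, and perplexity is the main knob controlling how local structure is weighted.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for real embeddings: 30 words, 100 dimensions each.
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(30, 100))

tsne = TSNE(
    n_components=2,
    perplexity=5,      # must be smaller than the number of points; tune per dataset
    random_state=42,   # fix the seed so repeated runs give the same layout
    init="pca",        # PCA initialization tends to give more stable results
)
coords_2d = tsne.fit_transform(word_vectors)
print(coords_2d.shape)  # (30, 2)
```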
Regardless of the method used (PCA or t-SNE), the goal is to project the word vectors into a 2D or 3D space and plot them as points. We can then label these points with their corresponding words and examine the resulting scatter plot.
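Once you have 2D coordinates from either method, a labeled scatter plot takes only a few lines with matplotlib. The `words` list and `coords_2d` array below are hypothetical placeholders; pair each word with the row produced by PCA or t-SNE for its vector.

```python
import numpy as np
import matplotlib.pyplot as plt

words = ["king", "queen", "paris", "london", "france", "uk"]

# Stand-in for the reduced vectors; replace with the output of PCA or t-SNE.
rng = np.random.default_rng(1)
coords_2d = rng.normal(size=(len(words), 2))

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(coords_2d[:, 0], coords_2d[:, 1])

# Label each point with its word, offset slightly so the text is readable.
for word, (x, y) in zip(words, coords_2d):
    ax.annotate(word, (x, y), textcoords="offset points", xytext=(4, 2))

ax.set_title("Word embeddings projected to 2D")
plt.show()
```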
What might we expect to see?
Let's imagine we have trained embeddings and selected a small subset of words related to countries, capitals, and royalty. After applying t-SNE to reduce their vectors (e.g., from 100 dimensions down to 2), we might get a plot like the one below.
A hypothetical 2D t-SNE visualization showing semantic clustering. Royalty terms (purple), European capitals (blue), and European countries (teal) form distinct groups.
In this hypothetical plot, we observe that words related to royalty ('king', 'queen', 'prince') cluster together. Similarly, capitals ('paris', 'london', 'berlin') form another group, and countries ('france', 'uk', 'germany', 'spain') form a third. This visual separation confirms that the embeddings have captured some of the expected semantic similarities and differences present in the training data. Visualizations like this provide valuable qualitative feedback on the quality of word embeddings. They help build intuition about what the model has learned and can sometimes highlight unexpected relationships or problems in the embedding space.