As we discussed, embedding models often generate vectors with hundreds or even thousands of dimensions. While these high-dimensional spaces can capture complex relationships, they also introduce challenges often referred to as the "Curse of Dimensionality." Working with data in extremely high dimensions can increase storage and computation costs, slow down similarity search, and make distance metrics less discriminative, since points tend to appear nearly equidistant from one another as dimensionality grows.
Dimensionality reduction techniques offer a way to mitigate these issues. The fundamental goal is to transform data from a high-dimensional space into a lower-dimensional space while preserving meaningful properties of the original data as much as possible. Think of it like creating a concise summary or a shadow of the original data in fewer dimensions.
An illustration of dimensionality reduction mapping points from a higher dimension to a lower one.
What properties do we want to preserve? It depends on the technique and the goal: some methods prioritize retaining the directions of greatest overall variance in the data, while others focus on keeping nearby points close together so that local neighborhood and cluster structure survives the projection.
While there are many algorithms, two common approaches you'll encounter are Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP).
PCA is a linear technique that aims to find the directions (principal components) in the data that capture the maximum variance. Imagine rotating the data axes so that the first new axis aligns with the direction of the greatest spread, the second axis (orthogonal to the first) aligns with the next greatest spread, and so on. By keeping only the first few principal components, you retain most of the data's overall variance in fewer dimensions. It's effective when the underlying structure you care about is related to this variance.
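As a concrete sketch, the snippet below uses scikit-learn's PCA to project a batch of embeddings down to 50 components. The random 768-dimensional array is a placeholder standing in for real model output, and the component count is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real embeddings: 1,000 vectors with 768 dimensions each.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(1000, 768))

# Keep the 50 directions (principal components) that capture the most variance.
pca = PCA(n_components=50)
reduced = pca.fit_transform(embeddings)

print(reduced.shape)                        # (1000, 50)
print(pca.explained_variance_ratio_.sum())  # fraction of total variance retained
```

The `explained_variance_ratio_` attribute is a convenient way to decide how many components to keep: you can raise `n_components` until the retained variance crosses a threshold you are comfortable with.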
UMAP is a non-linear technique particularly adept at preserving the local structure and topological properties of the data. It tries to ensure that points close together in the high-dimensional space remain close together in the lower-dimensional mapping. UMAP is often favored for visualizing high-dimensional embeddings (like reducing them to 2D or 3D for plotting) because it can reveal clusters and relationships that might be obscured by techniques like PCA, which focus solely on global variance.
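A minimal sketch using the umap-learn package is shown below; it projects placeholder 384-dimensional vectors to 2D for plotting. The `n_neighbors` and `min_dist` values are the library defaults written out explicitly, and the random array again stands in for real embeddings.

```python
import numpy as np
import umap  # provided by the umap-learn package

# Stand-in for real embeddings: 500 vectors with 384 dimensions each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 384))

# Project to 2D while preserving local neighborhood structure.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
coords = reducer.fit_transform(embeddings)

print(coords.shape)  # (500, 2), ready for a scatter plot
```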
Applying dimensionality reduction can offer several benefits: lower storage and memory requirements, faster distance computations and index builds, the ability to plot embeddings in two or three dimensions for inspection, and in some cases a mild denoising effect as low-variance, noisy directions are discarded.
However, there's an inherent trade-off: any reduction discards information. Distances and neighborhoods in the lower-dimensional space only approximate those in the original space, so aggressive reduction can degrade the quality of downstream tasks such as semantic search.
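One way to see this loss concretely is to project vectors down with PCA and then map them back to the original space: the reconstruction never matches the original exactly. The sketch below assumes scikit-learn and uses random placeholder data in place of real embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real embeddings: 1,000 vectors with 768 dimensions each.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(1000, 768))

# Reduce to 64 dimensions, then map back to the original 768.
pca = PCA(n_components=64).fit(embeddings)
reconstructed = pca.inverse_transform(pca.transform(embeddings))

# A nonzero reconstruction error quantifies the information that was discarded.
mse = np.mean((embeddings - reconstructed) ** 2)
print(f"Mean squared reconstruction error: {mse:.4f}")
```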
While modern vector databases and their associated ANN indexing algorithms (like HNSW, which we'll cover in Chapter 3) are specifically designed to handle high-dimensional vectors efficiently, understanding dimensionality reduction is still valuable.
In practice, for many semantic search tasks using modern embedding models (often with dimensions like 384, 768, or 1024), developers often index the full-dimensional vectors directly, relying on the power of ANN algorithms. However, dimensionality reduction remains an important tool in the data scientist's toolkit, particularly for analysis, visualization, or resource-constrained environments.