Storing and searching through millions or billions of high-dimensional vectors, like those generated by LLM embedding models, presents significant challenges. The sheer memory footprint of these vectors (often stored as float32, requiring 4 bytes per dimension) can be prohibitive, and calculating exact distances between a query vector and all database vectors becomes computationally expensive. Product Quantization (PQ) is a vector compression technique designed to address both of these issues, enabling substantial memory savings and faster approximate distance calculations.
PQ operates on the principle of "divide and conquer." Instead of quantizing the entire high-dimensional vector at once, which would require an impractically large number of centroids to achieve acceptable resolution, PQ first decomposes the vector into multiple lower-dimensional sub-vectors.
Let's consider an original vector $x$ in a $D$-dimensional space ($x \in \mathbb{R}^D$). PQ splits this vector into $m$ distinct sub-vectors, each of dimension $D/m$. We assume $D$ is divisible by $m$ for simplicity.

$$x = (u_1, u_2, \ldots, u_m)$$

Here, each $u_j$ ($j = 1, \ldots, m$) is a sub-vector of dimension $D/m$. For example, a 768-dimensional vector could be split into $m = 48$ sub-vectors, each of dimension 16.
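To make this concrete, splitting is a simple reshape in NumPy. The sketch below uses the sizes from the example above; variable names like `d_sub` are our own:

```python
import numpy as np

D, m = 768, 48        # original dimension, number of sub-vectors
d_sub = D // m        # sub-vector dimension: 16

x = np.random.rand(D).astype(np.float32)

# Row j of the result is the sub-vector u_j.
sub_vectors = x.reshape(m, d_sub)   # shape (48, 16)
```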
The core idea is to quantize each of these $m$ subspaces independently. For each subspace $j$, we create a separate codebook $C_j = \{c_{j,1}, c_{j,2}, \ldots, c_{j,k}\}$. This codebook consists of $k$ representative vectors, often called centroids, specific to that subspace. These centroids are typically learned using the k-means clustering algorithm on a training set of sub-vectors corresponding to that subspace.
The number of centroids, $k$, is an important parameter. It determines the granularity of the quantization within each subspace. A common choice is $k = 256$, allowing each centroid index to be represented by a single byte (8 bits).
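As an illustrative sketch of this offline training step, the following uses scikit-learn's `KMeans` to learn one codebook per subspace. The function name and array layout are our own choices, not a reference implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(train_vectors, m, k=256):
    """Learn one k-means codebook per subspace.

    train_vectors: (n, D) float32 sample with D divisible by m.
    Returns centroids as an (m, k, D // m) array.
    """
    n, D = train_vectors.shape
    d_sub = D // m
    subs = train_vectors.reshape(n, m, d_sub)   # group sub-vectors by subspace
    codebooks = np.empty((m, k, d_sub), dtype=np.float32)
    for j in range(m):
        # Cluster the n sub-vectors that fall in subspace j.
        km = KMeans(n_clusters=k, n_init=1).fit(subs[:, j, :])
        codebooks[j] = km.cluster_centers_
    return codebooks
```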
Once the codebooks are generated, any original vector $x$ can be encoded. Each sub-vector $u_j$ of $x$ is mapped to the index (ID) of its nearest centroid in the corresponding codebook $C_j$. Let $q_j(u_j)$ be the function that finds the index of the nearest centroid in $C_j$ for sub-vector $u_j$:

$$q_j(u_j) = \arg\min_{i \in \{1, \ldots, k\}} \| u_j - c_{j,i} \|^2$$
The original vector $x$ is then represented by a sequence of $m$ centroid indices:

$$\text{code}(x) = (q_1(u_1), q_2(u_2), \ldots, q_m(u_m))$$
Each index requires $\log_2 k$ bits. If we choose $k = 256$, each index needs 8 bits (1 byte). Thus, the entire $D$-dimensional vector is compressed into just $m$ bytes. Compared to storing the original $D$-dimensional float32 vector (requiring $4D$ bytes), this represents a significant compression ratio, typically ranging from 10x to 100x depending on $D$ and $m$. For our 768-dimensional example with $m = 48$ and $k = 256$, the compressed size is 48 bytes, compared to $768 \times 4 = 3072$ bytes originally, a compression factor of 64x.
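Continuing the sketch, encoding maps each sub-vector to its nearest centroid ID and stores the result as `uint8`, which works because $k = 256$ indices fit in one byte (again, an illustrative implementation):

```python
def encode(x, codebooks):
    """Compress one vector into m uint8 centroid indices (requires k <= 256)."""
    m, k, d_sub = codebooks.shape
    subs = x.reshape(m, d_sub)
    code = np.empty(m, dtype=np.uint8)
    for j in range(m):
        # q_j(u_j): index of the nearest centroid in codebook C_j.
        dists = ((codebooks[j] - subs[j]) ** 2).sum(axis=1)
        code[j] = np.argmin(dists)
    return code
```

For the 768-dimensional example, `encode(x, codebooks).nbytes` is 48, versus the 3072 bytes of the original float32 vector.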
The process of Product Quantization: A high-dimensional vector is split into $m$ sub-vectors. Each subspace undergoes k-means clustering to create a codebook of $k$ centroids. The original vector is then represented by the sequence of $m$ nearest-centroid IDs, one from each subspace.
Beyond memory compression, the other primary benefit of PQ is the ability to quickly approximate the distance between a query vector $y$ and a compressed database vector $x$ (represented by its code, $\text{code}(x)$).
The most common method is Asymmetric Distance Computation (ADC). Here, the query vector $y$ remains in its original, uncompressed form. The distance to a database vector $x$ (represented by its PQ code) is approximated by summing the distances between the query's sub-vectors and the centroids corresponding to the codes stored for $x$.
Specifically, let $y = (y_1, y_2, \ldots, y_m)$, where $y_j$ is the $j$-th sub-vector of the query. The approximate squared Euclidean distance is calculated as:

$$\hat{d}(y, x)^2 = \sum_{j=1}^{m} \| y_j - c_{j,\, q_j(u_j)} \|^2$$

where $c_{j,\, q_j(u_j)}$ is the centroid that sub-vector $u_j$ was assigned to during encoding.
The main optimization here is that the distances $\| y_j - c_{j,i} \|^2$ for all $j \in \{1, \ldots, m\}$ and $i \in \{1, \ldots, k\}$ can be precomputed once per query $y$. This creates $m$ lookup tables, each containing $k$ precomputed distances. The total distance approximation for a database vector then reduces to $m$ table lookups and $m - 1$ additions. This is significantly faster than calculating the full $D$-dimensional distance.
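A vectorized sketch of ADC in NumPy, assuming the database codes are stacked in an `(n, m)` uint8 array (the helper name is hypothetical):

```python
def adc_distances(y, codes, codebooks):
    """Approximate squared L2 distances from query y to all encoded vectors.

    y: (D,) query; codes: (n, m) uint8 PQ codes.
    """
    m, k, d_sub = codebooks.shape
    y_subs = y.reshape(m, d_sub)
    # m lookup tables of k entries each, built once per query:
    # tables[j, i] = || y_j - c_{j,i} ||^2
    tables = ((codebooks - y_subs[:, None, :]) ** 2).sum(axis=2)   # (m, k)
    # Each approximate distance is then m table lookups plus additions.
    return tables[np.arange(m), codes].sum(axis=1)                 # (n,)
```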
An alternative is Symmetric Distance Computation (SDC), where the query vector $y$ is also quantized using the same codebooks. The distance is then approximated using only the centroids:

$$\hat{d}_{\text{SDC}}(y, x)^2 = \sum_{j=1}^{m} \| c_{j,\, q_j(y_j)} - c_{j,\, q_j(u_j)} \|^2$$
SDC can be even faster as it operates entirely in the compressed domain, potentially using precomputed distances between all pairs of centroids within each subspace. However, quantizing the query introduces additional approximation error, generally making SDC less accurate than ADC. ADC is usually preferred when accuracy is prioritized.
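For comparison, here is a sketch of SDC under the same layout. The centroid-to-centroid tables are query-independent, so they can be built once offline:

```python
# Offline: pair_tables[j, a, b] = || c_{j,a} - c_{j,b} ||^2, shape (m, k, k).
pair_tables = ((codebooks[:, :, None, :] -
                codebooks[:, None, :, :]) ** 2).sum(axis=3)

def sdc_distances(y, codes, codebooks, pair_tables):
    """Quantize the query, then approximate distances centroid-to-centroid."""
    query_code = encode(y, codebooks)   # the extra quantization step
    m = codebooks.shape[0]
    return pair_tables[np.arange(m), query_code, codes].sum(axis=1)
```

Note the memory cost of the pairwise tables: with $k = 256$, each of the $m$ tables holds $256^2$ entries.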
Product Quantization offers substantial benefits but comes with inherent trade-offs:
Accuracy vs. Compression: PQ is a lossy compression technique. Information is lost when mapping sub-vectors to centroids, and this loss directly impacts the accuracy (recall) of the nearest neighbor search. The primary levers controlling this trade-off are the number of sub-vectors $m$ and the number of centroids per codebook $k$: increasing either reduces quantization error and improves recall, but at the cost of larger codes ($m \log_2 k$ bits per vector) and, for $k$, larger codebooks and higher training cost.
Training Cost: Generating the $m$ codebooks requires running k-means clustering $m$ times on potentially large datasets of sub-vectors derived from a representative training sample. This offline training step can be computationally intensive.
Integration with Index Structures: While PQ provides compression and fast approximate distances, it doesn't inherently provide a sub-linear search structure like HNSW or IVF. PQ is most often used in combination with these structures. For instance, in IVFPQ (covered in the next section), PQ is used to compress the vectors stored within the inverted lists of an IVF index, dramatically reducing the index size and speeding up distance calculations during the search refinement stage. Similarly, PQ can compress vectors stored at the nodes of an HNSW graph.
Product Quantization serves as a foundational technique for managing the scale of modern vector datasets. By decomposing vectors and quantizing subspaces independently, it achieves remarkable compression and enables rapid approximate distance calculations, making it an indispensable tool in the construction of large-scale, efficient vector search systems. Understanding its mechanics is essential for optimizing memory usage and query latency in demanding LLM applications.