Storing and searching through millions or billions of high-dimensional vectors, like those generated by LLM embedding models, presents significant challenges. The sheer memory footprint of these vectors (often float32, requiring D×4 bytes per vector) can be prohibitive, and calculating exact distances between a query vector and all database vectors becomes computationally expensive. Product Quantization (PQ) is a vector compression technique designed to address both these issues, enabling substantial memory savings and faster approximate distance calculations.
PQ operates on the principle of "divide and conquer." Instead of quantizing the entire high-dimensional vector at once, which often leads to poor resolution, PQ first decomposes the vector into multiple lower-dimensional sub-vectors.
Let's consider an original vector $x$ in a $D$-dimensional space ($x \in \mathbb{R}^D$). PQ splits this vector into $M$ distinct sub-vectors, each of dimension $D/M$. We assume $D$ is divisible by $M$ for simplicity.
$$x = [\, x_1 \mid x_2 \mid \dots \mid x_M \,]$$
Here, each $x_m$ ($1 \le m \le M$) is a sub-vector of dimension $D/M$. For example, a 768-dimensional vector could be split into $M = 48$ sub-vectors, each of dimension $D/M = 16$.
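As a quick illustration, when the sub-vectors are taken as contiguous slices the split is just a reshape. The sketch below uses NumPy and the 768-dimensional example above; the random vector simply stands in for a real embedding.

```python
import numpy as np

# Minimal sketch of the splitting step, using the 768-dimensional example
# from the text: M = 48 sub-vectors, each of dimension D / M = 16.
D, M = 768, 48
x = np.random.rand(D).astype(np.float32)  # stand-in for a real embedding

# Reshaping to (M, D // M) yields the sub-vectors x_1, ..., x_M as rows.
sub_vectors = x.reshape(M, D // M)
print(sub_vectors.shape)  # (48, 16)
```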
The core idea is to quantize each of these $M$ subspaces independently. For each subspace $m$, we create a separate codebook $C_m$. This codebook consists of $k^*$ representative vectors, often called centroids, specific to that subspace. These centroids are typically learned using the k-means clustering algorithm on a training set of sub-vectors corresponding to that subspace.
$$C_m = \{\, c_{m,1}, c_{m,2}, \dots, c_{m,k^*} \,\}, \quad \text{where } c_{m,j} \in \mathbb{R}^{D/M}$$
The number of centroids, $k^*$, is a crucial parameter. It determines the granularity of the quantization within each subspace. A common choice is $k^* = 256$, allowing each centroid index to be represented by a single byte (8 bits).
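To make the training step concrete, here is a minimal sketch of per-subspace codebook learning. It assumes a training matrix `X_train` of shape (N, D) and uses scikit-learn's `KMeans`; the function name `train_codebooks` is illustrative, not a library API, and production systems typically rely on an optimized implementation instead.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(X_train: np.ndarray, M: int, k_star: int = 256) -> np.ndarray:
    """Learn one k-means codebook per subspace; returns shape (M, k*, D/M)."""
    N, D = X_train.shape
    d_sub = D // M
    codebooks = np.empty((M, k_star, d_sub), dtype=np.float32)
    for m in range(M):
        # k-means runs independently on the m-th slice of the training vectors.
        sub = X_train[:, m * d_sub:(m + 1) * d_sub]
        km = KMeans(n_clusters=k_star, n_init=4, random_state=0).fit(sub)
        codebooks[m] = km.cluster_centers_
    return codebooks
```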
Once the $M$ codebooks are generated, any original vector $x$ can be encoded. Each sub-vector $x_m$ of $x$ is mapped to the index (ID) of its nearest centroid in the corresponding codebook $C_m$. Let $q_m(x_m)$ be the function that finds the index of the nearest centroid in $C_m$ for sub-vector $x_m$:
$$q_m(x_m) = \operatorname*{arg\,min}_{j \in \{1,\dots,k^*\}} \| x_m - c_{m,j} \|^2$$
The original vector $x$ is then represented by a sequence of $M$ centroid indices:
$$\mathrm{code}(x) = [\, q_1(x_1), q_2(x_2), \dots, q_M(x_M) \,]$$
Each index $q_m(x_m)$ requires $\log_2(k^*)$ bits. If we choose $k^* = 256$, each index needs 8 bits (1 byte). Thus, the entire $D$-dimensional vector $x$ is compressed into just $M$ bytes. Compared to storing the original $D$-dimensional float32 vector (requiring $D \times 4$ bytes), this represents a significant compression ratio, typically ranging from 10x to 100x depending on $D$ and $M$. For our 768-dimensional example with $M = 48$, the compressed size is 48 bytes, compared to $768 \times 4 = 3072$ bytes originally, a compression factor of 64x.
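The encoding step can be sketched in a few lines. The snippet below assumes a `codebooks` array of shape (M, k*, D/M), such as the one produced by the training sketch above, and stores each index as a single uint8 byte (valid for k* ≤ 256).

```python
import numpy as np

def encode(x: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Map each sub-vector to the index of its nearest centroid (M bytes total)."""
    M, k_star, d_sub = codebooks.shape
    sub_vectors = x.reshape(M, d_sub)
    codes = np.empty(M, dtype=np.uint8)  # one byte per subspace when k* <= 256
    for m in range(M):
        # Squared Euclidean distance from x_m to every centroid c_{m,j}.
        dists = np.sum((codebooks[m] - sub_vectors[m]) ** 2, axis=1)
        codes[m] = np.argmin(dists)
    return codes  # e.g. 48 bytes instead of 3072 for the 768-d example
```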
The process of Product Quantization: A high-dimensional vector is split into sub-vectors. Each subspace undergoes k-means clustering to create a codebook. The original vector is then represented by the sequence of nearest centroid IDs from each subspace.
The primary benefit of PQ, beyond memory compression, is the ability to quickly approximate the distance between a query vector q and a compressed database vector x (represented by its code, code(x)).
The most common method is Asymmetric Distance Computation (ADC). Here, the query vector $q$ remains in its original, uncompressed form. The distance to a database vector $x$ (represented by its PQ code) is approximated by summing the distances between the query's sub-vectors $q_m$ and the centroids corresponding to the codes stored for $x$.
Specifically, let $\mathrm{code}(x) = [\, \mathrm{id}_1, \mathrm{id}_2, \dots, \mathrm{id}_M \,]$, where $\mathrm{id}_m = q_m(x_m)$. The approximate squared Euclidean distance is calculated as:
$$\hat{d}(q, x)^2 \approx \sum_{m=1}^{M} \| q_m - c_{m,\mathrm{id}_m} \|^2$$
The key optimization here is that the distances $\| q_m - c_{m,j} \|^2$ for all $m \in \{1,\dots,M\}$ and $j \in \{1,\dots,k^*\}$ can be precomputed once per query $q$. This creates $M$ lookup tables, each containing $k^*$ precomputed distances. The total distance approximation for a database vector $x$ then reduces to $M$ table lookups and $M - 1$ additions. This is significantly faster than calculating the full $D$-dimensional distance.
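The lookup-table trick translates directly into code. The sketch below precomputes the (M, k*) table of squared distances for one query and then scores an entire matrix of PQ codes with gathers and sums; `codes` is assumed to be an (N, M) uint8 array produced by an encoder like the one above.

```python
import numpy as np

def adc_distances(q: np.ndarray, codebooks: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Approximate squared distances from query q to all N encoded database vectors."""
    M, k_star, d_sub = codebooks.shape
    q_sub = q.reshape(M, d_sub)

    # Lookup table: table[m, j] = ||q_m - c_{m,j}||^2, computed once per query.
    table = np.sum((codebooks - q_sub[:, None, :]) ** 2, axis=2)

    # For each database vector, gather one table entry per subspace and sum them.
    return table[np.arange(M), codes].sum(axis=1)  # shape (N,)
```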
An alternative is Symmetric Distance Computation (SDC), where the query vector q is also quantized using the same codebooks. The distance is then approximated using only the centroids:
$$\tilde{d}(q, x)^2 \approx \sum_{m=1}^{M} \| c_{m,q_m(q_m)} - c_{m,q_m(x_m)} \|^2$$
SDC can be even faster as it operates entirely in the compressed domain, potentially using precomputed distances between all pairs of centroids within each subspace. However, quantizing the query introduces additional approximation error, generally making SDC less accurate than ADC. ADC is usually preferred when accuracy is prioritized.
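For completeness, a sketch of the symmetric variant: the pairwise centroid distances within each subspace can be precomputed once (an (M, k*, k*) array) and reused for every query, so comparing two PQ codes needs only M lookups. The function names here are illustrative.

```python
import numpy as np

def precompute_sdc_tables(codebooks: np.ndarray) -> np.ndarray:
    """sdc[m, i, j] = ||c_{m,i} - c_{m,j}||^2 for every centroid pair in subspace m."""
    diffs = codebooks[:, :, None, :] - codebooks[:, None, :, :]
    return np.sum(diffs ** 2, axis=-1)  # shape (M, k*, k*)

def sdc_distance(q_code: np.ndarray, x_code: np.ndarray, sdc_tables: np.ndarray) -> float:
    """Approximate squared distance between two PQ codes (both length M)."""
    M = sdc_tables.shape[0]
    return float(sdc_tables[np.arange(M), q_code, x_code].sum())
```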
Product Quantization offers substantial benefits but comes with inherent trade-offs:
Accuracy vs. Compression: PQ is a lossy compression technique. Information is lost when mapping sub-vectors to centroids. This loss directly impacts the accuracy (recall) of the nearest neighbor search. The primary levers controlling this trade-off are the number of sub-vectors $M$ and the number of centroids per codebook $k^*$: increasing either one reduces quantization error and improves recall, at the cost of larger codes (for $M$) or larger codebooks and slower encoding and table construction (for $k^*$).
Training Cost: Generating the M codebooks requires running k-means clustering M times on potentially large datasets of sub-vectors derived from a representative training sample. This offline training step can be computationally intensive.
Integration with Index Structures: While PQ provides compression and fast approximate distances, it doesn't inherently provide a sub-linear search structure like HNSW or IVF. PQ is most often used in combination with these structures. For instance, in IVFPQ (covered in the next section), PQ is used to compress the vectors stored within the inverted lists of an IVF index, dramatically reducing the index size and speeding up distance calculations during the search refinement stage. Similarly, PQ can compress vectors stored at the nodes of an HNSW graph.
Product Quantization serves as a foundational technique for managing the scale of modern vector datasets. By decomposing vectors and quantizing subspaces independently, it achieves remarkable compression and enables rapid approximate distance calculations, making it an indispensable tool in the construction of large-scale, efficient vector search systems. Understanding its mechanics is essential for optimizing memory usage and query latency in demanding LLM applications.