As discussed in Chapter 1, vector embeddings, often represented as high-dimensional floating-point vectors (e.g., float32), capture rich semantic information. However, storing and searching through millions or billions of these dense vectors presents significant challenges. The sheer memory footprint can be enormous, and calculating distances between high-dimensional float vectors during search is computationally intensive, directly impacting latency and throughput.
To address these challenges, quantization techniques are employed. Quantization, in this context, refers to methods that reduce the memory required to store each vector, typically by representing its components with lower precision. This compression not only saves memory but can also accelerate distance computations, often at the cost of some precision in the search results (a potential decrease in recall). We'll explore two primary families of quantization techniques: Scalar Quantization and Product Quantization.
Scalar Quantization (SQ)
Scalar Quantization is perhaps the most straightforward approach. It operates independently on each dimension (scalar component) of a vector. The core idea is to map the original floating-point value in each dimension to a value from a smaller set, usually represented by fewer bits.
Common forms of SQ include:
- Floating-Point Reduction: Converting float32 vectors to float16 (half-precision). This immediately halves the memory footprint and can speed up computation on hardware supporting float16 operations, with relatively modest impact on accuracy for many applications.
- Integer Quantization: Mapping float values within a certain range to integers, most commonly 8-bit integers (int8). For each dimension, the range of values across the dataset is determined and divided into $2^8 = 256$ bins; each original float value is then mapped to the integer ID of the bin it falls into. This reduces storage to 1 byte per dimension. Distance calculations can sometimes be accelerated using specialized CPU instructions (SIMD) designed for integer operations.
- Binary Quantization: An extreme form where each dimension is reduced to a single bit (0 or 1), often based on whether the original value is positive or negative, or above/below a certain threshold. This offers maximum compression per dimension but typically comes with a significant loss of information. Distance computations usually rely on the Hamming distance between bit codes, as sketched below.
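As a minimal illustration of the binary case, the NumPy sketch below packs one sign bit per dimension and compares codes with the Hamming distance. The zero threshold, the function names, and the 768-dimensional toy vectors are assumptions for the example, not a fixed recipe or a specific library's API.

```python
import numpy as np

def binary_quantize(vec: np.ndarray) -> np.ndarray:
    """One bit per dimension (1 if positive, else 0), packed 8 bits per byte."""
    bits = (vec > 0).astype(np.uint8)
    return np.packbits(bits)  # e.g., 768 dims -> 96 bytes

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two packed binary codes."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# Toy usage with two random 768-dimensional vectors.
rng = np.random.default_rng(0)
x, y = rng.standard_normal((2, 768)).astype(np.float32)
print(binary_quantize(x).nbytes)                            # 96 bytes (vs. 3072 for float32)
print(hamming_distance(binary_quantize(x), binary_quantize(y)))
```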
How SQ Works (int8 Example):
Imagine a single dimension of your vectors contains values ranging from -10.0 to +10.0 across the dataset.
- Determine Range: Find the min (-10.0) and max (+10.0) values for this dimension.
- Define Bins: Divide the range [-10.0, 10.0] into 256 equal intervals.
- Map Value: For a vector with value 3.5 in this dimension, determine which interval it falls into and assign the corresponding int8 ID (e.g., with a bin width of 20/256 ≈ 0.078, the value 3.5 falls into bin 172, which covers roughly 3.44 to 3.52).
- Store: Store the int8 value (172) instead of the float32 value (3.5). These four steps are sketched in code below.
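Putting the steps together, here is a minimal NumPy sketch of int8 scalar quantization. The function names and the single-dimension usage example are illustrative assumptions, not a particular library's API; real implementations may clamp outliers or use a symmetric scale.

```python
import numpy as np

NUM_BINS = 256  # 2^8 bins -> one uint8 code per dimension

def train_ranges(vectors: np.ndarray):
    """Step 1: find the per-dimension min and max over the dataset."""
    return vectors.min(axis=0), vectors.max(axis=0)

def quantize_int8(vectors: np.ndarray, vmin: np.ndarray, vmax: np.ndarray) -> np.ndarray:
    """Steps 2-3: split each range into 256 equal bins and store the bin ID as uint8."""
    width = (vmax - vmin) / NUM_BINS
    bins = np.floor((vectors - vmin) / width)
    return np.clip(bins, 0, NUM_BINS - 1).astype(np.uint8)

def dequantize(codes: np.ndarray, vmin: np.ndarray, vmax: np.ndarray) -> np.ndarray:
    """Approximate reconstruction using the center of each bin."""
    width = (vmax - vmin) / NUM_BINS
    return vmin + (codes.astype(np.float32) + 0.5) * width

# Reproduce the worked example: the value 3.5 in a dimension ranging over [-10, 10].
vmin, vmax = np.array([-10.0]), np.array([10.0])
codes = quantize_int8(np.array([[3.5]]), vmin, vmax)
print(codes)                          # [[172]]
print(dequantize(codes, vmin, vmax))  # [[3.4765625]]
```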
Pros of SQ:
- Conceptually simple and easy to implement.
- Reduces memory usage (e.g., 4x reduction for float32 to int8).
- Can accelerate distance calculations (especially int8 with SIMD or binary with Hamming distance).
Cons of SQ:
- Compression is capped because every dimension still needs its own code (e.g., int8 always uses 1 byte per dimension, so the reduction from float32 tops out at 4x unless you go all the way to binary).
- Information loss occurs independently in each dimension, which can sometimes disproportionately affect dimensions with high variance or significance.
- For very high compression goals, SQ might lead to substantial accuracy degradation.
Product Quantization (PQ)
Product Quantization offers a more sophisticated approach, often achieving much higher compression ratios than SQ while aiming to preserve more vector information relative to the degree of compression. Instead of quantizing each dimension independently, PQ works with segments of the vector.
How PQ Works:
- Sub-vector Division: Divide each high-dimensional vector $v \in \mathbb{R}^D$ into $M$ distinct sub-vectors of dimension $D/M$ (assume $D$ is divisible by $M$ for simplicity):
$$v = [\,v_1 \mid v_2 \mid \dots \mid v_M\,], \qquad v_m \in \mathbb{R}^{D/M}$$
- Codebook Generation (Training): For each sub-vector position $m$ (from 1 to $M$), run a clustering algorithm (typically k-means) on the set of all $m$-th sub-vectors from the entire dataset. This generates $M$ separate "codebooks". Each codebook $C_m$ consists of $k$ centroids (also called codewords), where each centroid $c_{m,j} \in \mathbb{R}^{D/M}$ represents a typical sub-vector found at position $m$. Usually $k$ is set to 256, allowing each centroid index to be stored in 8 bits (1 byte).
- Vector Encoding: To quantize an original vector $v$, consider each of its sub-vectors $v_m$. For each $v_m$, find the nearest centroid $c_{m,j}$ in the corresponding codebook $C_m$. The original vector $v$ is then represented by the sequence of $M$ centroid indices (IDs) that best approximate its sub-vectors:
$$\mathrm{PQ}(v) = [\,\mathrm{id}(v_1), \mathrm{id}(v_2), \dots, \mathrm{id}(v_M)\,]$$
where $\mathrm{id}(v_m) = \arg\min_{j \in \{1,\dots,k\}} \|v_m - c_{m,j}\|_2^2$.
- Storage: Instead of storing the original $D$-dimensional float32 vector ($D \times 32$ bits), we store $M$ indices. If $k = 256$, each index requires $\log_2(256) = 8$ bits, so the total is $M \times 8$ bits ($M$ bytes), regardless of the original dimension $D$. This allows significant compression, especially when $D$ is large: a 768-dimensional float32 vector takes 3072 bytes, while PQ with $M = 96$ (and $k = 256$) compresses it down to just 96 bytes. The full training-and-encoding pipeline is sketched below.
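The division, training, and encoding steps can be sketched in a few lines of Python. This is a minimal illustration assuming NumPy and scikit-learn's KMeans are available; the function names are placeholders, and production systems typically rely on an optimized implementation such as the one in Faiss.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq_codebooks(train_vectors: np.ndarray, M: int, k: int = 256):
    """Run one k-means per sub-vector position, producing M codebooks of k centroids each."""
    N, D = train_vectors.shape
    d_sub = D // M  # assumes D is divisible by M
    codebooks = []
    for m in range(M):
        sub = train_vectors[:, m * d_sub:(m + 1) * d_sub]          # all m-th sub-vectors
        km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(sub)
        codebooks.append(km.cluster_centers_.astype(np.float32))   # shape (k, d_sub)
    return codebooks

def encode_pq(vectors: np.ndarray, codebooks) -> np.ndarray:
    """Replace each sub-vector with the index of its nearest centroid (one uint8 per segment)."""
    M, d_sub = len(codebooks), codebooks[0].shape[1]
    codes = np.empty((len(vectors), M), dtype=np.uint8)
    for m, cb in enumerate(codebooks):
        sub = vectors[:, m * d_sub:(m + 1) * d_sub]
        dists = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)  # (N, k)
        codes[:, m] = dists.argmin(axis=1)
    return codes

# Toy usage: 768-dim vectors, M=96 segments of 8 dims each -> 96-byte codes.
rng = np.random.default_rng(0)
train = rng.standard_normal((2_000, 768)).astype(np.float32)
codebooks = train_pq_codebooks(train, M=96)
codes = encode_pq(train[:10], codebooks)
print(codes.shape, codes.dtype)      # (10, 96) uint8
```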
Distance Calculation (Asymmetric Distance Computation - ADC):
Exact distances between fully quantized vectors are rarely computed directly. Instead, when searching with a query vector q (usually kept in its original float format), we use Asymmetric Distance Computation (ADC).
- Divide the query $q$ into $M$ sub-vectors $[\,q_1 \mid q_2 \mid \dots \mid q_M\,]$.
- For a database vector represented by PQ codes $[\mathrm{id}_1, \mathrm{id}_2, \dots, \mathrm{id}_M]$, its approximate squared Euclidean distance to $q$ can be calculated by summing the distances between query sub-vectors and the corresponding centroids looked up from the codebooks:
$$d(q, \mathrm{PQ}(v))^2 \approx \sum_{m=1}^{M} \|q_m - c_{m,\mathrm{id}_m}\|_2^2$$
where $c_{m,\mathrm{id}_m}$ is the centroid corresponding to the stored index $\mathrm{id}_m$ in the $m$-th codebook.
- To speed this up further, the distances between each query sub-vector $q_m$ and all $k$ centroids in the $m$-th codebook $C_m$ can be precomputed when the query arrives. The final distance sum then becomes a series of lookups into these precomputed distance tables, as sketched below.
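Continuing the sketch above (same assumed codebooks and uint8 codes), a minimal ADC implementation precomputes one M × k distance table per query and reduces every database distance to M table lookups:

```python
import numpy as np

def adc_search(query: np.ndarray, codes: np.ndarray, codebooks, top_k: int = 10):
    """Asymmetric Distance Computation: float query vs. PQ-encoded database vectors.

    query:     (D,) float vector
    codes:     (N, M) uint8 array of centroid indices
    codebooks: list of M arrays, each of shape (k, D/M)
    """
    M = len(codebooks)
    k, d_sub = codebooks[0].shape

    # 1) Precompute one distance table per segment: distance from the query
    #    sub-vector to every centroid of that segment's codebook -> shape (M, k).
    tables = np.empty((M, k), dtype=np.float32)
    for m, cb in enumerate(codebooks):
        q_sub = query[m * d_sub:(m + 1) * d_sub]
        tables[m] = ((cb - q_sub) ** 2).sum(axis=1)

    # 2) The approximate squared distance to each database vector is a sum of
    #    M table lookups (no arithmetic on the original float vectors).
    approx_dists = tables[np.arange(M), codes].sum(axis=1)   # shape (N,)

    # 3) Return the indices of the top_k closest database vectors.
    return np.argsort(approx_dists)[:top_k]
```

With the toy data from the previous sketch, `adc_search(train[0], codes, codebooks)` would return the approximate nearest neighbors of the first vector among the encoded ones.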
Pros of PQ:
- Achieves very high compression ratios, largely independent of the original vector dimension D.
- Often preserves search accuracy better than SQ at the same level of compression, because it quantizes whole sub-vectors jointly and can exploit correlations between dimensions rather than treating each dimension in isolation.
- Distance calculation using ADC with precomputed tables can be very fast.
Cons of PQ:
- More complex to understand and implement than SQ.
- Requires a representative training dataset to build the codebooks via k-means, which can be computationally expensive.
- The choice of M (number of sub-vectors) and k (number of centroids per codebook) involves trade-offs between compression rate, accuracy, and memory usage for codebooks/distance tables.
- Introduces quantization error that impacts search accuracy; the quality depends heavily on the training data and clustering process.
Comparing SQ and PQ
The choice between Scalar and Product Quantization depends on the specific requirements of your application, particularly the desired balance between memory footprint, search speed, and acceptable accuracy loss.
| Feature | Scalar Quantization (e.g., int8) | Product Quantization (e.g., M segments, k=256) |
| --- | --- | --- |
| Mechanism | Quantize each dimension independently | Quantize sub-vectors using codebooks |
| Compression | Moderate (e.g., 4x for float32 -> int8) | High (e.g., 32x for D=768, M=96) |
| Memory | D × (bits per dimension) | M × log2(k) bits (plus codebooks) |
| Complexity | Lower | Higher (requires training) |
| Training data | Not strictly required (but helps set ranges) | Required for k-means clustering |
| Accuracy loss | Can be high at large compression | Generally lower at the same compression ratio |
| Distance calc. | Simple (potentially SIMD-accelerated) | ADC (fast with precomputed tables) |
| Typical use | Moderate compression, simplicity | High compression, large datasets |
Figure: Estimated memory usage per 768-dimensional vector for different representations; PQ achieves significantly higher compression than the SQ methods.
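The per-vector numbers behind such a comparison follow directly from the representations discussed above; a quick back-of-the-envelope check, assuming 768 dimensions and the PQ configuration used earlier (M = 96, k = 256):

```python
D, M = 768, 96
print("float32:", D * 4, "bytes")   # 3072
print("float16:", D * 2, "bytes")   # 1536
print("int8:   ", D * 1, "bytes")   #  768
print("binary: ", D // 8, "bytes")  #   96
print("PQ:     ", M * 1, "bytes")   #   96 (one uint8 index per segment, plus shared codebooks)
```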
Both SQ and PQ are fundamental tools for managing the resource demands of large-scale vector search. While SQ offers simplicity, PQ provides a path to much greater compression, making it a cornerstone technique for many production systems dealing with billions of vectors. Understanding the mechanics and trade-offs of both is essential for optimizing your vector search pipeline. Building on this, the next section explores Optimized Product Quantization (OPQ), a refinement aimed at further improving PQ's effectiveness.