Quantization, by its nature, involves approximation. When we map values from a high-precision format like 32-bit floating-point (FP32) to a lower-precision format like 8-bit integer (INT8), we inevitably lose some information. This difference between the original value and its quantized representation is the quantization error, sometimes referred to as quantization noise. Understanding and measuring this error is fundamental to evaluating whether a quantization strategy is acceptable for a given model and application.
While the goal is to minimize this error, simply measuring the raw difference isn't always sufficient. We need metrics that help us understand the impact of this error on the model's overall behavior and performance.
At a basic level, we can directly compare the original floating-point tensors (weights or activations) with their quantized-then-de-quantized counterparts (de-quantization converts the low-precision integers back to floating-point values, allowing an element-wise comparison). Common metrics include:
Mean Squared Error (MSE): This measures the average squared difference between the original values $x$ and the quantized-then-dequantized values $x_q$. For a tensor with $N$ elements:
$$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - x_{q,i}\right)^2$$

A lower MSE generally indicates a better approximation. However, MSE is sensitive to outliers and doesn't always directly correlate with the model's final performance on complex tasks. (A short code sketch after the SNR definition below computes both MSE and SNR.)
Signal-to-Noise Ratio (SNR): Often expressed in decibels (dB), SNR compares the power of the original signal (tensor values) to the power of the quantization noise (the error). A higher SNR indicates that the original signal is stronger relative to the error introduced by quantization. It's often calculated as:
$$\text{SNR} = 10\log_{10}\left(\frac{\sum_{i=1}^{N} x_i^2}{\sum_{i=1}^{N}\left(x_i - x_{q,i}\right)^2}\right)$$

A higher SNR (e.g., > 20 dB) is generally preferred, suggesting the quantization noise is relatively small compared to the original values.
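Both metrics are straightforward to compute once you have the de-quantized tensor. The sketch below is a minimal NumPy example that assumes a simple symmetric per-tensor INT8 scheme (one of many possible choices); it performs the quantize/de-quantize round trip and reports the MSE and SNR defined above.

```python
import numpy as np

def quantize_dequantize_int8(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor INT8 round trip: quantize, then de-quantize."""
    scale = np.abs(x).max() / 127.0            # map the largest magnitude to 127
    x_q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return x_q.astype(np.float32) * scale      # de-quantize back to float

def mse(x: np.ndarray, x_dq: np.ndarray) -> float:
    """Mean squared error between original and de-quantized values."""
    return float(np.mean((x - x_dq) ** 2))

def snr_db(x: np.ndarray, x_dq: np.ndarray) -> float:
    """Signal-to-noise ratio in decibels."""
    signal_power = np.sum(x ** 2)
    noise_power = np.sum((x - x_dq) ** 2)
    return float(10.0 * np.log10(signal_power / noise_power))

# Example: a synthetic "weight tensor" drawn from a normal distribution.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(4096,)).astype(np.float32)
w_dq = quantize_dequantize_int8(w)

print(f"MSE: {mse(w, w_dq):.3e}")
print(f"SNR: {snr_db(w, w_dq):.1f} dB")
```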
These direct metrics are useful during the development and debugging of quantization algorithms, helping to assess the numerical fidelity of the quantization process itself for specific layers or tensors.
While direct error metrics quantify the numerical difference, they don't tell the whole story about how quantization affects the LLM's capabilities. An LLM is a complex system, and small numerical errors might sometimes amplify, while other times they might have negligible impact on the final output. Therefore, evaluating the quantized model's performance on relevant tasks is essential.
Perplexity (PPL): This is a standard metric for evaluating language models. Perplexity measures how well a probability model predicts a sample. Lower perplexity indicates the model is less "surprised" by the test data, meaning it assigns higher probabilities to the observed sequences of words. When evaluating quantization, we typically compute the perplexity of the original FP32 model and the quantized model (e.g., INT8) on the same validation dataset. A small increase in perplexity for the quantized model compared to the original suggests that the quantization process has preserved the model's language modeling capabilities reasonably well. Significant increases might indicate problematic quantization.
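As a sketch of how this comparison might be set up, the function below computes perplexity as the exponential of the average per-token negative log-likelihood. It assumes a generic PyTorch causal language model that maps token IDs of shape (batch, seq_len) to next-token logits; the names `model_fp32`, `model_int8`, and `val_tokens` in the usage comment are placeholders, not a specific API.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, token_ids: torch.Tensor) -> float:
    """Perplexity = exp(average negative log-likelihood per predicted token).

    `model` is assumed to map token IDs of shape (batch, seq_len)
    to logits of shape (batch, seq_len, vocab_size).
    """
    logits = model(token_ids)                      # (B, T, V)
    # Predict token t+1 from positions 0..T-1: shift logits and targets by one.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = token_ids[:, 1:].contiguous()
    nll = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="mean",
    )
    return math.exp(nll.item())

# Compare the FP32 model and its quantized counterpart on the same data:
# ppl_fp32 = perplexity(model_fp32, val_tokens)
# ppl_int8 = perplexity(model_int8, val_tokens)
# print(f"PPL FP32: {ppl_fp32:.2f}  PPL INT8: {ppl_int8:.2f}")
```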
Task-Specific Metrics: Often, the most meaningful evaluation involves measuring the performance of the quantized LLM on the specific downstream tasks it's intended for. This could involve accuracy on question-answering or multiple-choice benchmarks, exact match and F1 scores for extractive QA, ROUGE or BLEU for summarization and translation, or pass rates on code generation tasks.
Comparing these task-specific metrics between the original and quantized models provides the most direct assessment of the functional impact of quantization.
The magnitude and distribution of quantization error are influenced by the choices made during the quantization process, such as the bit width (e.g., INT4 versus INT8), the quantization scheme (symmetric or asymmetric), the granularity (per-tensor, per-channel, or per-group scaling), and how the clipping range is calibrated and outliers are handled. The short sketch below illustrates the effect of granularity.
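To make one of these choices concrete, the sketch below compares per-tensor and per-channel symmetric INT8 quantization of a hypothetical weight matrix containing a single outlier channel. The per-channel variant typically yields a much lower MSE, because channels with small magnitudes keep a fine-grained quantization grid instead of sharing a scale dominated by the outlier.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weight matrix (out_channels, in_features) with one outlier channel.
w = rng.normal(0.0, 0.02, size=(8, 1024)).astype(np.float32)
w[3] *= 50.0   # channel 3 has a much larger dynamic range than the others

def round_trip(x, scale):
    """Symmetric INT8 quantize/de-quantize with a given scale (scalar or per-row)."""
    q = np.clip(np.round(x / scale), -128, 127)
    return q * scale

# Per-tensor: one scale for the whole matrix (dominated by the outlier channel).
per_tensor = round_trip(w, np.abs(w).max() / 127.0)

# Per-channel: one scale per output channel (row), keeping small channels precise.
per_channel = round_trip(w, np.abs(w).max(axis=1, keepdims=True) / 127.0)

mse = lambda a, b: float(np.mean((a - b) ** 2))
print(f"per-tensor  MSE: {mse(w, per_tensor):.3e}")
print(f"per-channel MSE: {mse(w, per_channel):.3e}")
```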
Measuring quantization error isn't about achieving zero error; it's about understanding the trade-off. We are intentionally reducing precision to gain efficiency (smaller size, faster inference, lower energy consumption). The key is to quantify the resulting drop in accuracy or performance using appropriate metrics (like perplexity or task-specific scores) and determine if this drop is acceptable for the target application in exchange for the efficiency gains. The techniques discussed in later chapters, such as advanced PTQ methods and QAT, aim to minimize this performance degradation while still achieving significant compression.