Quantization is a set of techniques used to reduce the precision of numbers representing a model's weights and/or activations. Instead of using standard 32-bit floating-point (FP32) numbers, we map these values to lower-precision formats, most commonly 8-bit integers (INT8) but also increasingly lower-bit floating-point formats like FP8. As introduced, the primary motivation is performance improvement: lower-precision arithmetic is faster on supporting hardware, requires less memory bandwidth, consumes less power, and results in smaller model sizes. However, this compression is inherently lossy; the challenge lies in minimizing the impact on model accuracy while maximizing performance gains. This section examines the fundamental concepts governing this process.
At its core, quantization maps a real number R from a high-precision range [Rmin, Rmax] (e.g., observed FP32 values) to a quantized value Q within a lower-precision integer range [Qmin, Qmax] (e.g., [-128, 127] for signed INT8). This mapping is typically affine, defined by two parameters: the scale (S) and the zero-point (Z). In the general (asymmetric) case, S = (Rmax − Rmin) / (Qmax − Qmin) and Z = round(Qmin − Rmin / S): the scale is the real-valued step size between adjacent integer levels, and the zero-point is the integer code that represents the real value 0.0.
The quantization function maps a real value R to its quantized integer representation Q:
Q = clamp(round(R / S) + Z, Qmin, Qmax)
The round function typically rounds to the nearest integer (with ties broken arbitrarily or consistently, e.g., round-half-to-even). The clamp function ensures the result stays within the target integer range [Qmin, Qmax].
The corresponding dequantization function maps the integer Q back to an approximate real value R′: R′ = S × (Q − Z). Crucially, R′ is only an approximation of R. The difference R − R′ is the quantization error, the source of potential accuracy degradation. Minimizing this error across the model is the central goal of quantization techniques.
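As a concrete illustration, the following NumPy sketch implements the two mappings for signed INT8 and checks the round-trip error; the helper names and the example range are ours, not taken from any particular library.

```python
import numpy as np

def quantize(r, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization: real values -> INT8 codes."""
    q = np.round(r / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map INT8 codes back to approximate real values."""
    return scale * (q.astype(np.float32) - zero_point)

# Example: an asymmetric range observed for some tensor.
r = np.array([-0.4, 0.0, 0.37, 1.2], dtype=np.float32)
r_min, r_max = -0.5, 1.5
scale = (r_max - r_min) / (127 - (-128))           # step size S
zero_point = int(round(-128 - r_min / scale))      # integer code Z for real 0.0

q = quantize(r, scale, zero_point)
r_hat = dequantize(q, scale, zero_point)
print(q)                        # integer codes
print(r_hat)                    # approximate reconstruction R'
print(np.abs(r - r_hat).max())  # round-trip error, at most scale/2 for in-range values
```

For values inside the chosen range, the round-trip error is bounded by half the scale; values outside it are clipped and can incur much larger error.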
The choice of the real range [Rmin,Rmax] influences the calculation of S and Z and leads to two primary mapping schemes:
In symmetric quantization, the real value range is centered around zero, meaning Rmin=−Rmax. The maximum absolute value, max(∣Rmin∣,∣Rmax∣), determines the range.
Symmetric quantization is often preferred for weights, which tend to have distributions centered around zero. Its advantage lies in computational efficiency, as the zero-point offset Z doesn't need to be handled during calculations (it's implicitly 0). However, if the actual data distribution is heavily skewed, forcing a symmetric range can lead to larger quantization errors for values near the less-populated end of the distribution.
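As a minimal sketch of the symmetric case (variable names are ours; one common convention restricts the integer range to [−127, 127] so the grid is exactly symmetric):

```python
import numpy as np

def symmetric_int8_params(w):
    """Symmetric scale from the largest absolute value; zero-point is implicitly 0."""
    scale = np.abs(w).max() / 127.0
    return scale, 0

w = (np.random.randn(256, 128) * 0.05).astype(np.float32)   # weight-like values around zero
scale, zero_point = symmetric_int8_params(w)
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # no zero-point offset needed
```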
Asymmetric quantization uses the actual observed minimum and maximum values, Rmin and Rmax, without forcing symmetry. Both S and Z are calculated as defined earlier. The real value 0.0 maps exactly to the integer Z.
This scheme can represent the input range more accurately, especially for data distributions that are not centered around zero, such as the outputs of activation functions like ReLU (which are always non-negative). The trade-off is increased computational complexity, as the zero-point Z must be accounted for during arithmetic operations (e.g., adjusting results after multiplications).
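A corresponding sketch for the asymmetric case (again with our own names), using a ReLU-style non-negative activation to show the one-sided range and the exact mapping of 0.0 to Z:

```python
import numpy as np

def asymmetric_int8_params(x, qmin=-128, qmax=127):
    """Affine parameters from the observed range, keeping 0.0 exactly representable."""
    r_min, r_max = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (r_max - r_min) / (qmax - qmin)
    zero_point = int(round(qmin - r_min / scale))
    return scale, zero_point

x = np.maximum(np.random.randn(1024).astype(np.float32), 0.0)  # ReLU output: non-negative
scale, zero_point = asymmetric_int8_params(x)
q0 = int(np.clip(round(0.0 / scale) + zero_point, -128, 127))
assert q0 == zero_point   # the real value 0.0 maps exactly to the zero-point
```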
Comparison of symmetric and asymmetric mapping from a floating-point range to INT8. Symmetric forces the range around 0.0, mapping to integer 0. Asymmetric uses the actual range, mapping 0.0 to the calculated zero-point Z.
Another important aspect is the granularity at which the quantization parameters (S and Z) are applied:
Per-Tensor Quantization: A single pair of (S,Z) values is calculated and applied to all elements within an entire tensor (e.g., all weights in a convolutional layer, or all elements in an activation tensor). This is the simplest approach, minimizing the overhead of storing and managing quantization parameters. However, if the value distribution varies significantly across different parts of the tensor (e.g., different output channels in a convolution filter), using a single range [Rmin,Rmax] can be suboptimal, potentially clipping important values or underutilizing the quantization range for certain subsets of the tensor.
Per-Channel (or Per-Axis) Quantization: Separate (S,Z) pairs are calculated for slices of the tensor along a specific dimension or axis. For convolutional layer weights, this is most commonly done along the output channel axis (axis=0 for weights of shape [output_channels, input_channels, kH, kW]). This allows the quantization range to adapt more closely to the specific distribution of values within each channel, often leading to significantly better accuracy compared to per-tensor quantization, especially for deep convolutional networks. The trade-off is the increased number of S and Z parameters that need to be stored and used during computation. Other granularities, such as per-group quantization (grouping channels), also exist as intermediate options; the sketch after this list contrasts the per-tensor and per-channel cases.
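The following sketch (synthetic weights, symmetric scales, names ours) shows how a single outlier channel inflates the per-tensor scale while the per-channel scales stay tight for the remaining channels:

```python
import numpy as np

w = np.random.randn(64, 32, 3, 3).astype(np.float32)   # [out_ch, in_ch, kH, kW]
w[7] *= 20.0                                            # one channel with a much wider range

# Per-tensor: one scale for the whole tensor; the outlier channel dictates the step size.
per_tensor_scale = np.abs(w).max() / 127.0

# Per-channel along axis 0: one scale per output channel.
per_channel_scale = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / 127.0

q = np.clip(np.round(w / per_channel_scale[:, None, None, None]), -127, 127).astype(np.int8)
print(per_tensor_scale, per_channel_scale.min())  # most channels get a far finer step per-channel
```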
The choice between per-tensor and per-channel often depends on the specific layer type, its sensitivity to quantization noise, and the target hardware's ability to efficiently handle varying scales and zero-points.
The effectiveness of quantization hinges on selecting the right clipping range [Rmin,Rmax]. This range directly determines the scale S and potentially the zero-point Z. If the range is too narrow, values outside it will be clipped, introducing significant error. If the range is too wide, the quantization steps become large (larger S), reducing precision for values within the range.
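A quick numeric illustration of this tension (synthetic data, symmetric quantization, names ours): a tensor with one extreme outlier quantized with its full range versus a deliberately clipped range.

```python
import numpy as np

def quant_dequant(x, r_max, qmax=127):
    """Symmetric INT8 round trip for the clipping range [-r_max, r_max]."""
    scale = r_max / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return scale * q

x = np.random.randn(10_000).astype(np.float32)
x[0] = 40.0                                   # a single extreme outlier

for r_max in (40.0, 4.0):                     # wide range vs. narrow (clipping) range
    x_hat = quant_dequant(x, r_max)
    print(f"r_max={r_max:5.1f}  step={r_max / 127:.4f}  "
          f"max_abs_error={np.abs(x - x_hat).max():.3f}")
```

The wide range keeps every value representable but makes the step size roughly ten times coarser for the bulk of the distribution, while the narrow range gives fine steps but clips the outlier with a large error; calibration is about choosing where to sit on this spectrum.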
The process of determining the optimal range is called calibration. It typically involves running a small, representative dataset through the FP32 model, collecting statistics (such as running min/max values or histograms) for each tensor to be quantized, and then choosing the range with a criterion such as the raw min/max, a high percentile that discards rare outliers, or minimization of a divergence measure (e.g., KL divergence) between the original and quantized distributions.
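As an illustration of the idea only (a sketch, not any specific framework's observer API), a min/max collector and a percentile-based alternative could look like this:

```python
import numpy as np

def calibrate_minmax(batches):
    """Track the global min/max of a tensor over the calibration set."""
    r_min, r_max = np.inf, -np.inf
    for x in batches:
        r_min = min(r_min, float(x.min()))
        r_max = max(r_max, float(x.max()))
    return r_min, r_max

def calibrate_percentile(batches, pct=99.9):
    """Use high/low percentiles instead of the true extremes to ignore rare outliers."""
    values = np.concatenate([x.ravel() for x in batches])
    return float(np.percentile(values, 100.0 - pct)), float(np.percentile(values, pct))

# Stand-in calibration data: a few activation batches captured from the FP32 model.
batches = [np.random.randn(4, 256).astype(np.float32) for _ in range(8)]
print(calibrate_minmax(batches))
print(calibrate_percentile(batches))
```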
Calibration is primarily associated with Post-Training Quantization (PTQ), where a pre-trained FP32 model is converted to a lower-precision format without retraining. In Quantization-Aware Training (QAT), simulated quantization operations are inserted into the model during training, allowing the model to adapt its weights to the reduced precision and effectively learn the optimal ranges implicitly. However, even QAT might benefit from initial range estimates provided by a calibration step.
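To make the simulated quantization in QAT concrete, a fake-quantize op quantizes and immediately dequantizes a tensor in the forward pass, so the model trains against the rounding and clipping error. The sketch below shows only the forward computation (NumPy, names ours); real QAT additionally needs a straight-through estimator so gradients can flow through the rounding.

```python
import numpy as np

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Quantize then dequantize: output stays FP32 but carries the INT8 error."""
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (scale * (q - zero_point)).astype(np.float32)

x = np.random.randn(16).astype(np.float32)
scale, zero_point = 8.0 / 255.0, 0            # covers roughly [-4, 4]
x_q = fake_quantize(x, scale, zero_point)
print(np.abs(x - x_q).max())                  # the error the network learns to tolerate
```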
INT8 (8-bit integer) quantization is currently the most widely adopted low-precision format.
While INT8 offers substantial benefits, research and hardware development are pushing towards even lower precision, particularly with 8-bit floating-point formats (FP8). Unlike INT8, FP8 retains the exponent-mantissa structure of floating-point numbers, potentially offering a better balance between dynamic range and precision for certain applications, including training.
There isn't a single FP8 standard yet; two prominent variants defined by NVIDIA and supported by other industry players are E4M3 (one sign bit, 4 exponent bits, 3 mantissa bits), which favors precision over range and is typically used for weights and activations in the forward pass, and E5M2 (one sign bit, 5 exponent bits, 2 mantissa bits), which favors dynamic range and is typically used for gradients.
Bit allocation across different numerical formats used in ML. FP formats dedicate bits to exponent and mantissa, determining range and precision. INT8 uses all non-sign bits for magnitude. FP8 variants trade exponent bits (range) for mantissa bits (precision).
Trade-offs of FP8 vs. INT8: the exponent bits give FP8 a far wider dynamic range at the same bit width, making it more tolerant of outliers and usable even for gradients during training, but its few mantissa bits mean coarser precision at any given magnitude, and native support is currently limited to the most recent accelerator generations, whereas INT8 support is close to universal.
FP8 requires careful management of scaling factors, similar to FP16 mixed-precision training, often adjusted dynamically. Compilers play a significant role in selecting the appropriate FP8 format (E4M3 vs. E5M2) for different tensors and generating efficient code using hardware-specific FP8 instructions.
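As a rough sketch of that scaling management (the maximum finite magnitudes 448 for E4M3 and 57344 for E5M2 follow the common definitions of these formats; the cast itself is only simulated here by clipping, since real FP8 conversion is a hardware operation, and production systems usually track a history of amax values rather than recomputing the scale every step):

```python
import numpy as np

FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}     # largest finite magnitudes per format

def fp8_scale(x, fmt="e4m3"):
    """Per-tensor scaling factor that places the tensor's amax at the FP8 maximum."""
    amax = float(np.abs(x).max())
    return FP8_MAX[fmt] / max(amax, 1e-12)

x = (np.random.randn(1024) * 3.0).astype(np.float32)
s = fp8_scale(x, "e4m3")
x_scaled = np.clip(x * s, -FP8_MAX["e4m3"], FP8_MAX["e4m3"])  # would be cast to FP8 here
x_restored = x_scaled / s                                     # unscale after the FP8 operation
```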
Understanding these fundamental concepts, including the mapping schemes, calibration, granularity, and the characteristics of INT8 and FP8, is a prerequisite for designing and implementing the compiler and runtime techniques needed to harness the performance benefits of low-precision computation, which we will examine in subsequent sections.