While Post-Training Quantization aims to map the range of floating-point values to a lower-precision integer range, a significant challenge arises from the presence of outliers. Outliers are values in the weight or activation tensors that lie far outside the typical distribution of values. Even a single extreme value can disproportionately affect the quantization process, potentially leading to substantial accuracy degradation.
Most basic PTQ algorithms, like MinMax quantization, determine the scaling factor based on the absolute minimum and maximum values observed in a tensor (or a subset during calibration). Consider a tensor where most values are between -1.0 and 1.0, but a single outlier exists with a value of 100.0.
If we use MinMax scaling, the entire range from -1.0 to 100.0 must be mapped to the available integer range (e.g., -128 to 127 for INT8).
scale = (max_float - min_float) / (max_int - min_int) = (100.0 - (-1.0)) / (127 - (-128)) = 101.0 / 255 ≈ 0.396

This large scaling factor means the majority of the original values (between -1.0 and 1.0) will be compressed into a very small portion of the integer range:
All the nuanced information previously represented between -1.0 and 1.0 is now squeezed into just 7 integer values (round(x / 0.396) ranges from -3 to 3). The rest of the integer range exists mostly to represent the single outlier value of 100.0, which lands at the top of the range (round(100.0 / 0.396) plus the zero-point offset, clamped to 127). This loss of resolution for the bulk of the values often results in a significant drop in model performance.
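The effect described above is easy to reproduce numerically. The sketch below (an illustration using hypothetical helper names, not any particular library's API) quantizes the same distribution of values with and without the outlier and compares the mean absolute reconstruction error:

```python
import numpy as np

def minmax_quantize(x, num_bits=8):
    """Asymmetric MinMax quantization followed by dequantization."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale  # dequantized values

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)   # typical values
x_outlier = np.append(x, 100.0)         # same values plus one outlier

err_clean = np.abs(minmax_quantize(x) - x).mean()
# Measure error only on the inliers: the outlier widens the scale for everyone.
err_outlier = np.abs(minmax_quantize(x_outlier)[:-1] - x).mean()
print(err_clean, err_outlier)
```

The error on the inlier values grows by roughly the same factor that the range widened (about 50x here), even though only one value changed.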
The presence of an outlier significantly widens the required range for MinMax quantization, reducing precision for the majority of values.
Several techniques can mitigate the negative impact of outliers during PTQ:
Clipping (Saturation): Instead of using the absolute minimum and maximum values from the calibration data, we can choose thresholds (clipping values) and force any value outside these thresholds to the boundary value. This process is also known as saturation. For example, we might decide to clip the range based on the 1st and 99th percentiles of the observed values during calibration.
Clipping limits the quantization range by saturating outlier values to predefined minimum and maximum thresholds.
The choice of clipping threshold involves a trade-off: clipping too aggressively discards information from potentially important large values, while clipping too little retains the problem of poor resolution for common values. Selecting the right threshold often requires evaluating the impact on model accuracy using a representative validation dataset.
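Percentile-based clipping can be sketched in a few lines. The names below are illustrative, not a library API; the key point is that values outside the percentile range saturate to the boundary instead of widening the scale:

```python
import numpy as np

def quantize_with_range(x, lo, hi, num_bits=8):
    """Quantize using a chosen [lo, hi] range; values outside saturate."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)  # saturation
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
x = np.append(rng.uniform(-1.0, 1.0, 1000), 100.0)

# Clip the range at the 1st and 99th percentiles of the calibration data.
lo, hi = np.percentile(x, [1.0, 99.0])
xq = quantize_with_range(x, lo, hi)

err_inliers = np.abs(xq[:-1] - x[:-1]).mean()  # inliers keep fine resolution
print(err_inliers, xq[-1])  # the outlier saturates near hi (~1.0)
```

The inliers retain nearly the same precision as in the outlier-free case; the cost is that the outlier itself is represented by a value near the clipping threshold.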
Alternative Calibration Methods: Algorithms other than simple MinMax, such as those that minimize Kullback-Leibler (KL) divergence (entropy calibration) or mean squared error (MSE), tend to handle outliers more gracefully. These methods try to find quantization parameters that best preserve the overall distribution shape or minimize the error for the majority of values, rather than being dictated solely by the extreme values. Percentile-based calibration sets the range directly from percentiles, effectively clipping the outliers.
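MSE-based calibration can be sketched as a simple grid search over candidate clip thresholds, picking the one with the lowest quantization error on calibration data. This is a simplified illustration (symmetric quantization, uniform grid; mse_calibrate is a hypothetical name), not any framework's exact implementation:

```python
import numpy as np

def mse_calibrate(x, num_bits=8, num_candidates=50):
    """Search for the symmetric clip threshold that minimizes quantization MSE."""
    qmax = 2 ** (num_bits - 1) - 1
    amax = np.abs(x).max()
    best_t, best_mse = amax, np.inf
    for t in np.linspace(amax / num_candidates, amax, num_candidates):
        scale = t / qmax
        q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # values beyond t saturate
        mse = np.mean((q * scale - x) ** 2)
        if mse < best_mse:
            best_t, best_mse = t, mse
    return best_t

rng = np.random.default_rng(0)
x = np.append(rng.uniform(-1.0, 1.0, 400_000), 100.0)
t = mse_calibrate(x)
print(t)  # well below the outlier magnitude of 100.0
```

Because clipping one outlier hurts total MSE far less than coarsening the scale for every other value, the search settles on a threshold well below the absolute maximum.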
Finer Quantization Granularity: As introduced in Chapter 1, quantization can be applied per-tensor, per-channel, or per-group. Outliers are often not uniformly distributed across a large weight tensor. They might be concentrated in specific output channels of a linear layer, for instance.
Applying separate scales per channel (or per group) isolates an outlier's effect: only the channel containing it loses resolution, while every other channel keeps a tight quantization range. Using finer granularity increases the metadata overhead (more scale/zero-point values to store) and might require specific hardware support for optimal performance, but it significantly improves robustness to localized outliers.
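The benefit of finer granularity can be shown with a small weight matrix where one output channel contains an outlier (a synthetic sketch; the helper name is illustrative):

```python
import numpy as np

def quantize_symmetric(w, scale, num_bits=8):
    """Symmetric quantization followed by dequantization."""
    qmax = 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
# 4 output channels (rows); only channel 0 contains an outlier.
w = rng.uniform(-1.0, 1.0, size=(4, 256))
w[0, 0] = 50.0

# Per-tensor: one scale shared by all channels -> coarse resolution everywhere.
scale_tensor = np.abs(w).max() / 127
err_tensor = np.abs(quantize_symmetric(w, scale_tensor) - w).mean()

# Per-channel: one scale per row -> only channel 0 pays for its outlier.
scale_channel = np.abs(w).max(axis=1, keepdims=True) / 127
err_channel = np.abs(quantize_symmetric(w, scale_channel) - w).mean()
print(err_tensor, err_channel)
```

With per-channel scales, the three outlier-free channels keep their fine-grained resolution, so the average error drops substantially compared to the per-tensor case.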
Advanced Techniques (Looking Ahead): Methods like SmoothQuant (discussed in Chapter 3) explicitly tackle the issue of activation outliers, which are often more problematic than weight outliers in LLMs. SmoothQuant works by mathematically migrating the quantization difficulty from activations (which have a dynamic range that's hard to quantize) to weights (which are static and easier to quantize) without changing the layer's mathematical function. This often allows for effective INT8 quantization even when activations contain significant outliers.
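The core identity behind this migration can be sketched in a few lines: for Y = X @ W, dividing each input channel of X by a per-channel factor s and multiplying the corresponding row of W by the same factor leaves the output unchanged. The example below is a simplified illustration of that idea with synthetic data, not the full SmoothQuant algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 0.5, size=(8, 4))
x[:, 3] *= 40.0  # one activation channel with large outliers
w = rng.normal(0, 0.5, size=(4, 4))

# Per-channel smoothing factors; alpha controls how much difficulty
# is migrated from activations to weights (0.5 is a common default).
alpha = 0.5
s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)

x_smooth = x / s            # activation range shrinks
w_smooth = w * s[:, None]   # weights absorb the scale

# The layer's function is unchanged: (X / s) @ (s * W) == X @ W.
assert np.allclose(x @ w, x_smooth @ w_smooth)
print(np.abs(x).max(), np.abs(x_smooth).max())
```

After smoothing, the activation tensor's dynamic range is far narrower, which is exactly what makes subsequent INT8 activation quantization viable.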
Choosing the appropriate outlier handling strategy often involves experimentation. Starting with MinMax calibration on a representative dataset, evaluating accuracy, and then trying clipping or finer granularity if accuracy drops significantly is a common workflow. Monitoring the distribution of weights and activations can also provide insights into whether outliers are a likely cause of quantization errors.
© 2025 ApX Machine Learning