As we push quantization to lower bit depths, like INT4 or even below, we often encounter a significant obstacle: the presence of outlier values in both weights and activations. These outliers are values with unusually large magnitudes compared to the majority of values in a tensor. While they might represent only a small fraction of the total elements, their disproportionate influence can severely degrade the accuracy of quantized models.
Why are outliers so problematic for quantization? Standard quantization techniques typically map the entire range of floating-point values within a tensor to a fixed number of discrete integer levels. A single large outlier can dramatically expand this required range.
Consider a tensor X that we want to quantize. A common approach is to find the maximum absolute value, max(∣X∣), and use it to determine the scaling factor. If X contains an outlier with a very large magnitude, max(∣X∣) will be dominated by this outlier. This forces the quantization step size, Δ, to be large to cover the entire range.
$$\Delta = \frac{2 \cdot \max(|X|)}{2^b - 1}$$

(for symmetric quantization to $b$ bits)
A large Δ means that the smaller, more numerous values in the tensor are mapped to a very limited set of integer levels, potentially even collapsing many distinct floating-point values to the same integer representation. This loss of precision for the majority of values leads to significant quantization error and, consequently, noticeable accuracy degradation in the LLM's performance. Activation outliers are particularly challenging because their values are input-dependent and can vary significantly during inference.
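To see this effect numerically, here is a minimal PyTorch sketch of simulated ("fake") symmetric quantization using the step-size formula above. The tensor contents, the single injected outlier, and the 4-bit setting are illustrative, not taken from any particular model.

```python
import torch

def fake_quantize_symmetric(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Simulate symmetric b-bit quantization: round to the nearest level, then dequantize."""
    # Step size from the formula above: the full range 2*max(|X|) spread over 2^b - 1 levels.
    delta = 2 * x.abs().max() / (2 ** bits - 1)
    q = torch.round(x / delta).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * delta

# Many small values plus a single large outlier.
x = torch.randn(10_000) * 0.1
x[0] = 50.0  # the outlier dominates max(|X|)

err_with = (fake_quantize_symmetric(x) - x).abs().mean().item()
err_without = (fake_quantize_symmetric(x[1:]) - x[1:]).abs().mean().item()
print(f"mean |error| with outlier:    {err_with:.4f}")
print(f"mean |error| without outlier: {err_without:.4f}")
```

With the outlier present, Δ is so large that nearly all of the small values round to zero; removing that single value lets Δ shrink and the average error drops substantially.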
Identifying Outliers
The first step in handling outliers is identifying them. A common approach is to analyze the distribution of values within weight tensors and representative activation tensors (obtained using calibration data).
- Weight Outliers: These are relatively straightforward to identify as weights are static after training. We can simply scan the weight tensors of the pre-trained model. Histograms or statistical measures like standard deviation can reveal the presence of values far from the mean.
- Activation Outliers: Identifying activation outliers requires passing calibration data through the model and recording the intermediate activation values. Since activations change with each input, we need a representative sample to understand typical activation distributions and identify recurring outlier patterns. A sketch covering both checks follows this list.
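As a rough starting point, the sketch below (written against PyTorch, with `nn.Linear` modules as the example target) scans weight matrices for values far from the mean and uses forward hooks to record per-layer activation maxima during a calibration pass. The 6-sigma threshold and the assumption that `calib_loader` yields batches the model can consume directly are placeholders.

```python
import torch
import torch.nn as nn

def report_weight_outliers(model: nn.Module, sigma: float = 6.0) -> None:
    """Count, per Linear layer, how many weights lie more than `sigma` std devs from the mean."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            w = module.weight.detach().float()
            z = (w - w.mean()).abs() / w.std()
            print(f"{name}: {(z > sigma).sum().item()} outliers, max |w| = {w.abs().max().item():.4f}")

def record_activation_maxima(model: nn.Module, calib_loader) -> dict:
    """Track the maximum absolute activation seen at each Linear layer's output."""
    stats, handles = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            stats[name] = max(stats.get(name, 0.0), output.detach().abs().max().item())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            handles.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for batch in calib_loader:   # assumed to yield model-ready inputs
            model(batch)

    for h in handles:
        h.remove()
    return stats
```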
A visualization often helps. Consider the distribution of values in a specific layer's weight tensor.
A typical weight distribution in an LLM layer. Most values are clustered near zero, but a few high-magnitude outliers exist, significantly extending the required quantization range.
Mitigation Strategies
Once identified, several strategies can be employed to mitigate the negative impact of outliers.
1. Clipping
The simplest approach is clipping. Values exceeding a certain threshold are clamped to that threshold before quantization.
- How it works: Define a clipping threshold, τ. Any value ∣x∣>τ is set to sign(x)⋅τ. Quantization is then performed on the clipped tensor.
- Choosing τ: This is the main challenge. Setting τ too low clips many informative values, hurting accuracy; setting it too high fails to mitigate the outlier effect. Common strategies select the threshold from a percentile (e.g., clipping the top 0.1% of magnitudes) or a multiple of the standard deviation, and either choice requires careful tuning against a calibration set (a percentile-based sketch follows this list).
- Pros: Simple to implement.
- Cons: Can discard important information contained in the true outlier values. Finding the optimal τ can be difficult and data-dependent.
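A minimal sketch of percentile-based clipping with PyTorch; the 99.9th-percentile threshold and the tensor shape are illustrative and would normally be tuned on calibration data.

```python
import torch

def clip_by_percentile(x: torch.Tensor, pct: float = 99.9) -> torch.Tensor:
    """Clamp magnitudes above the given percentile of |x| before quantization."""
    tau = torch.quantile(x.abs().float().flatten(), pct / 100.0).item()
    return x.clamp(min=-tau, max=tau)

# Clip the top 0.1% of magnitudes in a (synthetic) weight tensor, then quantize as usual.
w = torch.randn(1024, 1024) * 0.02
w_clipped = clip_by_percentile(w, pct=99.9)
print(w.abs().max().item(), w_clipped.abs().max().item())
```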
2. Per-Channel or Per-Group Quantization
Instead of using a single scaling factor for an entire tensor (per-tensor quantization), we can use separate scaling factors for smaller subsets of the data.
- Per-Channel: For weight tensors in layers like `Linear` or `Conv2D`, we can compute separate scaling factors for each output channel. This isolates the impact of an outlier within one channel from affecting the quantization of others. If a weight matrix has shape `[output_features, input_features]`, we compute `output_features` different scaling factors.
- Per-Group (Fine-grained): We can further divide channels or tokens into smaller groups and quantize each group independently. For instance, grouping weights along the input feature dimension (e.g., groups of 64 or 128 weights) can further isolate outliers; both variants are sketched after this list.
- Pros: Significantly reduces the impact of outliers by localizing the quantization range. Generally improves accuracy compared to per-tensor quantization, especially at low bit depths. Supported by many hardware backends.
- Cons: Increases the metadata overhead (more scaling factors to store). May introduce slight computational overhead during dequantization compared to per-tensor.
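The sketch below applies the step-size formula from earlier once per output channel and once per group of 128 input weights; the weight shape and group size are illustrative.

```python
import torch

def per_channel_step(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """One symmetric step size per output channel (row) of a [out_features, in_features] weight."""
    return 2 * w.abs().amax(dim=1) / (2 ** bits - 1)            # shape: [out_features]

def per_group_step(w: torch.Tensor, group_size: int = 128, bits: int = 4) -> torch.Tensor:
    """One step size per contiguous group of `group_size` weights along the input dimension."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0, "input dim must be divisible by the group size"
    grouped = w.reshape(out_f, in_f // group_size, group_size)
    return 2 * grouped.abs().amax(dim=-1) / (2 ** bits - 1)     # shape: [out_features, n_groups]

w = torch.randn(4096, 4096) * 0.02
print(per_channel_step(w).shape)                  # torch.Size([4096])
print(per_group_step(w, group_size=128).shape)    # torch.Size([4096, 32])
```

The per-group variant stores 32 times as many scales per row in this example, which is exactly the metadata overhead mentioned above.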
3. Mixed-Precision Quantization
Not all parts of a model need to be quantized to the same bit depth. We can strategically use higher precision for layers or tensors identified as being particularly sensitive to outliers or quantization noise.
- How it works: Identify layers or tensor types (e.g., specific attention components, layer norms, or layers with significant outlier presence) that suffer most from low-bit quantization. Keep these components in higher precision (e.g., FP16 or INT8) while quantizing the bulk of the model (e.g., large linear layers) more aggressively (e.g., INT4). A simple rule-based precision assignment is sketched after this list.
- Pros: Offers a flexible way to balance performance gains and accuracy preservation. Can specifically target problematic areas.
- Cons: Requires careful analysis to identify sensitive components. Can complicate the deployment pipeline, as inference engines need to support mixed-precision execution efficiently.
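One lightweight way to express such a policy is a list of layer-name patterns mapped to precisions, as in the sketch below; the patterns and assignments are hypothetical and would come from a per-layer sensitivity analysis of the actual model.

```python
# Hypothetical precision policy: sensitive components stay in higher precision,
# large projection layers are quantized aggressively. First matching pattern wins.
PRECISION_RULES = [
    ("lm_head",   "fp16"),   # output projection kept in FP16
    ("layernorm", "fp16"),   # normalization layers left unquantized
    ("self_attn", "int8"),   # attention projections at INT8
    ("mlp",       "int4"),   # large feed-forward layers at INT4
]

def precision_for(layer_name: str, default: str = "int8") -> str:
    """Return the target precision for a layer based on its (framework-specific) name."""
    for pattern, precision in PRECISION_RULES:
        if pattern in layer_name.lower():
            return precision
    return default

print(precision_for("model.layers.0.mlp.down_proj"))    # int4
print(precision_for("model.layers.0.input_layernorm"))  # fp16
```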
4. Activation-Aware Weight Quantization (AWQ)
AWQ recognizes that not all weights are equally important. Weights connected to activations with consistently large magnitudes are more sensitive to quantization errors.
- How it works: AWQ identifies salient weight channels (those multiplied by large activation inputs) using calibration data. It then scales the weights before quantization so that quantization error on these salient channels is reduced, often at the expense of less important ones. Concretely, the weight matrix is scaled along its input-channel dimension, and the inverse scaling is folded into the activations feeding the layer (often absorbed into the preceding operation), preserving the mathematical equivalence of the layer's output. The scaling strength is chosen, via calibration, to minimize the resulting quantization error, effectively protecting the weights that interact with large activation magnitudes (a simplified sketch follows this list).
- Pros: Specifically designed to handle the interplay between weights and activation magnitudes. Often achieves better accuracy preservation than simple clipping or per-tensor quantization, especially for weights.
- Cons: Relies on representative calibration data. Introduces modifications to weights that need to be accounted for during inference (often absorbed into layer scales).
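Below is a simplified sketch of the channel-scaling idea, not the full AWQ method (which also grid-searches the scaling exponent per layer to minimize output error). Weight input channels are scaled up in proportion to calibration-measured activation magnitudes, and the inverse scale is folded into the layer's input; the shapes, the fixed α = 0.5, and the normalization are illustrative.

```python
import torch

def awq_style_scale(w: torch.Tensor, act_abs_mean: torch.Tensor, alpha: float = 0.5):
    """Scale weight input channels by the magnitude of the activations they multiply.

    w:            [out_features, in_features] weight matrix
    act_abs_mean: [in_features] mean |activation| per input channel (from calibration)
    alpha:        migration strength; AWQ searches this per layer, 0.5 is fixed here
    """
    s = act_abs_mean.clamp(min=1e-5) ** alpha    # larger scale for salient channels
    s = s / (s.max() * s.min()).sqrt()           # keep scales roughly centered around 1
    return w * s, s                              # 1/s must be folded into the layer's input

# The layer output is preserved: (x / s) @ (w * s).T == x @ w.T
x = torch.randn(8, 512).abs()
w = torch.randn(1024, 512) * 0.02
w_scaled, s = awq_style_scale(w, x.abs().mean(dim=0))
assert torch.allclose(x @ w.T, (x / s) @ w_scaled.T, atol=1e-4)
```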
5. SmoothQuant
SmoothQuant addresses the specific challenge of activation outliers, which make activations considerably harder to quantize than the comparatively well-behaved weights. Large activation outliers often occur in specific channels across multiple tokens.
- How it works: SmoothQuant migrates the quantization difficulty from activations (hard to quantize due to outliers) to weights (easier to quantize). It does this by applying a mathematically equivalent scaling transformation: it scales down the problematic activation channels by a factor s and scales up the corresponding weights (along the input channel dimension) by the same factor s.
$$Y = \left(X \cdot \mathrm{diag}(s)^{-1}\right) \cdot \left(\mathrm{diag}(s) \cdot W\right) = XW$$
This smoothing factor s is chosen per-channel to balance the dynamic ranges of activations and weights, making both easier to quantize accurately using standard techniques like per-tensor quantization.
- Pros: Directly tackles activation outliers. Relatively simple concept that preserves mathematical equivalence. Can enable per-tensor quantization for activations where it previously failed.
- Cons: Requires calibration data to determine the smoothing factors. Modifies weights, potentially interacting with other techniques like weight-only quantization if not handled carefully. Adds a scaling operation or requires adjusted weights/scales during inference.
The SmoothQuant technique transforms activations (X) and weights (W) into scaled versions (X' and W') such that the final output remains the same, but the scaled tensors have distributions more amenable to quantization.
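A minimal sketch of the smoothing computation, assuming per-channel activation maxima collected from calibration data and the commonly used balance exponent α = 0.5; tensor shapes and values are synthetic.

```python
import torch

def smoothquant_scales(act_abs_max: torch.Tensor, w: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    w_abs_max = w.abs().amax(dim=0)    # [in_features], per-column weight range
    return act_abs_max.clamp(min=1e-5) ** alpha / w_abs_max.clamp(min=1e-5) ** (1 - alpha)

def apply_smoothing(x: torch.Tensor, w: torch.Tensor, s: torch.Tensor):
    """X' = X / s and W' = W * s, so that X' @ W'.T equals X @ W.T."""
    return x / s, w * s

# Synthetic data with channel-wise activation outliers.
x = torch.randn(8, 512).abs() * torch.rand(512) * 10
w = torch.randn(1024, 512) * 0.02
s = smoothquant_scales(x.abs().amax(dim=0), w)
x_s, w_s = apply_smoothing(x, w, s)
assert torch.allclose(x @ w.T, x_s @ w_s.T, atol=1e-4)   # output unchanged
print(x.abs().max().item(), x_s.abs().max().item())      # activation range shrinks
```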
Choosing the Right Strategy
The best strategy often depends on the specific model architecture, the target bit depth, the hardware constraints, and the acceptable accuracy tolerance.
- For moderate quantization (e.g., INT8), per-channel weight quantization might suffice.
- For aggressive quantization (e.g., INT4 or lower), techniques like SmoothQuant (for activations) and AWQ (for weights), potentially combined with per-group or per-channel quantization, are often necessary.
- Mixed-precision provides a fallback for highly sensitive parts of the model.
- Clipping is a simpler starting point but often requires careful tuning and may yield suboptimal results compared to more advanced methods.
Handling outliers is a practical necessity for achieving good performance with aggressively quantized LLMs. By understanding their impact and applying targeted mitigation techniques, you can significantly improve the accuracy-performance trade-off for your deployed models. The next section will explore the specific challenges related to quantizing different components within an LLM, such as attention mechanisms.