While quantizing the large linear layers in Large Language Models (LLMs) typically yields significant efficiency gains, other essential components, particularly the attention mechanism and normalization layers, present unique hurdles. Naively applying standard quantization techniques to these parts can disproportionately degrade model accuracy because their operations are often more sensitive to numerical precision than plain matrix multiplications. Understanding these sensitivities is essential for developing effective quantization strategies.
The self-attention mechanism is fundamental to Transformer architectures, enabling models to weigh the importance of different tokens when processing sequence data. However, several steps within the attention calculation are sensitive to quantization noise.
The softmax function, used to convert raw attention scores ($QK^T$) into probabilities, is notoriously difficult to quantize effectively. Because of its exponential nature, small changes in the input scores, particularly among the largest scores that dominate the exponentials, can lead to large changes in the output probabilities. Furthermore, the output distribution is constrained between 0 and 1. Quantization, particularly low-bit quantization, introduces noise that can distort these probabilities, potentially causing the model to attend to incorrect tokens or to distribute attention too broadly or too narrowly.
$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \dots, K$$

Quantizing the inputs $z_i$ (the raw attention scores) or intermediate values within the softmax computation can lead to significant errors in the final attention weights. Due to this sensitivity, a common strategy is to keep the softmax computation in higher precision, such as FP16 or even FP32, even if the surrounding matrix multiplications (Q, K, V projections) are quantized to lower bit-widths like INT8 or INT4.
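To make this concrete, here is a small NumPy sketch, with made-up scores and a simple round-to-grid fake quantizer, showing how low-bit quantization of the raw scores can merge nearly tied logits and shift the resulting probabilities. The values and the `fake_quant` helper are illustrative only, not taken from any particular library.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # standard numerical stabilization
    e = np.exp(z)
    return e / e.sum()

def fake_quant(x, bits):
    """Simulate symmetric uniform quantization (quantize, then dequantize)."""
    alpha = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / alpha) * alpha

# Hypothetical raw attention scores for one query position.
scores = np.array([6.2, 5.9, 1.3, 0.4, -2.1])

p_fp32 = softmax(scores)
p_int4 = softmax(fake_quant(scores, bits=4))

print("FP32 probs  :", np.round(p_fp32, 3))
print("INT4 probs  :", np.round(p_int4, 3))
print("max abs diff:", np.abs(p_fp32 - p_int4).max())
```

With a 4-bit grid, the two leading scores round to the same value, so the model can no longer distinguish which of the two tokens deserves more attention.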
The intermediate result of the query-key dot products ($QK^T$) often exhibits a high dynamic range. Some scores might be very large, while others are small but still potentially important for capturing subtle relationships. Standard quantization methods (like min-max scaling) struggle with such distributions. If the quantization range is set to accommodate the large outlier scores, the precision for the smaller scores becomes very coarse, effectively washing them out.
Consider the scaling factor $\alpha$ in symmetric quantization:

$$\alpha = \frac{\max(|X|)}{2^{b-1} - 1}$$

where $X$ is the tensor being quantized and $b$ is the bit-width. A large maximum absolute value $\max(|X|)$ leads to a large $\alpha$, resulting in a large quantization step size. This means smaller values in $X$ are mapped to values near zero after quantization, losing their information content.
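As a quick illustration, the following NumPy snippet, using an invented score vector containing one large outlier, computes this scale and shows the small scores collapsing to zero after a quantize/dequantize round trip.

```python
import numpy as np

def symmetric_quant(x, bits=8):
    """Symmetric quantization: alpha = max(|x|) / (2^(b-1) - 1)."""
    qmax = 2 ** (bits - 1) - 1
    alpha = np.abs(x).max() / qmax
    q = np.clip(np.round(x / alpha), -qmax, qmax)
    return q * alpha, alpha          # dequantized values and step size

# Mostly small scores plus one large outlier, as often seen in QK^T.
scores = np.array([0.02, -0.05, 0.11, 0.07, 55.0])

deq, alpha = symmetric_quant(scores, bits=8)
print("step size alpha:", alpha)     # ~0.43 for INT8
print("original :", scores)
print("dequant  :", deq)             # the small scores all collapse to 0
```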
Strategies to address this include clipping the calibration range (for example, using a percentile of the observed magnitudes rather than the absolute maximum), applying finer-grained scaling such as per-token or per-head quantization, and keeping the $QK^T$ scores in higher precision. A sketch of the clipping approach appears below.
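The following is a minimal sketch of the clipping idea, assuming a simple per-tensor scheme: the calibration range is limited to a percentile of the observed magnitudes before the scale is computed. The function name and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def clipped_symmetric_quant(x, bits=8, pct=99.5):
    """Symmetric quantization with percentile clipping of the range.

    Clipping the range to a percentile (instead of the absolute max)
    shrinks the step size, preserving resolution for small scores at
    the cost of saturating rare outliers.
    """
    clip_val = np.percentile(np.abs(x), pct)
    qmax = 2 ** (bits - 1) - 1
    alpha = clip_val / qmax
    q = np.clip(np.round(x / alpha), -qmax, qmax)
    return q * alpha

# Usage sketch: many small scores plus a few large outliers.
rng = np.random.default_rng(0)
scores = rng.normal(0.0, 0.1, size=4096)
scores[:4] = 40.0                                  # rare large outliers
err_naive = np.abs(scores - clipped_symmetric_quant(scores, pct=100)).mean()
err_clip = np.abs(scores - clipped_symmetric_quant(scores, pct=99.5)).mean()
print(err_naive, err_clip)                         # clipping lowers the mean error
```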
The final step in attention involves aggregating the Value vectors (V) weighted by the attention probabilities computed via softmax. Errors introduced during the quantization of V or the attention probabilities can accumulate during this weighted sum, leading to inaccuracies in the attention mechanism's output.
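The short NumPy check below, using synthetic attention probabilities and value vectors, gives a feel for how quantization error in both operands shows up in the aggregated output. The tensor shapes and the `fake_quant` helper are arbitrary choices for this sketch.

```python
import numpy as np

def fake_quant(x, bits=8):
    """Simulate symmetric quantize/dequantize."""
    alpha = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / alpha) * alpha

rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(128) * 0.1, size=64)   # peaked attention rows
V = rng.normal(size=(128, 64))                       # value vectors

out_fp32 = probs @ V
out_q = fake_quant(probs) @ fake_quant(V)            # both operands quantized
rel_err = np.linalg.norm(out_q - out_fp32) / np.linalg.norm(out_fp32)
print(f"relative error with both operands quantized: {rel_err:.3%}")
```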
Normalization layers like Layer Normalization (LayerNorm) and RMS Normalization (RMSNorm) are essential for stabilizing training and improving LLM performance. They operate by standardizing the activations within a layer.
LayerNorm computes the mean ($\mu$) and variance ($\sigma^2$) of activations within a layer, normalizes the activations, and then applies learnable scale ($\gamma$) and shift ($\beta$) parameters:
$$\text{LayerNorm}(x) = \gamma \, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

RMSNorm is a simpler variant that normalizes using the root mean square statistic, omitting the centering (mean subtraction) and the shift parameter ($\beta$):
$$\text{RMSNorm}(x) = \gamma \, \frac{x}{\sqrt{\text{RMS}(x)^2 + \epsilon}}, \qquad \text{RMS}(x) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}$$

The core challenge in quantizing normalization layers lies in the calculation of the statistics ($\mu$, $\sigma^2$, or $\text{RMS}$). These statistics are computed across the activation values. If the activations themselves are quantized with low precision, the resulting noise can significantly perturb the calculated mean, variance, or RMS. This error then propagates through the normalization formula, potentially destabilizing the network or altering its representational capacity.
For example, quantizing the inputs $x$ before calculating $\mu$ and $\sigma^2$ can lead to inaccurate estimates of these statistics. The division operation (by $\sqrt{\sigma^2 + \epsilon}$ or $\sqrt{\text{RMS}(x)^2 + \epsilon}$) is also sensitive to errors in the denominator, especially when the variance or RMS is small.
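The following toy example, assuming activations with a large offset and a small spread, shows how computing LayerNorm statistics from 4-bit fake-quantized inputs can badly inflate the variance estimate and distort the normalized output. All values and helpers are illustrative.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Statistics are computed over the (possibly quantized) activations.
    mu, var = x.mean(), x.var()
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def fake_quant(x, bits=4):
    """Simulate symmetric quantize/dequantize."""
    alpha = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / alpha) * alpha

rng = np.random.default_rng(0)
x = 3.0 + 0.05 * rng.standard_normal(4096)   # large offset, small spread

print("variance, FP32 input     :", x.var())
print("variance, 4-bit input    :", fake_quant(x).var())   # heavily inflated
print("output mismatch (L2 norm):",
      np.linalg.norm(layer_norm(fake_quant(x)) - layer_norm(x)))
```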
Given these challenges, a purely uniform low-bit quantization across the entire LLM is often suboptimal. A more practical and effective strategy is mixed-precision quantization. This involves selectively applying different precision levels to different parts of the model based on their sensitivity.
Typically, the weight-heavy linear projections (the Q, K, V, and output projections, along with the feed-forward layers) are quantized to lower bit-widths such as INT8 or INT4, while sensitive operations like the softmax and the normalization statistics are kept in higher precision such as FP16 or FP32.
The diagram below illustrates a potential mixed-precision scheme within a Transformer block segment.
A simplified view of potential mixed-precision application within a Transformer block, highlighting higher precision for softmax and normalization calculations while using lower precision for linear projections. Input/output types depend on the overall model configuration.
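As a rough companion to the diagram, here is a PyTorch sketch of a single-head attention block that simulates INT8 quantization (via fake quantize/dequantize) for the linear projections while keeping LayerNorm and softmax in FP32. The module name, the per-tensor scaling, and the single-head simplification are assumptions made for illustration, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant_int8(x: torch.Tensor) -> torch.Tensor:
    """Simulate symmetric per-tensor INT8 quantize/dequantize."""
    alpha = x.abs().max().clamp(min=1e-8) / 127
    return torch.round(x / alpha).clamp(-127, 127) * alpha

class MixedPrecisionAttention(nn.Module):
    """Single-head attention with simulated low-precision projections and
    full-precision normalization and softmax (illustrative only)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)        # statistics stay in FP32
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                         # FP32 normalization
        # Simulate INT8 weights and activations for the projections.
        q = F.linear(fake_quant_int8(h), fake_quant_int8(self.q_proj.weight),
                     self.q_proj.bias)
        k = F.linear(fake_quant_int8(h), fake_quant_int8(self.k_proj.weight),
                     self.k_proj.bias)
        v = F.linear(fake_quant_int8(h), fake_quant_int8(self.v_proj.weight),
                     self.v_proj.bias)
        scores = q @ k.transpose(-2, -1) * self.scale
        probs = torch.softmax(scores.float(), dim=-1)   # softmax in FP32
        out = probs @ v
        return x + F.linear(fake_quant_int8(out),
                            fake_quant_int8(self.o_proj.weight),
                            self.o_proj.bias)

# Usage sketch:
# block = MixedPrecisionAttention(512)
# y = block(torch.randn(2, 16, 512))
```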
Successfully quantizing LLMs requires moving beyond treating the model as a uniform sequence of operations. Attention and normalization layers demand specific consideration due to their operational characteristics and sensitivity to numerical precision. Employing mixed-precision approaches, potentially combined with QAT and specialized quantization schemes, is frequently necessary to achieve substantial efficiency gains without unacceptable losses in model accuracy. The PTQ techniques discussed in previous chapters (such as GPTQ and AWQ) primarily target linear layers; adapting or combining them with strategies for these sensitive components is a key aspect of advanced quantization practice.