Basic Post-Training Quantization (PTQ) methods, like MinMax calibration, treat all weights within a layer more or less equally when determining quantization parameters. However, this uniform approach can lead to noticeable accuracy loss, especially when targeting very low precisions like 4-bit integers (INT4). Why? Because not all weights contribute equally to the model's output. Some weights are multiplied by activations with consistently large magnitudes, making their precise representation more important for preserving the overall function of the layer. Conversely, quantizing weights associated with small activations typically has a lesser impact.
Activation-aware Weight Quantization (AWQ) addresses this observation directly. It's an advanced PTQ technique designed to protect the weights that matter most by analyzing the typical magnitudes of activations encountered during inference.
The central principle behind AWQ is that quantization error is more detrimental for weights multiplied by large activation values. Consider a simple matrix multiplication in a linear layer: $Y = WX$, where $W$ is the weight matrix and $X$ is the input activation vector. If a particular weight $W_{ij}$ is frequently multiplied by a large activation value $X_j$, any error introduced when quantizing $W_{ij}$ will be amplified, significantly impacting the output $Y_i$.
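To make this amplification concrete, here is a minimal PyTorch illustration. The layer sizes, the injected perturbation, and the choice of outlier channel are arbitrary stand-ins, not part of AWQ itself:

```python
import torch

torch.manual_seed(0)

W = torch.randn(4, 8)     # toy weight matrix
x = torch.randn(8)        # toy input activations
x[3] = 50.0               # channel 3 carries an unusually large activation

y_ref = W @ x

# Apply the same small perturbation (standing in for quantization error)
# to a weight in the large-activation column and in an ordinary column.
for col in (3, 5):
    W_err = W.clone()
    W_err[0, col] += 0.01
    err = (W_err @ x - y_ref).abs().max().item()
    print(f"column {col}: max output error = {err:.4f}")
```

The identical weight perturbation produces a much larger output error when it sits in the column multiplied by the large activation.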
AWQ proposes that we should prioritize preserving the precision of these "salient" weights. It identifies these weights not by looking at the weights themselves, but by examining the activations they interact with. Using a small calibration dataset, AWQ observes the distribution of activation magnitudes for each input channel to a weight matrix. Channels that consistently show large activation values indicate that the corresponding weights are more critical.
AWQ estimates the importance of each weight channel based on the activation scale. It processes the calibration data through the model, recording the activation statistics (typically the maximum absolute value or a high percentile) for each input feature channel feeding into a linear layer. A small fraction of channels (e.g., 1% or even 0.1%) consistently exhibiting the largest activation magnitudes are deemed the most important. The weights connected to these activation channels are the ones AWQ aims to protect during quantization.
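As a sketch of how these statistics could be gathered in practice, the snippet below registers PyTorch forward hooks on every linear layer and records the per-channel maximum absolute input over a calibration pass. The helper name `collect_activation_stats` and the assumption that calibration batches can be passed directly to `model(...)` are illustrative, not part of any specific AWQ implementation:

```python
import torch
import torch.nn as nn

def collect_activation_stats(model, calib_loader):
    """Record the per-channel max |input| for every nn.Linear in the model."""
    stats, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach()
            # Collapse batch/sequence dims, keep the input-feature dimension.
            per_channel_max = x.reshape(-1, x.shape[-1]).abs().amax(dim=0)
            stats[name] = (
                torch.maximum(stats[name], per_channel_max)
                if name in stats else per_channel_max
            )
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            handles.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        for batch in calib_loader:
            model(batch)          # assumes batches are direct model inputs

    for h in handles:
        h.remove()
    return stats                  # {layer name: tensor of per-channel maxima}
```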
Instead of devising a complex, non-uniform quantization scheme, AWQ uses a clever pre-processing step: per-channel weight scaling. The goal is to scale up the salient weight channels before applying standard quantization so that they suffer less relative quantization error, while the quantization scheme itself stays unchanged.
Imagine a weight matrix $W$. AWQ calculates a scaling factor $s_j$ for each input channel $j$. This factor is determined primarily by the observed magnitude of the corresponding activation channel $X_j$. A common approach is to set $s_j > 1$ for channels $j$ where activations $X_j$ are large, so that the scaled weights $W'_{:,j} = s_j \, W_{:,j}$ occupy a larger portion of the quantization range and are therefore represented more precisely.
To maintain the mathematical equivalence of the layer's computation, this scaling must be compensated for. The original operation $Y = WX$ can be rewritten as:

$$Y = (W S)(S^{-1} X)$$

where $S$ is a diagonal matrix containing the scaling factors $s_j$. The weights $W$ are scaled up by $S$ (element-wise multiplication of each column $j$ by $s_j$), resulting in $W' = W S$. Quantization, typically symmetric per-channel INT4, is then applied to this scaled weight matrix $W'$. The corresponding activations $X$ are scaled down by $S^{-1}$ before the multiplication.
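The snippet below sketches this rewrite together with plain symmetric per-channel INT4 rounding. It assumes a standalone weight matrix and uses random placeholder values for the scales `s`; the activation-derived scales are discussed next:

```python
import torch

def quantize_symmetric_per_channel(w, n_bits=4):
    """Symmetric per-output-channel quantization: one scale per row of w."""
    qmax = 2 ** (n_bits - 1) - 1                      # 7 for INT4
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # per-row step size
    w_int = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return w_int * scale                              # dequantized weights

torch.manual_seed(0)
W = torch.randn(16, 64)            # [out_features, in_features]
X = torch.randn(64, 8)             # [in_features, tokens]
s = torch.rand(64) + 0.5           # placeholder per-input-channel scales

W_scaled = W * s                   # W' = W S    (scale column j by s_j)
X_scaled = X / s[:, None]          # X' = S^-1 X (undo the scale on channel j)

# The rewrite is exact before quantization:
assert torch.allclose(W @ X, W_scaled @ X_scaled, atol=1e-4)

# Only the (scaled) weights are rounded to INT4; activations stay in float.
Y_quant = quantize_symmetric_per_channel(W_scaled) @ X_scaled
```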
This scaling operation effectively transfers some of the quantization difficulty from the weights to the activations. AWQ operates on the premise that activations are often easier to represent accurately, or that the scaling can sometimes be fused or absorbed into preceding operations (like Layer Normalization) with minimal overhead.
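For instance, if the linear layer is fed directly (and exclusively) by a LayerNorm, the compensating $S^{-1}$ scaling can be folded into the normalization parameters so that no extra operation runs at inference time. The sketch below uses arbitrary illustrative scales:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
ln = nn.LayerNorm(d)
lin = nn.Linear(d, 16)
s = torch.rand(d) + 0.5            # per-channel scales (illustrative values)

x = torch.randn(4, d)
y_ref = lin(ln(x))

with torch.no_grad():
    # Fold the S^-1 activation scaling into the LayerNorm parameters...
    ln.weight.div_(s)
    ln.bias.div_(s)
    # ...and apply the compensating scale S to the linear layer's columns.
    lin.weight.mul_(s)             # scales input channel j of the weight by s_j

y_scaled = lin(ln(x))
print(torch.allclose(y_ref, y_scaled, atol=1e-5))   # True: output unchanged
```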
The scaling factor $s_j$ for channel $j$ is often calculated based on both activation and weight ranges to balance the scaling, for instance:

$$s_j = \frac{\max(|X_j|)^{\alpha}}{\max(|W_{:,j}|)}$$

Here, $\max(|X_j|)$ is the maximum absolute value observed for activation channel $j$ in the calibration set, $\max(|W_{:,j}|)$ is the maximum absolute value of the corresponding weight channel (column), and $\alpha$ is a hyperparameter (often between 0.5 and 1) that controls the strength of the scaling based on activation magnitude. A higher $\alpha$ gives more importance to the activation scale.
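A small helper following this formula might look like the sketch below; the function name, the default $\alpha$, and the random inputs are illustrative only:

```python
import torch

def compute_channel_scales(act_max, weight, alpha=0.7, eps=1e-8):
    """Per-input-channel scales s_j = max|X_j|^alpha / max|W[:, j]|.

    act_max: tensor [in_features], max |activation| per channel (calibration).
    weight:  tensor [out_features, in_features] of the linear layer.
    alpha:   exponent balancing activation vs. weight magnitudes.
    """
    w_max = weight.abs().amax(dim=0)                  # max |W[:, j]| per column
    return act_max.clamp(min=eps).pow(alpha) / w_max.clamp(min=eps)

# Illustrative usage with random stand-ins for real statistics.
act_max = torch.rand(64) * 10
weight = torch.randn(16, 64)
s = compute_channel_scales(act_max, weight, alpha=0.7)
```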
Conceptual flow of AWQ. The original weight matrix $W$ is scaled per channel based on activation statistics, resulting in $W'$. This $W'$ is quantized. The activations $X$ receive the compensating inverse scale, giving $X'$, before the multiplication so that the output is preserved.
AWQ offers several advantages: it is a pure post-training method, needing only a small calibration pass rather than any retraining; the scaled weights can be quantized with standard, hardware-friendly schemes such as symmetric per-channel INT4; and it avoids the heavier layer-wise optimization used by methods like GPTQ, which keeps the quantization process fast.
However, there are considerations: a representative calibration dataset is required, since unrepresentative activation statistics can misidentify the salient channels; the hyperparameter $\alpha$ must be chosen or searched for each model; and the compensating activation scaling has to be fused into preceding operations (such as Layer Normalization) or applied at runtime, which requires support from the inference framework.
AWQ has proven effective for quantizing various LLMs down to 4-bit precision with minimal accuracy loss. Libraries and frameworks often provide implementations that handle the calibration, scaling calculation, and application during inference. When using AWQ, you typically provide the pre-trained model and a small calibration dataset. The process outputs the quantized weights and the necessary scaling factors, ready for deployment.
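As one concrete example, a typical workflow with the open-source AutoAWQ library looks roughly like the sketch below. The exact API and configuration keys can change between versions, so treat this as an outline and consult the library's documentation:

```python
# pip install autoawq transformers
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"          # any AWQ-supported causal LM
quant_path = "opt-125m-awq"               # where the quantized model is written

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit weights with group-wise quantization; the library handles the
# calibration pass and scale computation internally.
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```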
Compared to SmoothQuant, which also addresses activation outliers, AWQ's approach is different. SmoothQuant migrates quantization difficulty from activations to weights via static scaling. AWQ selectively scales weights based on activation importance, aiming to make the quantization of important weights easier. Both techniques aim to improve low-bit quantization accuracy but use distinct mechanisms. Compared to GPTQ, AWQ avoids the computationally more intensive Hessian matrix approximations used in GPTQ's layer-wise optimization.