As we saw previously, basic Post-Training Quantization (PTQ) methods can struggle when applied aggressively, particularly at lower bit-widths such as INT4, or even INT8 when both weights and activations are quantized. A primary reason for this difficulty is the distribution of values within the model, specifically the presence of large-magnitude values, often called outliers, in the activation tensors.
Large Language Models frequently exhibit activation tensors in which a few values are significantly larger than the rest. This is common in layers such as attention mechanisms or the intermediate outputs of MLP blocks. When quantizing these activations, the outliers dramatically expand the range $[\min(X), \max(X)]$ used to determine the quantization scale and zero-point. With a per-tensor quantization scheme, the scale factor is stretched to accommodate these few large values, forcing the majority of the smaller, more common activation values into a very small number of quantization bins. The result is a significant loss of precision and, ultimately, reduced model accuracy.
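To make this concrete, here is a minimal NumPy sketch (an illustration, not code from any particular library) of symmetric per-tensor INT8 quantization applied to a hypothetical activation vector containing one outlier. The outlier alone sets the scale, and the remaining values collapse to zero:

```python
import numpy as np

def quantize_int8_per_tensor(x):
    """Symmetric per-tensor INT8 quantization: scale derived from the max magnitude."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Hypothetical activation vector: mostly small values plus one outlier.
acts = np.array([0.02, -0.05, 0.03, 0.01, -0.04, 60.0], dtype=np.float32)

q, scale = quantize_int8_per_tensor(acts)
dequant = q.astype(np.float32) * scale

print("scale:", scale)          # ~0.47, dictated entirely by the outlier
print("quantized:", q)          # the small values all round to 0
print("dequantized:", dequant)  # information about the small values is lost
```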
Imagine measuring the heights of everyone in a room that contains both average-sized people and a towering basketball player, using a ruler with only a fixed number of tick marks. Stretching the ruler to cover the basketball player makes each tick so coarse that the small height differences among the average-sized individuals can no longer be resolved.
SmoothQuant offers an elegant solution to this problem. Instead of directly trying to quantize activations with difficult distributions, SmoothQuant aims to migrate the quantization difficulty from the activations (which are hard to quantize due to outliers) to the weights (which are generally easier to quantize). It achieves this by applying a mathematically equivalent transformation before quantization.
Consider a typical matrix multiplication in an LLM layer:

$$Y = XW$$

Here, $X$ represents the input activation tensor, and $W$ represents the weight tensor. SmoothQuant introduces a diagonal scaling matrix $S$ (meaning all off-diagonal elements are zero) and transforms the equation like this:

$$Y = X S^{-1} S W = (X S^{-1})(S W)$$

Let's define the transformed activations and weights: $\hat{X} = X S^{-1}$ and $\hat{W} = S W$. The computation remains mathematically identical: $Y = \hat{X}\hat{W}$. However, the key idea is to choose the scaling factors in $S$ carefully. The goal is to make the transformed activations $\hat{X}$ "smoother", meaning they have a smaller dynamic range ($\max(|\hat{X}|)$ is reduced compared to $\max(|X|)$) and are therefore easier to quantize accurately.
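As a sanity check, the NumPy sketch below applies this transformation with an arbitrary placeholder choice of the per-channel smoothing factors (the principled choice is described shortly) and verifies that the product is unchanged while the activation range shrinks. The shapes and the injected outlier channel are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: X is (tokens, in_channels), W is (in_channels, out_channels).
X = rng.normal(size=(4, 8)).astype(np.float32)
X[:, 3] *= 50.0                      # inject an outlier channel into the activations
W = rng.normal(size=(8, 16)).astype(np.float32)

# A placeholder per-channel smoothing vector s (one entry per input channel).
s = np.maximum(np.abs(X).max(axis=0), 1e-5) ** 0.5

X_hat = X / s                        # X_hat = X @ diag(s)^-1
W_hat = W * s[:, None]               # W_hat = diag(s) @ W

# The product is unchanged up to floating-point error...
assert np.allclose(X @ W, X_hat @ W_hat, atol=1e-3)

# ...but the activation dynamic range has shrunk.
print("max |X|    :", np.abs(X).max())
print("max |X_hat|:", np.abs(X_hat).max())
```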
The scaling applied to the activations ($S^{-1}$) is counteracted by applying the inverse scaling ($S$) to the weights. This means the values in $\hat{W}$ will generally become larger than those in $W$. However, weights usually have more well-behaved distributions than outlier-heavy activations, making them more resilient to this increase in magnitude during quantization. We have effectively shifted the scaling challenge from the problematic activations to the more manageable weights.
The scaling factors $s_j$ (the diagonal elements of $S$) are typically calculated on a per-channel basis, where $j$ indexes the input channels shared by the activations and weights (the columns of $X$ and the rows of $W$). The calculation aims to balance the dynamic range between the activations and the weights for each channel $j$. A common way to determine $s_j$ is:

$$s_j = \frac{\max(|X_j|)^{\alpha}}{\max(|W_j|)^{1-\alpha}}$$

Here:

- $\max(|X_j|)$ is the maximum absolute value observed in activation channel $j$, estimated from a small calibration set.
- $\max(|W_j|)$ is the maximum absolute value of the corresponding weight channel (row $j$ of $W$).
- $\alpha$ is the migration strength, a hyperparameter between 0 and 1 that controls how much of the quantization difficulty is shifted from the activations to the weights; $\alpha = 0.5$ is a common default.
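A minimal sketch of this calculation, assuming the per-channel absolute maxima have already been collected from a calibration run (the function name and the array values below are made up for illustration):

```python
import numpy as np

def smoothquant_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-5):
    """
    Per-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha).

    act_absmax    : per-channel max |X_j| gathered from calibration batches
    weight_absmax : per-channel max |W_j| taken over the corresponding weight rows
    alpha         : migration strength (0.5 is a common default)
    """
    act_absmax = np.maximum(act_absmax, eps)
    weight_absmax = np.maximum(weight_absmax, eps)
    return act_absmax ** alpha / weight_absmax ** (1.0 - alpha)

# Hypothetical calibration statistics for a layer with 4 input channels.
act_absmax = np.array([0.8, 1.2, 95.0, 0.6])     # channel 2 carries an activation outlier
weight_absmax = np.array([0.5, 0.4, 0.3, 0.6])

s = smoothquant_scales(act_absmax, weight_absmax, alpha=0.5)
print(s)   # the outlier channel gets the largest s_j, so it is scaled down the most in X_hat
```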
Using powers of 2 for scaling factors can sometimes be beneficial for hardware efficiency, though the principle remains the same.
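For example, one simple way to snap the computed factors to the nearest power of two (an illustrative choice, not something SmoothQuant itself prescribes) is:

```python
import numpy as np

# Round each smoothing factor to the nearest power of two for hardware-friendly scaling.
s = np.array([1.26, 0.93, 17.8, 1.0])
s_pow2 = 2.0 ** np.round(np.log2(s))
print(s_pow2)   # [ 1.  1. 16.  1.]
```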
Conceptually, SmoothQuant reduces the maximum absolute value (dynamic range) of the activations while potentially increasing the range of the weights, making the activations easier to quantize.
SmoothQuant offers several advantages:

- The transformation is mathematically equivalent, so the smoothing itself introduces no error; any accuracy loss comes only from the subsequent quantization.
- It is a post-training method: it requires only a small calibration set to collect activation statistics, not retraining or fine-tuning.
- By taming activation outliers, it enables efficient W8A8 (8-bit weights and 8-bit activations) inference with standard per-tensor quantization.
- The scaling can typically be folded offline into preceding operations (such as a LayerNorm or a prior linear layer), so no extra work is added at inference time.
However, there are points to consider:

- Calibration data is needed to estimate per-channel activation ranges, and unrepresentative calibration data can produce poor scaling factors.
- The migration strength $\alpha$ is a hyperparameter; shifting too much difficulty onto the weights can make the weights themselves harder to quantize.
- The benefit is largest when activation outliers are the dominant source of quantization error; for models or layers without pronounced outliers, the gains may be modest.
Compared to other advanced PTQ methods, SmoothQuant specifically targets the interplay between activation and weight distributions. While methods like GPTQ focus on sophisticated weight quantization and AWQ protects salient weights based on activation magnitudes, SmoothQuant performs an upfront mathematical smoothing operation to make both activations and weights more amenable to standard quantization techniques afterwards. It's a valuable tool for achieving robust low-precision quantization, especially when activation outliers are a primary concern.