While quantizing model weights, as discussed previously, offers significant memory savings, quantizing activations introduces its own set of challenges and considerations. Activations, the intermediate outputs that one layer passes as inputs to the next, are inherently dynamic: unlike weights, which are fixed after training, activation values change with every input sample the model processes. This dynamic nature makes their quantization more complex but equally important for achieving maximum inference speedup and memory bandwidth reduction, especially on hardware with specialized low-precision computation units.
The primary difficulty in quantizing activations stems from their potentially wide and unpredictable dynamic range. Consider the outputs of activation functions like ReLU or GeLU. These non-linearities can produce values spanning several orders of magnitude, and unlike weights which often follow a somewhat predictable distribution (e.g., centered around zero), activation distributions can vary significantly depending on the input data and the specific layer in the network.
Activation ranges vary significantly across different layers within a Transformer block. Non-linear functions like ReLU or GeLU in the Feed-Forward Network (FFN) can particularly expand the dynamic range.
Quantizing a tensor with such a wide range using a low-precision format (like INT8) forces a trade-off. If you scale the quantization to accommodate the extreme maximum and minimum values (outliers), the resolution for representing the majority of values (which might be clustered in a much smaller range) becomes very coarse. This loss of precision, known as quantization error, can significantly degrade model accuracy. Conversely, if you optimize the scale for the dense cluster of values, outliers will be clipped, potentially losing important information. Activations, especially within attention mechanisms or intermediate FFN layers, can be particularly sensitive to this clipping or loss of resolution.
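To make this trade-off concrete, the following small sketch (with made-up activation values and an illustrative int8_roundtrip helper) compares quantizing a tensor that contains a single large outlier over its full observed range versus over a clipped range:

import torch

# Hypothetical activations: most values near zero, one large outlier
acts = torch.tensor([0.01, -0.02, 0.03, 0.05, -0.04, 12.0])

def int8_roundtrip(x, lo, hi):
    # Affine INT8 quantize-dequantize over the range [lo, hi]
    scale = (hi - lo) / 255.0
    zero_point = round(-lo / scale) - 128
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
    return (q - zero_point) * scale

# Scaling to the full range keeps the outlier but makes the step size coarse
full = int8_roundtrip(acts, acts.min().item(), acts.max().item())
# Clipping the range to [-0.05, 0.05] gives fine resolution but saturates 12.0
clipped = int8_roundtrip(acts, -0.05, 0.05)

print("full-range error on small values:", (full[:5] - acts[:5]).abs().max())
print("clipped error on small values:   ", (clipped[:5] - acts[:5]).abs().max())

With the full range, the quantization step is roughly 0.047, so the values clustered between -0.05 and 0.05 are barely distinguishable after the round trip; with the clipped range they are reproduced almost exactly, but the outlier saturates at about 0.05.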
To effectively map the floating-point range of activations to a lower-precision integer range, we need to determine appropriate quantization parameters: a scale factor ($s$) and a zero-point ($z$). The general mapping is:

$$\text{quantized\_value} = \operatorname{round}\left(\frac{\text{float\_value}}{s}\right) + z$$

The process of finding optimal $s$ and $z$ values is called calibration. It typically involves feeding a representative dataset (a subset of the training or validation data, often a few hundred to a few thousand samples) through the model and observing the range of activation values at different points in the network. Several calibration techniques exist:
Min-Max Calibration: This is the simplest method. Record the minimum ($x_{min}$) and maximum ($x_{max}$) activation values observed during calibration. The scale $s$ and zero-point $z$ are then calculated to map this observed range $[x_{min}, x_{max}]$ to the target integer range (e.g., $[-128, 127]$ for signed INT8).
Mean Squared Error (MSE) Calibration: This approach iterates through different potential clipping ranges (thresholds) within the observed min-max range. For each threshold, it calculates the quantization parameters and measures the average squared error between the original floating-point activations and their quantized-dequantized equivalents. The threshold that minimizes this MSE is chosen.
Entropy (KL Divergence) Calibration: This method aims to minimize the information loss during quantization. It selects quantization parameters (s and z) such that the Kullback-Leibler (KL) divergence between the distribution of the original floating-point activations and the distribution of the quantized-dequantized activations is minimized.
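As a rough illustration of the MSE approach, the sketch below sweeps symmetric clipping thresholds and keeps the one that minimizes the quantization error. The function names and the simple linear threshold sweep are illustrative, not a specific library's implementation:

import torch

def quantize_dequantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Simulated INT8 round trip used to measure quantization error
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def mse_calibrate(x, num_candidates=100, qmin=-128, qmax=127):
    # Sweep symmetric clipping thresholds and keep the one with lowest MSE
    max_abs = x.abs().max()
    best_mse, best_scale, best_zp = float('inf'), None, None
    for i in range(1, num_candidates + 1):
        threshold = max_abs * i / num_candidates
        scale = (2 * threshold) / float(qmax - qmin)
        zero_point = 0  # symmetric quantization for simplicity
        x_dq = quantize_dequantize(x, scale, zero_point, qmin, qmax)
        mse = torch.mean((x - x_dq) ** 2).item()
        if mse < best_mse:
            best_mse, best_scale, best_zp = mse, scale, zero_point
    return best_scale, best_zp

# Example: a heavy-tailed activation distribution
acts = torch.cat([torch.randn(10000) * 0.1, torch.randn(10) * 5.0])
scale, zero_point = mse_calibrate(acts)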
Here's a PyTorch-style illustration of how observers are used during calibration for a specific activation tensor:
import torch

# Assume 'activation_tensor' holds the activations from a specific layer
# during a forward pass with calibration data.

# --- Calibration Phase ---
# An observer object tracks statistics of the activations it sees.
# Example: MinMaxObserver
class MinMaxObserver:
    def __init__(self):
        self.min_val = torch.tensor(float('inf'))
        self.max_val = torch.tensor(float('-inf'))

    def forward(self, x):
        # Detach the tensor to avoid tracking gradients during observation
        x_detached = x.detach()
        self.min_val = torch.minimum(x_detached.min(), self.min_val)
        self.max_val = torch.maximum(x_detached.max(), self.max_val)
        return x  # Pass the input through unmodified during calibration

    def calculate_qparams(self):
        # Determine scale and zero_point based on the observed min/max.
        # Target is signed INT8 (torch.qint8), i.e. the range [-128, 127].
        qmin, qmax = -128, 127
        scale = (self.max_val - self.min_val) / float(qmax - qmin)
        # Guard against a degenerate (constant) activation range
        if scale == 0.0:
            scale = torch.tensor(1e-8)
        zero_point = qmin - torch.round(self.min_val / scale)
        # Clamp the zero-point to the valid integer range
        zero_point = torch.clamp(zero_point, qmin, qmax).to(torch.int)
        return scale, zero_point

# In your model's forward pass during calibration:
# observer = MinMaxObserver()  # Or another observer type (MSE, entropy based)
# ... layer computation ...
# activations = some_layer(input)
# activations = observer(activations)  # Observer records the activation range
# ... rest of forward pass ...

# After running the calibration data through the model:
# scale, zero_point = observer.calculate_qparams()
# print(f"Calculated Scale: {scale}, Zero-Point: {zero_point}")

# --- Inference Phase ---
# Use the calculated scale and zero_point to quantize activations:
# quantized_activations = torch.quantize_per_tensor(
#     activation_tensor, float(scale), int(zero_point), torch.qint8
# )
Just like with weights, activations can be quantized at different granularities: per-tensor, where a single scale and zero-point cover the entire activation tensor, or finer-grained schemes such as per-channel or per-token quantization, where separate parameters are computed along a chosen dimension.
The choice depends on the specific layer type, the observed distribution of activations, and the acceptable performance overhead. Per-tensor is common due to its simplicity and efficiency, but finer-grained quantization might be necessary to recover accuracy in sensitive layers.
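To illustrate the difference, the sketch below compares per-tensor and per-token (per-row) scales on a hypothetical activation matrix whose rows have very different magnitudes; the shapes and values are made up for demonstration:

import torch

# Hypothetical activation tensor: (batch * sequence, hidden_dim)
acts = torch.randn(16, 768) * torch.linspace(0.1, 5.0, 16).unsqueeze(1)

# Per-tensor: one symmetric INT8 scale for the whole tensor
per_tensor_scale = acts.abs().max() / 127.0

# Per-token: one scale per row, adapting to each token's own range
per_token_scale = acts.abs().amax(dim=1, keepdim=True) / 127.0

q_per_tensor = torch.clamp(torch.round(acts / per_tensor_scale), -127, 127)
q_per_token = torch.clamp(torch.round(acts / per_token_scale), -127, 127)

# Mean absolute reconstruction error per row
err_tensor = (q_per_tensor * per_tensor_scale - acts).abs().mean(dim=1)
err_token = (q_per_token * per_token_scale - acts).abs().mean(dim=1)

Rows with small magnitudes retain far more resolution under per-token scaling, at the cost of computing and storing one scale per row.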
As mentioned, outliers heavily influence Min-Max calibration and can negatively impact MSE/Entropy methods too. A common technique to mitigate this is clipping. Before calculating the quantization parameters, the observed activation range is clipped based on percentiles. For instance, instead of using the absolute minimum and maximum, one might use the 1st and 99th percentile values, or the 0.1th and 99.9th percentile values.
Clipping removes extreme outliers before calculating quantization parameters, potentially improving resolution for the majority of values at the cost of saturating the clipped outliers.
Clipping helps focus the quantization range on the bulk of the data, improving precision for typical values. However, it introduces saturation for the clipped outliers, which might be detrimental if these outliers carry significant information. Choosing the right clipping threshold often requires empirical tuning.
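A percentile-based range can be computed directly with torch.quantile; the helper below is a sketch, and the 0.1th/99.9th percentile choices are just example values:

import torch

def percentile_range(x, lower_pct=0.001, upper_pct=0.999):
    # Clip the observed range to the chosen percentiles before computing
    # scale/zero-point, instead of using the absolute min and max.
    flat = x.detach().flatten().float()
    lo = torch.quantile(flat, lower_pct)
    hi = torch.quantile(flat, upper_pct)
    return lo, hi

# Example usage with the MinMaxObserver idea from earlier:
# lo, hi = percentile_range(activation_tensor)
# scale = (hi - lo) / 255.0
# zero_point = int(torch.clamp(-128 - torch.round(lo / scale), -128, 127))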
Activation quantization parameters can be determined either offline (static quantization), where the scale and zero-point are fixed ahead of time using a calibration dataset and reused for every input, or on-the-fly (dynamic quantization), where ranges are computed from the actual activation values at inference time, trading some runtime overhead for parameters that adapt to each input.
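For the dynamic case, PyTorch provides a built-in helper that quantizes weights ahead of time and computes activation ranges at runtime. The toy model below is only a stand-in for a real network:

import torch
import torch.nn as nn

# A small float model standing in for a Transformer sub-block
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))

# Dynamic quantization: Linear weights become INT8, while activation
# scale/zero-point are computed on the fly from each batch at inference time.
dynamic_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = dynamic_model(torch.randn(4, 128))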
Quantization-Aware Training directly incorporates the simulation of quantization effects (for both weights and activations) into the training loop. It uses "fake quantization" nodes that simulate the process of quantizing and dequantizing activations during the forward pass, while allowing gradients to flow through during the backward pass.
import torch
import torch.nn as nn
import torch.ao.quantization as quant

class QuantizableLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 256)
        self.relu = nn.ReLU()
        # QAT uses observers during training which double as fake quantizers.
        # These stubs mark where activation quantization simulation happens.
        self.activation_quant_stub = quant.QuantStub()
        self.activation_dequant_stub = quant.DeQuantStub()

    def forward(self, x):
        # Simulate quantization of the input activation to this layer
        x = self.activation_quant_stub(x)
        x = self.linear(x)
        x = self.relu(x)
        # Dequantize the output activation before passing it to the next
        # (potentially floating-point) layer
        x = self.activation_dequant_stub(x)
        return x

# During QAT, these stubs (along with weight fake quantization)
# simulate quantization errors, allowing the model to adapt.
# After QAT, the model can be converted to a true quantized model
# using the statistics gathered by the observers within the stubs.
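For completeness, here is a sketch of the surrounding eager-mode QAT workflow using the QuantizableLayer above. The 'fbgemm' backend string is one common choice, the training loop is omitted, and real models typically also fuse modules before preparation:

import torch
import torch.ao.quantization as quant

model = QuantizableLayer()
model.train()

# Attach a QAT qconfig that inserts fake-quantize modules for weights
# and activations ('fbgemm' targets x86 server backends).
model.qconfig = quant.get_default_qat_qconfig('fbgemm')
quant.prepare_qat(model, inplace=True)

# ... run the usual training loop here; forward passes now simulate
# quantization noise for both weights and activations ...

# After training, convert to a true INT8 model using the gathered statistics
model.eval()
quantized_model = quant.convert(model)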
By exposing the model to quantization noise during training, QAT allows the network to adapt its weights and activation distributions to become more robust to quantization errors. This often yields significantly better accuracy than Post-Training Quantization (PTQ), especially when targeting very low bit-widths (like INT4) or when dealing with highly sensitive activation distributions.
Quantizing activations is essential for maximizing the benefits of model compression, particularly for latency reduction on compatible hardware. However, it requires careful handling due to the dynamic range of activations. Key considerations include choosing a calibration method (min-max, MSE, or entropy based), handling outliers through percentile clipping, selecting an appropriate granularity (per-tensor versus finer-grained), deciding between static and dynamic parameter determination, and turning to quantization-aware training when post-training approaches lose too much accuracy.
Successfully navigating these considerations allows for significant reductions in memory bandwidth and potential computational speedups, making large models more practical for deployment.