While quantizing model weights, as discussed previously, offers significant memory savings, quantizing activations introduces its own set of challenges and considerations. Activations, the intermediate outputs that one layer passes as inputs to the next, are inherently dynamic: unlike weights, which are fixed after training, activation values change with every input sample the model processes. This dynamic nature makes their quantization more complex but equally important for achieving maximum inference speedup and memory bandwidth reduction, especially on hardware with specialized low-precision computation units.
The primary difficulty in quantizing activations stems from their potentially wide and unpredictable dynamic range. Consider the outputs of activation functions like ReLU or GeLU. These non-linearities can produce values spanning several orders of magnitude, and unlike weights which often follow a somewhat predictable distribution (e.g., centered around zero), activation distributions can vary significantly depending on the input data and the specific layer in the network.
Activation ranges vary significantly across different layers within a Transformer block. Non-linear functions like ReLU or GeLU in the Feed-Forward Network (FFN) can particularly expand the dynamic range.
Quantizing a tensor with such a wide range using a low-precision format (like INT8) forces a trade-off. If you scale the quantization to accommodate the extreme maximum and minimum values (outliers), the resolution for representing the majority of values (which might be clustered in a much smaller range) becomes very coarse. This loss of precision, known as quantization error, can significantly degrade model accuracy. Conversely, if you optimize the scale for the dense cluster of values, outliers will be clipped, potentially losing important information. Activations, especially within attention mechanisms or intermediate FFN layers, can be particularly sensitive to this clipping or loss of resolution.
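To make this trade-off concrete, the following small sketch (with made-up activation values and an illustrative int8_roundtrip helper) compares quantizing a tensor that contains a single large outlier over its full observed range versus over a clipped range:

import torch

# Hypothetical activations: most values near zero, one large outlier
acts = torch.tensor([0.01, -0.02, 0.03, 0.05, -0.04, 12.0])

def int8_roundtrip(x, lo, hi):
    # Affine INT8 quantize-dequantize over the range [lo, hi]
    scale = (hi - lo) / 255.0
    zero_point = round(-lo / scale) - 128
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
    return (q - zero_point) * scale

# Scaling to the full range keeps the outlier but makes the step size coarse
full = int8_roundtrip(acts, acts.min().item(), acts.max().item())
# Clipping the range to [-0.05, 0.05] gives fine resolution but saturates 12.0
clipped = int8_roundtrip(acts, -0.05, 0.05)

print("full-range error on small values:", (full[:5] - acts[:5]).abs().max())
print("clipped error on small values:   ", (clipped[:5] - acts[:5]).abs().max())

With the full range, the quantization step is roughly 0.047, so the values clustered between -0.05 and 0.05 are barely distinguishable after the round trip; with the clipped range they are reproduced almost exactly, but the outlier saturates at about 0.05.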
To effectively map the floating-point range of activations to a lower-precision integer range, we need to determine appropriate quantization parameters: a scale factor ($s$) and a zero-point ($z$). The general mapping is:

$$\text{quantized\_value} = \operatorname{round}\left(\frac{\text{float\_value}}{s}\right) + z$$

The process of finding optimal $s$ and $z$ values is called calibration. It typically involves feeding a representative dataset (a subset of the training or validation data, often a few hundred to a few thousand samples) through the model and observing the range of activation values at different points in the network. Several calibration techniques exist:
Min-Max Calibration: This is the simplest method. Record the minimum ($x_{min}$) and maximum ($x_{max}$) activation values observed during calibration. The scale $s$ and zero-point $z$ are then calculated to map this observed range $[x_{min}, x_{max}]$ to the target integer range (e.g., $[-128, 127]$ for signed INT8).
Mean Squared Error (MSE) Calibration: This approach iterates through different potential clipping ranges (thresholds) within the observed min-max range. For each threshold, it calculates the quantization parameters and measures the average squared error between the original floating-point activations and their quantized-dequantized equivalents. The threshold that minimizes this MSE is chosen.
Entropy (KL Divergence) Calibration: This method aims to minimize the information loss during quantization. It selects quantization parameters (s and z) such that the Kullback-Leibler (KL) divergence between the distribution of the original floating-point activations and the distribution of the quantized-dequantized activations is minimized.
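As a rough illustration of the MSE approach, the sketch below sweeps symmetric clipping thresholds and keeps the one that minimizes the quantization error. The function names and the simple linear threshold sweep are illustrative, not a specific library's implementation:

import torch

def quantize_dequantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Simulated INT8 round trip used to measure quantization error
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def mse_calibrate(x, num_candidates=100, qmin=-128, qmax=127):
    # Sweep symmetric clipping thresholds and keep the one with lowest MSE
    max_abs = x.abs().max()
    best_mse, best_scale, best_zp = float('inf'), None, None
    for i in range(1, num_candidates + 1):
        threshold = max_abs * i / num_candidates
        scale = (2 * threshold) / float(qmax - qmin)
        zero_point = 0  # symmetric quantization for simplicity
        x_dq = quantize_dequantize(x, scale, zero_point, qmin, qmax)
        mse = torch.mean((x - x_dq) ** 2).item()
        if mse < best_mse:
            best_mse, best_scale, best_zp = mse, scale, zero_point
    return best_scale, best_zp

# Example: a heavy-tailed activation distribution
acts = torch.cat([torch.randn(10000) * 0.1, torch.randn(10) * 5.0])
scale, zero_point = mse_calibrate(acts)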
Here's a PyTorch-style illustration of how observers are used during calibration for a specific activation tensor:
import torch

# Assume 'activation_tensor' holds the activations from a specific layer
# during a forward pass with calibration data.

# --- Calibration Phase ---
# An observer object tracks statistics of the activations it sees.
# Example: MinMaxObserver
class MinMaxObserver:
    def __init__(self):
        self.min_val = torch.tensor(float('inf'))
        self.max_val = torch.tensor(float('-inf'))

    def forward(self, x):
        # Detach the tensor to avoid tracking gradients during observation
        x_detached = x.detach()
        self.min_val = torch.minimum(x_detached.min(), self.min_val)
        self.max_val = torch.maximum(x_detached.max(), self.max_val)
        return x  # Pass the input through unmodified during calibration

    def calculate_qparams(self):
        # Determine scale and zero_point based on the observed min/max.
        # Target is signed INT8 (torch.qint8), i.e. the range [-128, 127].
        qmin, qmax = -128, 127
        scale = (self.max_val - self.min_val) / float(qmax - qmin)
        # Guard against a degenerate (constant) activation range
        if scale == 0.0:
            scale = torch.tensor(1e-8)
        zero_point = qmin - torch.round(self.min_val / scale)
        # Clamp the zero-point to the valid integer range
        zero_point = torch.clamp(zero_point, qmin, qmax).to(torch.int)
        return scale, zero_point

# In your model's forward pass during calibration:
# observer = MinMaxObserver()  # Or another observer type (MSE, entropy based)
# ... layer computation ...
# activations = some_layer(input)
# activations = observer(activations)  # Observer records the activation range
# ... rest of forward pass ...

# After running the calibration data through the model:
# scale, zero_point = observer.calculate_qparams()
# print(f"Calculated Scale: {scale}, Zero-Point: {zero_point}")

# --- Inference Phase ---
# Use the calculated scale and zero_point to quantize activations:
# quantized_activations = torch.quantize_per_tensor(
#     activation_tensor, float(scale), int(zero_point), torch.qint8
# )
Just like with weights, activations can be quantized at different granularities: per-tensor, where a single scale and zero-point cover the entire activation tensor, or finer-grained schemes such as per-channel or per-token quantization, where separate parameters are computed along a chosen dimension.
The choice depends on the specific layer type, the observed distribution of activations, and the acceptable performance overhead. Per-tensor is common due to its simplicity and efficiency, but finer-grained quantization might be necessary to recover accuracy in sensitive layers.
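To illustrate the difference, the sketch below compares per-tensor and per-token (per-row) scales on a hypothetical activation matrix whose rows have very different magnitudes; the shapes and values are made up for demonstration:

import torch

# Hypothetical activation tensor: (batch * sequence, hidden_dim)
acts = torch.randn(16, 768) * torch.linspace(0.1, 5.0, 16).unsqueeze(1)

# Per-tensor: one symmetric INT8 scale for the whole tensor
per_tensor_scale = acts.abs().max() / 127.0

# Per-token: one scale per row, adapting to each token's own range
per_token_scale = acts.abs().amax(dim=1, keepdim=True) / 127.0

q_per_tensor = torch.clamp(torch.round(acts / per_tensor_scale), -127, 127)
q_per_token = torch.clamp(torch.round(acts / per_token_scale), -127, 127)

# Mean absolute reconstruction error per row
err_tensor = (q_per_tensor * per_tensor_scale - acts).abs().mean(dim=1)
err_token = (q_per_token * per_token_scale - acts).abs().mean(dim=1)

Rows with small magnitudes retain far more resolution under per-token scaling, at the cost of computing and storing one scale per row.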
As mentioned, outliers heavily influence Min-Max calibration and can negatively impact MSE/Entropy methods too. A common technique to mitigate this is clipping. Before calculating the quantization parameters, the observed activation range is clipped based on percentiles. For instance, instead of using the absolute minimum and maximum, one might use the 1st and 99th percentile values, or the 0.1th and 99.9th percentile values.
Clipping removes extreme outliers before calculating quantization parameters, potentially improving resolution for the majority of values at the cost of saturating the clipped outliers.
Clipping helps focus the quantization range on the bulk of the data, improving precision for typical values. However, it introduces saturation for the clipped outliers, which might be detrimental if these outliers carry significant information. Choosing the right clipping threshold often requires empirical tuning.
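A percentile-based range can be computed directly with torch.quantile; the helper below is a sketch, and the 0.1th/99.9th percentile choices are just example values:

import torch

def percentile_range(x, lower_pct=0.001, upper_pct=0.999):
    # Clip the observed range to the chosen percentiles before computing
    # scale/zero-point, instead of using the absolute min and max.
    flat = x.detach().flatten().float()
    lo = torch.quantile(flat, lower_pct)
    hi = torch.quantile(flat, upper_pct)
    return lo, hi

# Example usage with the MinMaxObserver idea from earlier:
# lo, hi = percentile_range(activation_tensor)
# scale = (hi - lo) / 255.0
# zero_point = int(torch.clamp(-128 - torch.round(lo / scale), -128, 127))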
Activation quantization parameters can be determined either offline (static quantization), where the scale and zero-point are fixed ahead of time using a calibration dataset and reused for every input, or on-the-fly (dynamic quantization), where ranges are computed from the actual activation values at inference time, trading some runtime overhead for parameters that adapt to each input.
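For the dynamic case, PyTorch provides a built-in helper that quantizes weights ahead of time and computes activation ranges at runtime. The toy model below is only a stand-in for a real network:

import torch
import torch.nn as nn

# A small float model standing in for a Transformer sub-block
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))

# Dynamic quantization: Linear weights become INT8, while activation
# scale/zero-point are computed on the fly from each batch at inference time.
dynamic_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = dynamic_model(torch.randn(4, 128))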
Quantization-Aware Training directly incorporates the simulation of quantization effects (for both weights and activations) into the training loop. It uses "fake quantization" nodes that simulate the process of quantizing and dequantizing activations during the forward pass, while allowing gradients to flow through during the backward pass.
import torch
import torch.nn as nn
import torch.ao.quantization as quant

class QuantizableLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 256)
        self.relu = nn.ReLU()
        # QAT uses observers during training which double as fake quantizers.
        # These stubs mark where activation quantization simulation happens.
        self.activation_quant_stub = quant.QuantStub()
        self.activation_dequant_stub = quant.DeQuantStub()

    def forward(self, x):
        # Simulate quantization of the input activation to this layer
        x = self.activation_quant_stub(x)
        x = self.linear(x)
        x = self.relu(x)
        # Dequantize the output activation before passing it to the next
        # (potentially floating-point) layer
        x = self.activation_dequant_stub(x)
        return x

# During QAT, these stubs (along with weight fake quantization)
# simulate quantization errors, allowing the model to adapt.
# After QAT, the model can be converted to a true quantized model
# using the statistics gathered by the observers within the stubs.
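For completeness, here is a sketch of the surrounding eager-mode QAT workflow using the QuantizableLayer above. The 'fbgemm' backend string is one common choice, the training loop is omitted, and real models typically also fuse modules before preparation:

import torch
import torch.ao.quantization as quant

model = QuantizableLayer()
model.train()

# Attach a QAT qconfig that inserts fake-quantize modules for weights
# and activations ('fbgemm' targets x86 server backends).
model.qconfig = quant.get_default_qat_qconfig('fbgemm')
quant.prepare_qat(model, inplace=True)

# ... run the usual training loop here; forward passes now simulate
# quantization noise for both weights and activations ...

# After training, convert to a true INT8 model using the gathered statistics
model.eval()
quantized_model = quant.convert(model)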
By exposing the model to quantization noise during training, QAT allows the network to adapt its weights and activation distributions to become more robust to quantization errors. This often yields significantly better accuracy than Post-Training Quantization (PTQ), especially when targeting very low bit-widths (like INT4) or when dealing with highly sensitive activation distributions.
Quantizing activations is essential for maximizing the benefits of model compression, particularly for latency reduction on compatible hardware. However, it requires careful handling due to the dynamic range of activations. Key considerations include choosing a calibration method (min-max, MSE, or entropy based), handling outliers through percentile clipping, selecting an appropriate granularity (per-tensor versus finer-grained), deciding between static and dynamic parameter determination, and turning to quantization-aware training when post-training approaches lose too much accuracy.
Successfully navigating these considerations allows for significant reductions in memory bandwidth and potential computational speedups, making large models more practical for deployment.