Post-Training Quantization (PTQ) offers two primary approaches for handling model activations: static quantization and dynamic quantization. The main difference lies in when the quantization parameters (scale factor and zero-point) for the activations are determined and applied. In both schemes, weights are typically quantized offline (before inference).
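As a concrete reference for the rest of this section, here is a minimal NumPy sketch of the affine (asymmetric) INT8 mapping that both schemes rely on. The function names and the INT8 range are illustrative choices, not any particular library's API.

```python
import numpy as np

def affine_qparams(x_min, x_max, qmin=-128, qmax=127):
    """Derive a scale and zero-point that map [x_min, x_max] onto the INT8 range."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # ensure 0.0 is exactly representable
    scale = max((x_max - x_min) / (qmax - qmin), 1e-12)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """FP32 -> INT8; values outside the representable range saturate (are clipped)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """INT8 -> approximate FP32."""
    return (q.astype(np.float32) - zero_point) * scale
```

The clipping inside `quantize` is exactly the behavior behind the outlier sensitivity discussed under static quantization below.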
Static Quantization
Static quantization fixes the quantization parameters for both the model weights and the activations before inference is performed. Since the range of activations varies with the input data, static quantization requires a calibration step.
- Calibration: A representative dataset (calibration dataset) is passed through the model in its original precision (e.g., FP32). The typical range of values (minimum and maximum) for the activations at various points in the model (e.g., outputs of specific layers) is observed and recorded.
- Parameter Calculation: Based on the observed ranges during calibration, fixed quantization parameters (scale and zero-point) are calculated for the activations at different quantization points within the model. These parameters are chosen to represent the observed activation distribution reasonably well.
- Offline Quantization: The weights are quantized and stored in a lower-precision format such as INT8, and the fixed activation parameters are stored with the model so that activations can be quantized against them at runtime.
- Inference: During inference, the model performs computations primarily using integer arithmetic, since both the weights and the activation quantization parameters are predetermined. The first layer's input is quantized using the parameters determined during calibration, and intermediate activations are kept in the quantized format between layers where possible (a code sketch of this workflow follows below).
Static quantization workflow: Calibration determines activation parameters offline, enabling efficient integer-only computations during inference.
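To make the workflow concrete, the sketch below continues the NumPy example from the introduction: a calibration pass records an activation range, fixed parameters are derived once, and inference then reuses them for an INT8 matrix multiply. The layer shape, random data, and function names are placeholders for illustration, not a real model or library API.

```python
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)                      # a layer's FP32 weights
calib_batches = [rng.normal(size=(8, 64)).astype(np.float32) for _ in range(20)]

# 1) Calibration: observe activation ranges on representative data.
obs_min = min(float(x.min()) for x in calib_batches)
obs_max = max(float(x.max()) for x in calib_batches)

# 2) Parameter calculation: fixed (static) activation qparams, plus weight qparams.
act_scale, act_zp = affine_qparams(obs_min, obs_max)
w_scale, w_zp = affine_qparams(float(W.min()), float(W.max()))

# 3) Offline quantization of the weights.
W_q = quantize(W, w_scale, w_zp)

# 4) Inference: quantize the incoming activation with the *precomputed* parameters,
#    accumulate in INT32, and rescale the integer result back to FP32.
def static_int8_linear(x_fp32):
    x_q = quantize(x_fp32, act_scale, act_zp)
    acc = (x_q.astype(np.int32) - act_zp) @ (W_q.astype(np.int32) - w_zp)
    return acc.astype(np.float32) * (act_scale * w_scale)

y = static_int8_linear(rng.normal(size=(8, 64)).astype(np.float32))   # ≈ x @ W
```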
Advantages:
- Higher Potential Performance: Because all quantization parameters are fixed beforehand, the inference can potentially run entirely using highly optimized integer arithmetic instructions on supported hardware (CPUs, GPUs, accelerators), leading to maximum speedup and energy efficiency.
- Lower Inference Overhead: No runtime cost for calculating activation ranges or quantization parameters.
Disadvantages:
- Requires Calibration Data: Needs a dataset that accurately reflects the distribution of inputs the model will see in production. Poor calibration data can lead to significant accuracy degradation.
- Sensitivity to Outliers: If runtime inputs produce activation ranges significantly different from those observed during calibration, accuracy can suffer. Outliers not captured during calibration are clipped, potentially losing important information (see the short example below).
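A tiny continuation of the same sketch shows this clipping effect: a runtime value well outside the calibrated range survives only as the saturated edge of that range. The specific numbers are arbitrary.

```python
s, zp = affine_qparams(-6.0, 6.0)                    # range observed during calibration
outlier = np.array([15.0], dtype=np.float32)         # runtime activation outside that range
recovered = dequantize(quantize(outlier, s, zp), s, zp)
print(recovered)                                     # ~6.0 — the outlier's magnitude is lost
```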
Dynamic Quantization
Dynamic quantization, sometimes loosely described as "weight-only quantization" because only the weights are stored in low precision, takes a different approach. Weights are quantized offline, as in static quantization; activations, however, are quantized dynamically, on the fly, during inference.
- Offline Weight Quantization: Model weights are converted to a lower-precision integer format (e.g., INT8) and stored.
- Inference:
  - When input data arrives, the activations are processed layer by layer.
  - For operations involving quantized weights (like matrix multiplications), the incoming floating-point activations are quantized just before the operation. Their range (min/max) is calculated dynamically based on the current batch of data.
  - The computation (e.g., INT8 weight * INT8 activation) is performed.
  - The result is often de-quantized back to floating-point (FP32) before being passed to the next operation or layer that requires floating-point inputs (like certain activation functions or normalization layers). This flow is sketched in code after the workflow summary below.
Dynamic quantization workflow: Weights are quantized offline, but activations are quantized on-the-fly during inference, adding overhead but removing the need for calibration data.
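Continuing the same NumPy sketch, the only change relative to the static version is that the activation parameters are computed from each incoming batch inside the inference call; the weight tensors (`W_q`, `w_scale`, `w_zp`) are the ones quantized offline above.

```python
def dynamic_int8_linear(x_fp32):
    # On-the-fly range estimation for *this* batch — the extra runtime cost of dynamic PTQ.
    s_x, zp_x = affine_qparams(float(x_fp32.min()), float(x_fp32.max()))
    x_q = quantize(x_fp32, s_x, zp_x)
    acc = (x_q.astype(np.int32) - zp_x) @ (W_q.astype(np.int32) - w_zp)
    return acc.astype(np.float32) * (s_x * w_scale)  # de-quantize the result to FP32

y = dynamic_int8_linear(rng.normal(size=(8, 64)).astype(np.float32))
```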
Advantages:
- No Calibration Data Needed: Simplifies the quantization process as it doesn't require a separate calibration step or dataset for activations.
- Potentially More Robust: Adapts to the specific range of activations for each input, potentially handling unexpected distributions better than static quantization if calibration was suboptimal.
- Simplicity: Easier to implement initially.
Disadvantages:
- Higher Inference Overhead: Calculating activation ranges and performing quantization/dequantization dynamically during inference adds computational cost, reducing the potential speedup compared to static quantization.
- Memory Bandwidth: Continuously reading activations in FP32, quantizing them, performing the compute, and often de-quantizing back can increase memory bandwidth usage compared to a fully static INT8 pipeline.
- Not Always Fully Integer: The mix of integer weights and dynamically quantized/de-quantized activations means computations might not leverage pure integer arithmetic paths on all hardware, potentially limiting performance gains.
Choosing Between Static and Dynamic Quantization
The choice depends on the specific requirements of your application:
- Choose Static Quantization if:
- Maximum inference speed and energy efficiency are primary goals.
- You have access to representative calibration data.
- The target hardware has efficient support for full integer arithmetic pipelines.
- You are willing to invest time in the calibration process and potentially fine-tune it.
- Choose Dynamic Quantization if:
- Simplicity and ease of implementation are prioritized.
- Obtaining good calibration data is difficult or impractical.
- Some inference overhead is acceptable.
- The primary goal is weight memory reduction, and activation compute overhead is less critical.
For Large Language Models (LLMs), static quantization is often preferred to maximize throughput and reduce latency due to their computational intensity. However, the large activation ranges and presence of outliers in LLMs make effective calibration particularly important for static quantization. Dynamic quantization can be a simpler starting point, primarily reducing the memory footprint from weights, but the computational overhead of dynamic activation quantization can be substantial for LLMs. Advanced PTQ techniques, discussed later, often build upon the static quantization framework to address its limitations.
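As an illustration of why dynamic quantization is often the simpler starting point, frameworks typically expose it as a one-call model transformation. The snippet below uses PyTorch's quantize_dynamic on a toy stand-in model; the layer sizes are arbitrary, and the exact API location may differ across PyTorch versions (newer releases also expose it under torch.ao.quantization).

```python
import torch
import torch.nn as nn

# Toy FP32 model standing in for a transformer block (sizes are arbitrary).
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)).eval()

# Quantize the Linear layers' weights to INT8 offline; their activations are
# quantized dynamically at inference time. No calibration data is needed.
dq_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = dq_model(torch.randn(1, 512))
```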