Once you have a trained model and a representative calibration dataset, the core task in Post-Training Quantization (PTQ) is determining the quantization parameters: the scale factor (s) and the zero-point (z). These parameters define the mapping from the floating-point range observed during calibration to the target integer range (e.g., INT8, typically [-128, 127]). Several algorithms exist to calculate these parameters, each with its own trade-offs between simplicity, robustness to outliers, and potential accuracy impact. Let's explore the most common ones.
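To make the role of these two parameters concrete, here is a minimal sketch of the affine quantize/de-quantize mapping they define, assuming the common convention q = clip(round(x / s) + z, Qmin, Qmax). The function names and NumPy implementation are illustrative, not tied to any particular library:

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map float values onto the integer grid defined by (scale, zero_point)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map integers back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale

# Example: round-tripping a few values loses a little precision.
x = np.array([-0.5, 0.0, 0.3, 1.2], dtype=np.float32)
q = quantize(x, scale=0.00784, zero_point=-64)
x_hat = dequantize(q, scale=0.00784, zero_point=-64)  # close to x, not identical
```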
MinMax Quantization
This is the simplest and most straightforward approach. It uses the absolute minimum and maximum values observed in the calibration data for a given tensor (or quantization group) to determine the range.
How it works:
- During calibration, track the minimum (xmin) and maximum (xmax) values encountered for the tensor being quantized.
- Use these values directly to compute the scale and zero-point.
For asymmetric quantization, where the zero-point can be any integer within the target range, the formulas are:
s = (xmax − xmin) / (Qmax − Qmin)
z = round(Qmax − xmax / s)
Here, Qmin and Qmax represent the minimum and maximum values of the target integer range (e.g., -128 and 127 for signed INT8). The zero-point z is essentially the integer value that corresponds to the floating-point value 0.0.
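For example, suppose calibration observed xmin = −0.5 and xmax = 1.5 for signed INT8 (Qmin = −128, Qmax = 127). Then s = (1.5 − (−0.5)) / (127 − (−128)) = 2.0 / 255 ≈ 0.00784, and z = round(127 − 1.5 / s) = round(127 − 191.25) = −64, so the floating-point value 0.0 is represented by the integer −64.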
For symmetric quantization, the range is centered around zero, forcing the zero-point z to be 0 (for signed integers). The scale is determined by the maximum absolute value observed:
s = max(|xmin|, |xmax|) / Qmax
z = 0
(Here, Qmax would be 127 for signed INT8, mapping the range [−max(|xmin|, |xmax|), +max(|xmin|, |xmax|)] to [−127, 127].)
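Put together, a rough NumPy sketch of MinMax calibration for a single tensor might look like the following (ignoring degenerate cases such as a constant tensor, where the observed range collapses to zero):

```python
import numpy as np

def minmax_qparams(x, qmin=-128, qmax=127, symmetric=False):
    """Compute (scale, zero_point) from the observed min/max of a tensor."""
    x_min, x_max = float(x.min()), float(x.max())
    if symmetric:
        # Symmetric: cover [-max_abs, +max_abs]; the zero-point is fixed at 0.
        max_abs = max(abs(x_min), abs(x_max))
        scale = max_abs / qmax
        zero_point = 0
    else:
        # Asymmetric: cover the full observed range; the zero-point is the
        # integer that represents the float value 0.0.
        scale = (x_max - x_min) / (qmax - qmin)
        zero_point = int(round(qmax - x_max / scale))
    return scale, zero_point

# Example calibration tensor (stand-in for observed activations or weights).
calib = np.random.randn(10_000).astype(np.float32)
print(minmax_qparams(calib))                  # asymmetric scale and zero-point
print(minmax_qparams(calib, symmetric=True))  # symmetric scale, zero_point == 0
```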
Pros:
- Very simple to implement and understand.
- Guarantees that all observed calibration values fall within the representable range after quantization (no clipping based on calibration data).
Cons:
- Highly sensitive to outliers. A single extreme value, even if rare, can dramatically expand the required range (xmax − xmin or max(|xmin|, |xmax|)). This results in a larger scale factor s, meaning fewer integer values are available to represent the bulk of the data distribution, potentially leading to significant quantization error and accuracy degradation.
Figure: histogram of a distribution with and without an outlier. MinMax quantization uses the full range including the outlier (e.g., up to 1.5), which may reduce precision for the more frequent values between 0.1 and 0.3.
Percentile Quantization
To mitigate the sensitivity of MinMax to outliers, Percentile quantization uses values from the tails of the distribution, but not the absolute extremes.
How it works:
- Collect the distribution of values during calibration (often as a histogram).
- Choose lower and upper percentile thresholds (e.g., 0.1% and 99.9%, or 1% and 99%).
- Determine the floating-point values corresponding to these percentiles; call them p_low and p_high.
- Use p_low and p_high instead of xmin and xmax in the scale and zero-point calculation formulas (either symmetric or asymmetric).
- Any original floating-point values falling outside the [p_low, p_high] range will be clamped (or saturated) to the minimum or maximum representable integer value during quantization (see the sketch below).
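A rough sketch of percentile calibration, continuing in NumPy (the 0.1% / 99.9% pair below is just one illustrative choice of thresholds):

```python
import numpy as np

def percentile_qparams(x, lower_pct=0.1, upper_pct=99.9, qmin=-128, qmax=127):
    """Compute (scale, zero_point) from a percentile-clipped range instead of min/max."""
    p_low, p_high = np.percentile(x, [lower_pct, upper_pct])
    scale = (p_high - p_low) / (qmax - qmin)
    zero_point = int(round(qmax - p_high / scale))
    return float(scale), zero_point

# Values outside [p_low, p_high] saturate to qmin/qmax when quantized with these
# parameters, trading a little clipping error on outliers for finer resolution
# on the bulk of the distribution.
calib = np.random.randn(10_000).astype(np.float32)
calib[0] = 50.0  # inject an outlier; MinMax would stretch the range to cover it
print(percentile_qparams(calib))
```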
Pros:
- Much more robust to outliers than MinMax. By ignoring the most extreme values, it often results in a smaller scale factor s and better precision for the majority of the data distribution.
- Relatively simple to implement, requiring only histogram collection and percentile calculation.
Cons:
- Introduces saturation error for the outlier values that are clipped. If these outliers carry significant information, clipping them might negatively impact accuracy.
- Requires choosing appropriate percentile values, which becomes a hyperparameter to tune. The optimal percentiles might vary between different layers or models.
Entropy (KL Divergence) Quantization
This method takes a more information-theoretic approach. It aims to find a quantization range (defined by a clipping threshold) that minimizes the information loss between the original floating-point distribution and the quantized distribution. The most common metric used to measure this information loss is the Kullback-Leibler (KL) divergence.
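For discrete distributions such as normalized histograms, the divergence is defined as:
D_KL(P || Q) = Σ_i P(i) · log(P(i) / Q(i))
A lower value means the quantized-and-dequantized distribution Q is a closer match to the original distribution P; a value of 0 would mean no information loss at all.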
How it works:
- Collect a histogram of the activation values from the calibration dataset. This represents the original probability distribution P.
- Iterate through different possible clipping thresholds. For each threshold:
a. Define a quantization range based on the threshold (typically symmetrically, [−threshold,+threshold]).
b. Quantize the values within the threshold using this range. Values outside the threshold are saturated to the min/max quantized value.
c. Create a new histogram representing the distribution Q after quantizing and then de-quantizing the values (mapping them back to floating-point approximations). This distribution Q will differ from P due to quantization errors and saturation.
d. Calculate the KL divergence D_KL(P || Q) between the original distribution P and the quantized distribution Q.
- Select the threshold that resulted in the minimum KL divergence. This threshold defines the optimal range for quantization according to this criterion.
- Calculate the final scale (s) and zero-point (z) based on the selected threshold, similar to symmetric MinMax but using the optimal threshold instead of the absolute maximum. A simplified sketch of this search appears below.
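The following is a deliberately simplified sketch of the threshold search, again in plain NumPy. Real implementations add refinements such as histogram smoothing, careful handling of empty bins, and smarter re-binning; the bin counts and function names here are illustrative only.

```python
import numpy as np

def _kl_divergence(p, q):
    """KL(P || Q) for unnormalized histograms; inf if Q lacks mass where P has some."""
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def kl_threshold(x, num_bins=2048, num_quant_levels=128, start_bin=128):
    """Search for the symmetric clipping threshold that minimizes D_KL(P || Q)."""
    max_abs = float(np.abs(x).max())
    hist, edges = np.histogram(np.abs(x), bins=num_bins, range=(0.0, max_abs))
    hist = hist.astype(np.float64)

    best_kl, best_threshold = np.inf, max_abs
    for i in range(start_bin, num_bins + 1):
        # Reference distribution P: bins up to the candidate threshold, with
        # all the mass beyond it folded into the last bin (saturation).
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()

        # Candidate distribution Q: collapse those i bins into num_quant_levels
        # coarse bins (simulating INT8 quantization), then expand back, spreading
        # each coarse bin's mass over the positions that were originally non-empty.
        q_parts = []
        for chunk in np.array_split(hist[:i], num_quant_levels):
            nonzero = chunk > 0
            filled = np.zeros_like(chunk)
            if nonzero.any():
                filled[nonzero] = chunk.sum() / nonzero.sum()
            q_parts.append(filled)
        q = np.concatenate(q_parts)

        if p.sum() == 0 or q.sum() == 0:
            continue
        kl = _kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_threshold = kl, float(edges[i])

    return best_threshold  # e.g., scale = best_threshold / 127 for signed INT8

# Usage: heavy-tailed activations benefit most from a clipped threshold.
calib = np.random.randn(100_000).astype(np.float32) * 0.2
scale = kl_threshold(calib) / 127
```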
Pros:
- Often yields the best accuracy among common PTQ algorithms, especially for non-uniform or skewed distributions, as it directly tries to preserve the overall distribution shape.
- Provides a more principled way to handle the trade-off between saturation (clipping outliers) and quantization error (precision for inliers).
Cons:
- Computationally more expensive during calibration: it requires building histograms, iterating over many candidate thresholds, and computing a KL divergence for each.
- The result can be sensitive to the number of bins used in the histogram representation of the distributions.
Choosing an Algorithm
The best PTQ algorithm often depends on the specific circumstances:
- Weights: Weight distributions are often relatively stable and somewhat symmetric around zero. MinMax or Percentile methods (often symmetric) are commonly used and perform reasonably well.
- Activations: Activation distributions can vary significantly depending on the input data and the layer type. They are often asymmetric and can have significant outliers. Entropy (KL divergence) or Percentile methods are frequently preferred for activations to better handle these characteristics. Asymmetric quantization schemes are also more common for activations.
- Performance vs. Accuracy: MinMax is fastest during calibration but most prone to accuracy loss from outliers. Entropy is slowest during calibration but often yields the highest accuracy. Percentile offers a balance between the two.
In practice, libraries like Hugging Face Optimum or PyTorch's quantization modules often provide implementations of these algorithms, allowing you to experiment and choose the one that best meets your accuracy and performance requirements for a specific model and task. You might even mix different strategies for weights and activations within the same model.
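As one concrete illustration, PyTorch's observer classes map roughly onto the ideas above: MinMaxObserver tracks the raw observed range, while HistogramObserver searches a histogram for a clipping range that reduces quantization error. The snippet below is a hedged sketch; exact module paths and defaults vary across PyTorch versions.

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver, HistogramObserver

# Stand-in for calibration activations captured from a few batches.
calib_batches = [torch.randn(32, 128) for _ in range(10)]

# MinMax-style observer: scale/zero-point from the raw observed range.
minmax_obs = MinMaxObserver(dtype=torch.qint8, qscheme=torch.per_tensor_affine)

# Histogram-based observer: searches for a clipping range that reduces
# quantization error, in the spirit of the percentile/entropy methods above.
hist_obs = HistogramObserver(dtype=torch.qint8, qscheme=torch.per_tensor_affine)

for batch in calib_batches:
    minmax_obs(batch)
    hist_obs(batch)

print(minmax_obs.calculate_qparams())  # (scale, zero_point) tensors
print(hist_obs.calculate_qparams())
```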