While techniques like pruning reduce the number of parameters, quantization targets the size of each individual parameter and activation value. Standard deep learning models typically perform computations using 32-bit floating-point numbers (FP32). Quantization is the process of converting these FP32 values into lower-precision representations, such as 16-bit floating-point (FP16) or, more commonly for significant efficiency gains, 8-bit integers (INT8).
This reduction in numerical precision translates into substantial benefits, which are especially important when deploying models in environments with limited resources: a smaller model footprint on disk and in memory, faster inference on hardware with efficient low-precision arithmetic, and lower power consumption.
The core challenge is mapping the wide range and high precision of FP32 values to a much more limited set of lower-precision values while minimizing the loss of information.
For integer quantization (like INT8), this typically involves an affine mapping defined by a scale factor (S) and a zero-point (Z). The scale factor is a positive float that determines the step size between quantized levels, and the zero-point is an integer corresponding to the real value 0.0.
The mapping from a real value x (FP32) to its quantized integer representation $x_q$ (e.g., INT8) is:

$$x_q = \text{clip}\!\left(\text{round}\!\left(\frac{x}{S}\right) + Z,\; Q_{\min},\; Q_{\max}\right)$$

Here, round rounds to the nearest integer, Z shifts the result so that the real value 0.0 maps correctly, and clip enforces the value to stay within the representable range of the target integer type (e.g., $[-128, 127]$ for signed INT8, so $Q_{\min} = -128$ and $Q_{\max} = 127$).
To convert back (dequantize) for subsequent operations or analysis, the approximate real value $x'$ is recovered using:

$$x' = S\,(x_q - Z)$$
This mapping introduces quantization error, the difference between the original value x and the dequantized value x′. The goal of quantization methods is to choose S and Z carefully to minimize this error across the distribution of weights and activations in the network.
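As a concrete illustration of the formulas above, the following minimal NumPy sketch derives S and Z from a tensor's observed min/max range (an asymmetric mapping to signed INT8) and performs a quantize/dequantize round trip. The quantize and dequantize helpers are illustrative names, not part of any library.

```python
import numpy as np

QMIN, QMAX = -128, 127  # representable range of signed INT8

def quantize(x, S, Z):
    """x_q = clip(round(x / S) + Z, QMIN, QMAX)"""
    return np.clip(np.round(x / S) + Z, QMIN, QMAX).astype(np.int8)

def dequantize(x_q, S, Z):
    """x' = S * (x_q - Z): approximate reconstruction of the real value."""
    return S * (x_q.astype(np.float32) - Z)

# Derive S and Z from the observed range of a tensor (asymmetric mapping).
x = (np.random.randn(1000) * 3.0).astype(np.float32)
S = (x.max() - x.min()) / (QMAX - QMIN)               # step size between levels
Z = int(np.clip(np.round(QMIN - x.min() / S), QMIN, QMAX))  # integer mapped to real 0.0

x_q = quantize(x, S, Z)
x_rec = dequantize(x_q, S, Z)
print("max round-trip error:", np.abs(x - x_rec).max())
```

For values inside the calibrated range, the round-trip error is at most about S/2, which is why S should be chosen to match the actual value distribution.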
Two common choices for the mapping range exist: symmetric quantization, where the range is centered on zero (e.g., $[-\max|x|, +\max|x|]$) so the zero-point is fixed at Z = 0, and asymmetric (affine) quantization, which uses the full observed $[x_{\min}, x_{\max}]$ range and generally requires a nonzero zero-point.
Furthermore, the scale S and zero-point Z can be determined at different granularities: per-tensor, where a single (S, Z) pair covers an entire weight or activation tensor, or per-channel, where each output channel gets its own scale, which tracks value ranges more tightly and usually reduces quantization error, as sketched below.
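This NumPy sketch (helper names are illustrative) shows how S and Z are computed for symmetric versus asymmetric mappings, and how per-channel scales differ from a single per-tensor scale.

```python
import numpy as np

QMIN, QMAX = -128, 127  # signed INT8

def asymmetric_params(x):
    """Affine mapping over the full [min, max] range; Z is generally nonzero."""
    S = (x.max() - x.min()) / (QMAX - QMIN)
    Z = int(np.clip(np.round(QMIN - x.min() / S), QMIN, QMAX))
    return S, Z

def symmetric_params(x):
    """Symmetric mapping over [-max|x|, +max|x|]; Z is fixed at 0."""
    return np.abs(x).max() / QMAX, 0

w = np.random.randn(8, 16).astype(np.float32)   # e.g., 8 output channels

# Per-tensor: one (S, Z) pair for the whole weight tensor.
S_tensor, Z_tensor = symmetric_params(w)

# Per-channel: one scale per output channel (row), tracking each channel's
# range more tightly and typically reducing quantization error.
S_channels = np.abs(w).max(axis=1) / QMAX
```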
There are two primary approaches to applying quantization:
Post-Training Quantization (PTQ): This is the simpler method. You start with a fully trained FP32 model and convert its weights to the target lower precision (e.g., INT8). Activations can be quantized either dynamically at inference time or statically using precomputed ranges. For static quantization, determining the appropriate scale factors (S) and zero-points (Z) for activations requires a calibration step: a small, representative dataset (a few hundred samples) is fed through the FP32 model to collect statistics (e.g., min/max ranges) of the activation distributions at various points in the network. These statistics inform the choice of S and Z. PTQ is convenient because it requires no retraining, but it can sometimes lead to a noticeable drop in accuracy, especially for more aggressive quantization (like INT8) or sensitive models. A minimal PyTorch workflow is sketched below.
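The sketch assumes eager-mode static PTQ with the torch.quantization API; SmallNet and the random calibration tensors are placeholders standing in for a real model and a representative calibration set.

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    """Toy model; QuantStub/DeQuantStub mark where tensors enter/leave INT8."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)            # FP32 -> quantized
        x = self.relu(self.conv(x))
        return self.dequant(x)       # quantized -> FP32

model = SmallNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 backend
prepared = torch.quantization.prepare(model)    # insert observers

# Calibration: run representative samples to collect activation statistics
# (min/max ranges) that determine S and Z for each observed tensor.
with torch.no_grad():
    for _ in range(200):
        prepared(torch.randn(1, 3, 32, 32))     # placeholder calibration data

quantized = torch.quantization.convert(prepared)  # swap in INT8 modules

# Dynamic quantization is even simpler: weights are INT8, activation ranges
# are computed on the fly (commonly applied to nn.Linear / nn.LSTM layers).
# dyn_model = torch.quantization.quantize_dynamic(
#     SmallNet().eval(), {nn.Linear}, dtype=torch.qint8)
```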
Quantization-Aware Training (QAT): This approach integrates the quantization process into the training loop. It simulates the effects of quantization during forward passes by inserting "fake" quantization nodes that mimic the rounding and clipping behavior of INT8 inference. Gradients are typically passed straight through these nodes during backpropagation (using the Straight-Through Estimator technique). By fine-tuning the model with this simulated quantization in place, the network learns to become more robust to the precision reduction. QAT usually recovers most, if not all, of the accuracy lost by PTQ, but it requires access to the original training pipeline, data, and additional training compute.
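To make the idea of a "fake" quantization node and the Straight-Through Estimator concrete, here is a minimal, self-contained sketch using a custom autograd function. It is illustrative only; in practice the fake-quant machinery is inserted for you (e.g., by torch.quantization.prepare_qat in PyTorch).

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Forward: simulate INT8 quantize/dequantize. Backward: pass the gradient
    straight through, ignoring the non-differentiable round/clip."""

    @staticmethod
    def forward(ctx, x, scale, zero_point, qmin, qmax):
        q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
        return (q - zero_point) * scale   # dequantized, so downstream ops stay FP32

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-Through Estimator: gradient w.r.t. x passes through unchanged;
        # scale, zero_point, qmin, qmax receive no gradient here.
        return grad_output, None, None, None, None

# During QAT, weights (and activations) pass through the fake-quant node in the
# forward pass, so the loss "sees" the effects of rounding and clipping.
w = torch.randn(16, 16, requires_grad=True)
w_q = FakeQuantSTE.apply(w, 0.05, 0, -128, 127)
loss = (w_q ** 2).mean()
loss.backward()                    # gradients still reach w thanks to the STE
print(w.grad.shape)                # torch.Size([16, 16])
```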
Illustrative comparison of common quantization formats regarding size per parameter and potential relative inference speedup. Actual speedups heavily depend on the model, task, and target hardware capabilities.
While quantization offers substantial benefits, it's not a magic bullet. Accuracy can degrade, particularly with aggressive formats such as INT8 or with sensitive models, and the realized speedups depend heavily on whether the target hardware and kernels support efficient low-precision arithmetic. There is also an engineering cost: PTQ needs a calibration dataset, and QAT requires access to the training pipeline, data, and additional compute.
Major deep learning frameworks provide tools to facilitate quantization. PyTorch, for example, offers the torch.quantization module for PTQ (static and dynamic) and QAT workflows, with support for different backends (e.g., FBGEMM for x86, QNNPACK for ARM) targeting efficient execution on the corresponding hardware.

Quantization is a powerful and widely used technique for optimizing deep learning models. By reducing the numerical precision of weights and activations, it significantly cuts down model size, accelerates inference speed, and lowers power consumption, making complex CNNs practical for deployment on a wider range of devices, especially at the edge. Choosing between PTQ and QAT, selecting the right precision format (FP16, INT8), and understanding the target hardware capabilities are important steps in successfully applying quantization.