Quantization changes the numerical representation of a model, a process distinct from optimizing the computational graph through model compilation. It reduces the numerical precision of a model's weights and, in some cases, activations. By converting 32-bit floating-point (FP32) numbers to lower-precision formats like 8-bit integers (INT8) or 8-bit floating-point (FP8), you can achieve significant performance gains. The primary benefits are threefold:

- Smaller models: storing weights in 8 bits instead of 32 cuts the model's memory footprint by roughly 4x.
- Lower memory bandwidth: less data moves between memory and compute units for each inference pass.
- Faster computation: hardware with dedicated low-precision units, such as INT8 Tensor Cores, executes 8-bit matrix multiplications far faster than FP32.
At its core, quantization maps a range of high-precision floating-point values to a smaller range of low-precision integer values. This is achieved using a scale factor and, optionally, a zero-point. The fundamental affine transformation is:
real_value = scale × (quantized_value − zero_point)

Here, the scale is a floating-point number that defines the step size of the quantization, and the zero_point is an integer that ensures the real value zero maps correctly to a quantized value.
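This mapping is simple enough to sketch directly. The following snippet (NumPy, with an illustrative tensor, scale, and zero_point) quantizes FP32 values to INT8 and reconstructs them, making the rounding error visible.

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Apply the affine transformation: FP32 values -> INT8 codes."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Invert the mapping: INT8 codes -> approximate FP32 values."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.8, -0.33, 0.0, 0.91, 2.4], dtype=np.float32)
scale, zero_point = 0.02, 0        # illustrative parameters

q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
print(q)            # integer codes in [-128, 127]
print(x_hat - x)    # small residuals: the quantization error
```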
This transformation can be either symmetric or asymmetric:
- Asymmetric quantization maps the full observed floating-point range [min, max] onto the full integer range (e.g., [0, 255] for UINT8). This is often used for activations, especially after a ReLU function where all values are non-negative.
- Symmetric quantization fixes the zero_point to 0, simplifying the mapping. The floating-point range is centered around zero (e.g., [−absmax, +absmax]) and mapped to the integer range (e.g., [−127, 127] for INT8). This is frequently used for model weights, which are often normally distributed around zero.

Mapping of floating-point ranges to integer ranges for asymmetric and symmetric quantization.
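The two schemes differ only in how scale and zero_point are derived from the observed range. A minimal sketch, using randomly generated stand-ins for activations and weights:

```python
import numpy as np

def asymmetric_params(x, qmin=0, qmax=255):
    """Map the full observed [min, max] range onto UINT8."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def symmetric_params(x, qmax=127):
    """Center the range on zero for INT8; zero_point is fixed at 0."""
    absmax = float(np.abs(x).max())
    return absmax / qmax, 0

activations = np.random.rand(1024).astype(np.float32) * 6.0   # non-negative, e.g. post-ReLU
weights = np.random.randn(1024).astype(np.float32) * 0.1      # roughly zero-centered

print(asymmetric_params(activations))   # (scale, zero_point near 0)
print(symmetric_params(weights))        # (scale, 0)
```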
The method you use to determine the scale and zero_point parameters defines your quantization strategy. The two primary approaches are Post-Training Quantization and Quantization-Aware Training.
PTQ is the most straightforward method. It involves quantizing a model after it has already been fully trained in FP32. The process requires a "calibration" step where you run a small, representative sample of your validation data through the model. During this pass, the quantization framework records the dynamic range (minimum and maximum values) of the activations for each layer. These observed ranges are then used to calculate the optimal scale and zero_point parameters for quantizing the activations. The weights are quantized directly from the trained checkpoint.
Because PTQ doesn't involve retraining, it's fast and easy to implement. However, for some models, the precision loss can lead to an unacceptable drop in accuracy.
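As an illustration, here is a sketch of static PTQ using PyTorch's FX graph-mode API. The tiny model and random calibration batches are stand-ins for a real network and a representative calibration set, and the exact API surface varies somewhat across PyTorch versions.

```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# Stand-in for a trained FP32 model.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 10),
).eval()

# 1. Attach observers that will record activation ranges.
qconfig_mapping = get_default_qconfig_mapping("fbgemm")
example_inputs = (torch.randn(1, 3, 32, 32),)
prepared = prepare_fx(model, qconfig_mapping, example_inputs)

# 2. Calibration: run representative data to collect min/max statistics.
calib_batches = [torch.randn(4, 3, 32, 32) for _ in range(10)]  # stand-in data
with torch.no_grad():
    for batch in calib_batches:
        prepared(batch)

# 3. Convert: replace FP32 ops with INT8 ops using the observed ranges.
quantized_model = convert_fx(prepared)
```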
When PTQ results in poor model performance, QAT is the solution. QAT simulates the effects of quantization during the training process. It works by inserting "fake" or "simulated" quantization operations into the model's graph. These operations take FP32 inputs, simulate the rounding and clamping effects of converting to a lower precision format like INT8, and then convert the result back to FP32 for the subsequent layer.
This process forces the model's training algorithm (e.g., SGD) to learn weights that are resilient to the information loss from quantization. The model learns to adjust its weights to minimize the quantization error. While QAT is more complex and requires a full retraining cycle, it almost always yields higher accuracy than PTQ, often approaching the original FP32 model's performance.
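The QAT flow in PyTorch is similar, except that the prepared model contains fake-quantization ops and is fine-tuned before conversion. The model, data, and training loop below are placeholders; a real workflow would fine-tune the pretrained network on its original dataset.

```python
import torch
from torch.ao.quantization import get_default_qat_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_qat_fx, convert_fx

# Stand-in for a pretrained FP32 model.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4)
)

# Insert fake-quantization ops so training sees INT8 rounding and clamping.
qconfig_mapping = get_default_qat_qconfig_mapping("fbgemm")
prepared = prepare_qat_fx(model.train(), qconfig_mapping, (torch.randn(8, 16),))

# Short fine-tuning loop on synthetic data (placeholder for real training).
optimizer = torch.optim.SGD(prepared.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
for _ in range(100):
    x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
    loss = loss_fn(prepared(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Convert the adapted weights into a real INT8 model.
quantized_model = convert_fx(prepared.eval())
```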
Comparison of the workflows for Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ is a post-processing step, while QAT integrates quantization simulation into the training loop.
While INT8 is the workhorse of quantization, newer formats are emerging to handle the unique demands of modern models, especially large language models (LLMs).
As discussed, INT8 offers a 4x model size reduction and significant speedups on hardware with dedicated integer acceleration units. GPUs like the NVIDIA A100 provide Tensor Cores that are highly optimized for INT8 matrix multiplication, delivering a substantial performance boost over FP16 and FP32. The performance difference is not trivial; it represents a step-function improvement in throughput.
Theoretical peak performance for different numerical formats on an NVIDIA A100 GPU. The jump to INT8 is significant for inference throughput.
For massive models like transformers, which can have activation values with very large dynamic ranges, INT8's fixed-point representation can sometimes be too restrictive and lead to accuracy degradation. This is where 8-bit floating-point, or FP8, comes in.
FP8 is not an integer format. It retains the structure of a floating-point number with a sign, exponent, and mantissa, just with fewer bits. This allows it to represent a much wider range of values than INT8, at the cost of precision between those values. There are two primary FP8 variants, supported by NVIDIA's Hopper Transformer Engine:

- E4M3 (4 exponent bits, 3 mantissa bits): more precision but a narrower dynamic range, typically used for weights and activations in the forward pass.
- E5M2 (5 exponent bits, 2 mantissa bits): less precision but a wider dynamic range, typically used for gradients during training.
FP8 is a newer technique requiring support in both hardware (e.g., NVIDIA H100 GPUs) and software frameworks. It represents the frontier of model optimization, providing a balance between the dynamic range of floating-point numbers and the computational efficiency of an 8-bit data type.
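Recent PyTorch releases (2.1+) expose these FP8 dtypes for storage and casting, which is enough to illustrate the range-versus-precision trade-off; actual FP8 matrix math still requires Hopper-class hardware and libraries such as Transformer Engine. A small sketch:

```python
import torch

# E4M3 keeps more mantissa bits (precision); E5M2 keeps more exponent bits (range).
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    print(dtype, "max representable value:", torch.finfo(dtype).max)

# Round-tripping through FP8 shows how coarsely 8-bit floats resolve values.
x = torch.tensor([0.1234, 1.5, 300.0])
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    print(dtype, x.to(dtype).to(torch.float32).tolist())
```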
When implementing quantization, you rarely perform the low-level numerical conversions yourself. Instead, you use high-level tools that integrate these techniques into the deployment workflow.
- Inference optimizers: tools such as NVIDIA TensorRT handle PTQ almost automatically. You provide an FP32 model and a calibration dataset, and the tool generates a highly optimized, quantized engine.
- Framework APIs: PyTorch provides the torch.ao.quantization API for inserting quantization stubs and fine-tuning the model. TensorFlow has similar capabilities integrated into its tf.quantization module and the TFLite Converter.

A common strategy is to start with PTQ due to its simplicity. If accuracy then falls below an acceptable, product-defined threshold, invest the additional engineering effort to implement QAT. For very large models on the latest hardware, exploring FP8 can provide an additional performance edge. You can also apply mixed-precision quantization, where sensitive layers that contribute most to accuracy loss are kept in FP16 or FP32, while the rest of the model is converted to INT8.
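With PyTorch's FX quantization API, for instance, a sensitive layer can be excluded from quantization by assigning it no qconfig; the module name "classifier" below is hypothetical:

```python
from torch.ao.quantization import get_default_qconfig_mapping

qconfig_mapping = get_default_qconfig_mapping("fbgemm")
# Leave the (hypothetical) accuracy-sensitive "classifier" module in floating point;
# everything else still receives the default INT8 qconfig.
qconfig_mapping.set_module_name("classifier", None)
```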
With a model that is now not only structurally optimized but also numerically efficient, we are ready to serve it. The next step is to deploy this artifact using a production-grade inference server that can handle concurrent requests, manage multiple models, and further enhance performance through techniques like dynamic batching.