While Post-Training Quantization (PTQ) offers a straightforward way to quantize models after they have been trained, it can sometimes lead to a noticeable drop in accuracy, especially when moving to very low precision like 4-bit integers. When preserving accuracy is a primary concern and the results from PTQ are insufficient, Quantization-Aware Training (QAT) presents a different approach.
QAT integrates the quantization process directly into the model training or fine-tuning phase. By simulating the effects of lower precision arithmetic during training, the model learns to adapt its weights to minimize the accuracy loss caused by quantization. This often yields higher accuracy for the final quantized model compared to applying PTQ to the same original model, particularly for aggressive quantization targets.
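To make the idea concrete, the sketch below shows the "fake quantization" (quantize-dequantize) step that QAT-style training inserts into the forward pass. It is a minimal illustration, not any specific framework's API: the fake_quantize helper, the range-based scale and zero-point, and the 4-bit setting are assumptions for demonstration, and the straight-through gradient trick it uses is treated in detail later in this chapter.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize ("fake quantization"): snaps x onto a low-precision
    grid, then maps it back to float so the rest of the network runs as usual."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Asymmetric scale and zero-point derived from the tensor's observed range.
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(qmin - x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    x_dq = (q - zero_point) * scale
    # Straight-through estimator: forward uses the quantized values, while the
    # backward pass treats the rounding as identity so gradients still flow.
    return x + (x_dq - x).detach()

# Example: weights pass through fake quantization in the forward pass, so the
# training loss reflects the rounding error the deployed integer model will see.
w = torch.randn(8, 8, requires_grad=True)
loss = fake_quantize(w, num_bits=4).pow(2).sum()
loss.backward()
print(w.grad.shape)  # gradients reach the full-precision weights
```

Because the rounding error enters the loss at every training step, the optimizer nudges the full-precision weights toward values that survive quantization with less accuracy loss.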
In this chapter, you will learn the fundamentals of QAT, working through the topics listed below. Completing it will equip you with the knowledge to determine when QAT is appropriate and how to implement it to produce more accurate low-precision models.
4.1 Need for Quantization-Aware Training
4.2 Simulating Quantization Effects During Training
4.3 Straight-Through Estimator (STE)
4.4 Implementing QAT with Deep Learning Frameworks
4.5 Fine-tuning Models with Quantization Nodes
4.6 Benefits and Drawbacks of QAT vs. PTQ
4.7 Practical Considerations for QAT Execution
4.8 Hands-on Practical: Setting up a Simple QAT Run