Having established fundamental concepts such as number representations, quantization schemes, and granularity, we can now introduce the two primary strategies for quantizing a model: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). These approaches differ significantly in how and when quantization is applied relative to the model's training phase. Understanding their core principles is essential before we examine specific algorithms in later chapters.
Post-Training Quantization, as the name suggests, quantizes a model after it has already been trained using standard high-precision floating-point numbers (typically FP32). It is often the simplest and quickest way to apply quantization.
The general workflow for PTQ is straightforward: start from a trained FP32 model, determine quantization parameters (the scale S and zero-point Z) for its weights and, optionally, its activations, convert the values to the target integer format, and evaluate the resulting model to confirm that accuracy remains acceptable.
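As a concrete illustration, the sketch below applies per-tensor symmetric INT8 quantization to a single weight tensor with PyTorch. The function names and the choice of a symmetric scheme are assumptions made for this example, not a fixed part of any particular PTQ recipe.

```python
import torch

def quantize_weights_int8(w: torch.Tensor):
    """Per-tensor symmetric INT8 quantization of a weight tensor (illustrative)."""
    # The scale maps the largest absolute weight onto the INT8 range [-127, 127].
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an FP32 approximation of the original weights."""
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)          # stand-in for a trained FP32 weight matrix
q, scale = quantize_weights_int8(w)
w_hat = dequantize(q, scale)
print("max absolute rounding error:", (w - w_hat).abs().max().item())
```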
A common step in many PTQ methods is calibration. This involves feeding a small, representative dataset (the calibration dataset) through the FP32 model to observe the typical range and distribution of activation values. These observed ranges are then used to calculate more accurate quantization parameters (S and Z) for the activations, aiming to minimize the quantization error introduced. Methods that quantize activations based on runtime statistics without a calibration step are often termed dynamic quantization, while those using pre-computed statistics from a calibration set are called static quantization.
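The sketch below shows one way static calibration can be implemented for a single layer's activations: record the observed range over a calibration set with a forward hook, then derive the asymmetric scale S and zero-point Z from that range. The `model`, `layer`, and `calibration_loader` names are placeholders, and the simple min/max observer is only one of several range-estimation strategies.

```python
import torch

def calibrate_activation_range(model, layer, calibration_loader, device="cpu"):
    """Observe the min/max of one layer's output over a small calibration set."""
    outputs = []
    hook = layer.register_forward_hook(lambda mod, inp, out: outputs.append(out.detach()))
    obs_min, obs_max = float("inf"), float("-inf")
    model.eval()
    with torch.no_grad():
        for batch in calibration_loader:          # assumed to yield input tensors
            model(batch.to(device))
            act = outputs.pop()
            obs_min = min(obs_min, act.min().item())
            obs_max = max(obs_max, act.max().item())
    hook.remove()
    return obs_min, obs_max

def asymmetric_qparams(obs_min, obs_max, n_bits=8):
    """Derive scale S and zero-point Z mapping [obs_min, obs_max] to [0, 2^n - 1]."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = max((obs_max - obs_min) / (qmax - qmin), 1e-12)  # avoid a zero scale
    zero_point = int(round(qmin - obs_min / scale))
    return scale, max(qmin, min(qmax, zero_point))           # clamp Z into range
```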
The primary advantage of PTQ is its ease of implementation. It doesn't require access to the original training dataset or the training infrastructure, making it accessible even when you only have the pre-trained model file. However, applying quantization after training can sometimes lead to a noticeable drop in model accuracy, particularly when quantizing to very low bit-widths (like INT4 or below) or for models sensitive to precision changes. When basic PTQ methods prove insufficient, more advanced PTQ techniques (covered in Chapter 3) or QAT might be necessary.
Quantization-Aware Training takes a different approach by incorporating the effects of quantization during the model training or fine-tuning process. Instead of quantizing a fully trained model, QAT simulates the lower-precision behavior while the model's weights are still being updated.
The core idea is to insert operations into the model's computation graph that mimic the quantization and dequantization steps (FP32→INT8→FP32). These are often called "fake quantization" nodes. During the forward pass of training, weights and/or activations are quantized and then immediately dequantized before being used in subsequent operations. This forces the model to learn parameters that are robust to the noise and information loss introduced by the quantization process.
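A minimal version of such a fake quantization step might look like the following, assuming asymmetric INT8 parameters computed elsewhere. The output stays in FP32, but it can only take values that the integer grid can represent.

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float, zero_point: int,
                  qmin: int = -128, qmax: int = 127) -> torch.Tensor:
    """Simulate INT8 quantization in FP32: quantize, clamp, then dequantize."""
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    # Downstream layers receive FP32 values that already carry the rounding error.
    return (q - zero_point) * scale
```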
Since the rounding step inside quantization has a gradient of zero almost everywhere, standard backpropagation cannot train through it directly. A technique called the Straight-Through Estimator (STE) is therefore commonly used: during the backward pass, STE treats the rounding operation as the identity, allowing gradients to flow back through the simulated quantization nodes and enabling the model weights to be updated with standard optimization algorithms like SGD or Adam.
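One common way to realize this is a custom autograd function that performs fake quantization in the forward pass and passes gradients straight through in the backward pass, optionally zeroing them for values that were clipped to the range limits. This is a sketch of the idea, not the exact implementation used by any particular framework.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Fake quantization whose backward pass is the straight-through estimator."""

    @staticmethod
    def forward(ctx, x, scale, zero_point, qmin, qmax):
        q = torch.round(x / scale + zero_point)
        # Remember which elements fell inside the representable range.
        in_range = (q >= qmin) & (q <= qmax)
        ctx.save_for_backward(in_range.to(x.dtype))
        q = torch.clamp(q, qmin, qmax)
        return (q - zero_point) * scale

    @staticmethod
    def backward(ctx, grad_output):
        (in_range,) = ctx.saved_tensors
        # STE: treat rounding as the identity; zero gradients for clipped values.
        return grad_output * in_range, None, None, None, None

x = torch.randn(8, requires_grad=True)
y = FakeQuantSTE.apply(x, 0.05, 0, -128, 127)
y.sum().backward()                   # gradients flow despite the rounding step
```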
QAT typically starts with a pre-trained FP32 model and then fine-tunes it for a relatively small number of epochs with these fake quantization operations active. The main benefit of QAT is that it often yields higher accuracy compared to PTQ, especially for aggressive quantization targets (e.g., INT4). The model learns to compensate for quantization errors during training.
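Putting the pieces together, a QAT fine-tuning loop can be as simple as the sketch below: a linear layer whose weights are fake-quantized on every forward pass (here via the "detach trick" form of STE), trained on random data that stands in for a real fine-tuning set. The `QATLinear` class and the hyperparameters are illustrative choices, not a prescribed recipe.

```python
import torch
import torch.nn as nn

class QATLinear(nn.Module):
    """Linear layer whose weights are fake-quantized (symmetric INT8) on each forward."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.linear.weight
        scale = w.abs().max().detach() / 127.0
        w_q = torch.clamp(torch.round(w / scale), -127, 127) * scale
        # Detach trick = STE: forward uses w_q, backward sees the identity w.r.t. w.
        w_ste = w + (w_q - w).detach()
        return nn.functional.linear(x, w_ste, self.linear.bias)

# Brief fine-tuning with fake quantization active; toy batches stand in for real data.
model = QATLinear(64, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for step in range(100):
    inputs, labels = torch.randn(32, 64), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(inputs), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```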
However, QAT comes with its own set of challenges. It requires access to a representative training (or fine-tuning) dataset and the training infrastructure. The training process itself is more complex and computationally expensive than simply applying PTQ. Careful hyperparameter tuning might also be needed to ensure stable training.
The choice between PTQ and QAT often depends on the specific requirements of your application: PTQ is usually preferred when training data or compute is limited and a moderate target such as INT8 preserves enough accuracy, while QAT is worth the additional training cost when you need the best possible accuracy at aggressive bit-widths (e.g., INT4) and have access to a suitable fine-tuning pipeline.
Figure: Comparison of Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) workflows. PTQ applies quantization after training, often using calibration data, while QAT simulates quantization during a fine-tuning phase using training data.
In the following chapters, we will explore specific algorithms and practical implementations for both PTQ (Chapters 2 and 3) and QAT (Chapter 4), providing you with the tools to choose and apply the most suitable technique for your LLM quantization needs.