Post-Training Quantization (PTQ) applies quantization directly to a model that has already been trained. This approach avoids computationally expensive retraining, making it an attractive option for reducing model size and improving inference speed. PTQ works by converting the model's weights, and sometimes activations, from high-precision formats such as FP32 to lower-precision integer types such as INT8 or INT4.
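To make the mapping from floats to integers concrete, here is a minimal sketch of symmetric INT8 quantization of a single weight tensor in NumPy. The helper names `quantize_symmetric` and `dequantize`, and the per-tensor (single scale) scheme, are illustrative assumptions rather than the API of any particular framework; later sections discuss calibration and more robust variants.

```python
# Minimal sketch: symmetric per-tensor INT8 quantization of a weight matrix.
# Helper names and the toy data are illustrative, not from a specific library.
import numpy as np

def quantize_symmetric(weights: np.ndarray, num_bits: int = 8):
    """Map float weights to signed integers using one scale for the tensor."""
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for INT8
    scale = np.abs(weights).max() / qmax        # single per-tensor scale
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

# Example: quantize a small random "weight matrix" and check the error.
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_symmetric(w)
w_hat = dequantize(q, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

The round-trip error printed at the end is the quantization error that PTQ methods try to keep small without retraining the model.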
This chapter covers the practical aspects of PTQ:
2.1 Principles of Post-Training Quantization
2.2 Calibration: Selecting Representative Data
2.3 Static vs. Dynamic Quantization
2.4 Common PTQ Algorithms
2.5 Handling Outliers in PTQ
2.6 Applying PTQ to LLM Layers
2.7 Limitations of Basic PTQ
2.8 Hands-on Practical: Applying Static PTQ