Post-Training Quantization (PTQ) applies quantization directly to a model that has already been trained. This approach avoids the need for computationally expensive retraining, making it an attractive option for reducing model size and improving inference speed. PTQ works by converting the model's weights, and sometimes activations, from high-precision formats like FP32 to lower-precision integer types such as INT8 or INT4.
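To make the core conversion concrete, here is a minimal sketch of affine (asymmetric) INT8 quantization of a single FP32 weight tensor using NumPy. The function names and the simple min/max range selection are illustrative assumptions, not the API of any particular toolkit; real PTQ pipelines add calibration, per-channel scales, and outlier handling, which later sections of this chapter cover.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map an FP32 tensor onto the INT8 range [-128, 127] (illustrative sketch)."""
    w_min, w_max = float(w.min()), float(w.max())
    # One scale/zero-point pair for the whole tensor, derived from its min/max range.
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    zero_point = int(round(-128 - w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an FP32 approximation of the original tensor."""
    return (q.astype(np.float32) - zero_point) * scale

# Stand-in for a trained layer's weights.
weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
reconstructed = dequantize_int8(q, scale, zp)
print("max quantization error:", np.abs(weights - reconstructed).max())
```

The reconstruction error printed at the end is the rounding error introduced by mapping continuous FP32 values onto 256 integer levels; keeping this error small without retraining is the central concern of the techniques in this chapter.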
This chapter covers the practical aspects of PTQ:
2.1 Principles of Post-Training Quantization
2.2 Calibration: Selecting Representative Data
2.3 Static vs. Dynamic Quantization
2.4 Common PTQ Algorithms
2.5 Handling Outliers in PTQ
2.6 Applying PTQ to LLM Layers
2.7 Limitations of Basic PTQ
2.8 Hands-on Practical: Applying Static PTQ