Having established the fundamental challenges in deploying large language models efficiently, this chapter concentrates on quantization, a primary method for reducing model size and accelerating inference. While basic quantization offers benefits, achieving significant gains without sacrificing model accuracy requires more sophisticated approaches.
This chapter moves past introductory concepts to cover advanced quantization methodologies applied specifically to LLMs. You will learn to apply post-training quantization (PTQ) and quantization-aware training (QAT) to large models, work with extreme low-bit and mixed-precision quantization strategies, take advantage of hardware acceleration for quantized operations, and evaluate the fidelity and performance of the resulting models.
We will examine the practical application of these techniques, including outlier handling, accuracy preservation, and the interaction between quantization and hardware accelerators. The chapter concludes with a hands-on exercise implementing PTQ and QAT.
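As a brief preview of that hands-on work, the sketch below applies dynamic post-training quantization to a small stand-in module using PyTorch's built-in `torch.ao.quantization.quantize_dynamic` utility. The toy layer sizes and module are illustrative assumptions, not the exact models or workflow used later in the chapter.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer MLP block; in practice a full LLM would be loaded here.
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
).eval()

# Dynamic PTQ: weights are converted to int8 offline, activations are
# quantized on the fly at inference time. No calibration data is required.
quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # restrict quantization to the linear layers
    dtype=torch.qint8,
)

# The quantized module is a drop-in replacement for inference.
x = torch.randn(1, 768)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 768])
```

Dynamic quantization is only the simplest starting point; the sections that follow cover static PTQ with calibration, QAT, and lower-bit schemes where this naive approach breaks down.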
2.1 Quantization Fundamentals Revisited
2.2 Post-Training Quantization (PTQ)
2.3 Quantization-Aware Training (QAT)
2.4 Extreme Quantization
2.5 Mixed-Precision Quantization Strategies
2.6 Hardware Acceleration for Quantized Operations
2.7 Evaluating Fidelity and Performance of Quantized LLMs
2.8 Hands-on Practical: Implementing PTQ and QAT