Large Language Models (LLMs) often require substantial compute and memory. Model quantization offers a set of techniques for making these models smaller and faster by representing numerical values, such as weights and activations, with lower-precision data types. This reduction is essential for deploying LLMs efficiently, especially on devices with limited resources.
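To make the size reduction concrete, the back-of-the-envelope sketch below compares the storage needed for a model's weights at different precisions. The 7-billion-parameter count is just an assumed example, and INT4 is treated as half a byte per value (two values packed per byte).

```python
# Rough memory estimate for storing model weights at different precisions.
# The 7B parameter count is an assumed example, not a specific model.
NUM_PARAMS = 7_000_000_000

BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for dtype, nbytes in BYTES_PER_VALUE.items():
    gib = NUM_PARAMS * nbytes / 1024**3  # convert bytes to GiB
    print(f"{dtype}: ~{gib:.1f} GiB")
```

Going from FP32 to INT8 cuts weight storage by roughly 4x, before accounting for the metadata (scales, zero-points) that quantization adds.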
This chapter establishes the groundwork for understanding model quantization. We begin by situating quantization within the broader context of model compression. You will then learn the primary motivations for quantizing LLMs, focusing on benefits such as a reduced memory footprint and faster inference.
We will cover how numbers are represented digitally, comparing standard floating-point formats (like FP32) with the fixed-point and low-precision integer formats (such as INT8 or INT4) used in quantization. Key concepts, including quantization schemes (symmetric vs. asymmetric) and granularity (per-tensor, per-channel, per-group), will be explained. Finally, we will present methods for measuring the error that quantization introduces and give a high-level overview of the two main approaches, Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), setting the stage for subsequent chapters.
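As a preview of these ideas, here is a minimal NumPy sketch of per-tensor symmetric and asymmetric INT8 quantization, followed by a simple error measurement. The helper functions and the random weight tensor are illustrative stand-ins, not part of any particular library.

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Symmetric quantization: real zero maps to integer 0; scale set by max |x|."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for INT8
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def quantize_asymmetric(x, num_bits=8):
    """Asymmetric quantization: a zero-point shifts the range to cover [min(x), max(x)]."""
    qmin, qmax = 0, 2 ** num_bits - 1                   # [0, 255] for unsigned 8-bit
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

# A small random "weight tensor" standing in for a real layer's weights.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(4, 8)).astype(np.float32)

q_sym, s_sym = quantize_symmetric(w)
w_sym = q_sym.astype(np.float32) * s_sym                # dequantize

q_asym, s_asym, zp = quantize_asymmetric(w)
w_asym = (q_asym.astype(np.float32) - zp) * s_asym      # dequantize

# A simple error metric: mean squared error between original and reconstruction.
print("symmetric MSE: ", np.mean((w - w_sym) ** 2))
print("asymmetric MSE:", np.mean((w - w_asym) ** 2))
```

Symmetric schemes are often used for weights, while asymmetric schemes can better fit values whose distributions are not centered at zero; Sections 1.5 and 1.7 examine these choices and error metrics in more detail.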
1.1 Introduction to Model Compression
1.2 Why Quantize Large Language Models?
1.3 Representing Numbers: Floating-Point vs. Fixed-Point
1.4 Integer Data Types in Quantization
1.5 Quantization Schemes: Symmetric vs. Asymmetric
1.6 Quantization Granularity Options
1.7 Measuring Quantization Error
1.8 Overview of Quantization Techniques