Large Language Models (LLMs) often require substantial compute and memory. Model quantization offers a set of techniques for making these models smaller and faster by representing numerical values, such as weights and activations, with lower-precision data types. This reduction is essential for deploying LLMs efficiently, especially on devices with limited resources.
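To make the size reduction concrete, the back-of-the-envelope sketch below compares the storage needed for a model's weights at different precisions. The 7-billion-parameter count is just an assumed example, and INT4 is treated as half a byte per value (two values packed per byte).

```python
# Rough memory estimate for storing model weights at different precisions.
# The 7B parameter count is an assumed example, not a specific model.
NUM_PARAMS = 7_000_000_000

BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for dtype, nbytes in BYTES_PER_VALUE.items():
    gib = NUM_PARAMS * nbytes / 1024**3  # convert bytes to GiB
    print(f"{dtype}: ~{gib:.1f} GiB")
```

Going from FP32 to INT8 cuts weight storage by roughly 4x, before accounting for the metadata (scales, zero-points) that quantization adds.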
This chapter establishes the groundwork for understanding model quantization. We begin by situating quantization within the broader context of model compression. You will then learn the primary motivations for quantizing LLMs, focusing on benefits such as a reduced memory footprint and faster inference.
We will cover how numbers are represented digitally, comparing standard floating-point formats (like FP32) with the fixed-point and low-precision integer formats (such as INT8 or INT4) used in quantization. Key concepts, including quantization schemes (symmetric vs. asymmetric) and granularity (per-tensor, per-channel, per-group), will be explained. Finally, we will present methods for measuring the error that quantization introduces and give a high-level overview of the two main approaches, Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), setting the stage for subsequent chapters.
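As a preview of these ideas, here is a minimal NumPy sketch of per-tensor symmetric and asymmetric INT8 quantization, followed by a simple error measurement. The helper functions and the random weight tensor are illustrative stand-ins, not part of any particular library.

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Symmetric quantization: real zero maps to integer 0; scale set by max |x|."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for INT8
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def quantize_asymmetric(x, num_bits=8):
    """Asymmetric quantization: a zero-point shifts the range to cover [min(x), max(x)]."""
    qmin, qmax = 0, 2 ** num_bits - 1                   # [0, 255] for unsigned 8-bit
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

# A small random "weight tensor" standing in for a real layer's weights.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(4, 8)).astype(np.float32)

q_sym, s_sym = quantize_symmetric(w)
w_sym = q_sym.astype(np.float32) * s_sym                # dequantize

q_asym, s_asym, zp = quantize_asymmetric(w)
w_asym = (q_asym.astype(np.float32) - zp) * s_asym      # dequantize

# A simple error metric: mean squared error between original and reconstruction.
print("symmetric MSE: ", np.mean((w - w_sym) ** 2))
print("asymmetric MSE:", np.mean((w - w_asym) ** 2))
```

Symmetric schemes are often used for weights, while asymmetric schemes can better fit values whose distributions are not centered at zero; Sections 1.5 and 1.7 examine these choices and error metrics in more detail.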
1.1 Introduction to Model Compression
1.2 Why Quantize Large Language Models?
1.3 Representing Numbers: Floating-Point vs. Fixed-Point
1.4 Integer Data Types in Quantization
1.5 Quantization Schemes: Symmetric vs. Asymmetric
1.6 Quantization Granularity Options
1.7 Measuring Quantization Error
1.8 Overview of Quantization Techniques