Once a model has been quantized using methods like Post-Training Quantization or Quantization-Aware Training, practical questions arise: how are these low-precision models saved, loaded, and executed efficiently? Standard model serialization does not always handle the specific structures and metadata (such as scaling factors or zero-points) that low-precision representations like INT4 or INT8 require.
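To make the role of this metadata concrete, the sketch below shows how an INT8 weight tensor, together with a stored scale and zero-point, is mapped back to approximate floating-point values. The numbers are illustrative placeholders, not values from any particular model or format.

```python
import numpy as np

# Illustrative example: quantized formats store low-precision integers
# plus metadata (scale, zero-point) needed to reconstruct float values.
w_int8 = np.array([-128, -64, 0, 64, 127], dtype=np.int8)
scale = 0.02       # example scaling factor stored as metadata
zero_point = 0     # example zero-point stored as metadata

# Dequantization: w_float ≈ scale * (w_int8 - zero_point)
w_float = scale * (w_int8.astype(np.float32) - zero_point)
print(w_float)     # approximate reconstruction of the original weights
```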
This chapter addresses these practical considerations by introducing common formats and software tools used in the quantized LLM ecosystem. We will cover:

- Common quantized model formats, including GGUF, GPTQ, and AWQ, and how they store low-precision weights and metadata.
- Loading and running quantized models with Hugging Face Transformers and Optimum.
- bitsandbytes for performing efficient low-bit operations during inference (see the sketch after this paragraph).
- Tools for converting models between formats and loading them for inference.

You will gain familiarity with converting models to these formats and using the associated tooling to load and run them effectively.
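As a preview of the workflows covered in sections 5.5 and 5.6, the sketch below loads a causal language model in 4-bit precision through the Transformers integration with bitsandbytes. The model identifier is a placeholder, and the example assumes the transformers, accelerate, and bitsandbytes packages are installed alongside a CUDA-capable GPU.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder model; substitute any causal LM from the Hugging Face Hub.
model_id = "facebook/opt-1.3b"

# Quantize weights to 4-bit NF4 on the fly at load time via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Run a short generation to confirm the quantized model works end to end.
inputs = tokenizer("Quantized models can run on", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This path quantizes an existing full-precision checkpoint at load time, which is distinct from loading a checkpoint already saved in GGUF, GPTQ, or AWQ form; the later sections of this chapter cover both approaches.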
5.1 Overview of Common Quantized Model Formats
5.2 GGUF: Structure and Usage
5.3 GPTQ Format: Library Support and Implementation
5.4 AWQ Format Details
5.5 Working with Hugging Face Transformers and Optimum
5.6 Using bitsandbytes for Quantization
5.7 Tools for Model Conversion and Loading
5.8 Practice: Converting and Loading Quantized Formats