Once a model has been quantized using methods like Post-Training Quantization or Quantization-Aware Training, practical questions arise: how are these low-precision models saved, loaded, and executed efficiently? Standard model serialization does not always handle the specific structures and metadata (such as scaling factors or zero-points) that low-precision representations like INT4 or INT8 require.
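To make the role of this metadata concrete, the sketch below shows how an INT8 weight tensor, together with a stored scale and zero-point, is mapped back to approximate floating-point values. The numbers are illustrative placeholders, not values from any particular model or format.

```python
import numpy as np

# Illustrative example: quantized formats store low-precision integers
# plus metadata (scale, zero-point) needed to reconstruct float values.
w_int8 = np.array([-128, -64, 0, 64, 127], dtype=np.int8)
scale = 0.02       # example scaling factor stored as metadata
zero_point = 0     # example zero-point stored as metadata

# Dequantization: w_float ≈ scale * (w_int8 - zero_point)
w_float = scale * (w_int8.astype(np.float32) - zero_point)
print(w_float)     # approximate reconstruction of the original weights
```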
This chapter addresses these practical considerations by introducing common formats and software tools used in the quantized LLM ecosystem. We will cover:

- Common quantized model formats, including GGUF, GPTQ, and AWQ, and how they store low-precision weights and metadata.
- Loading and running quantized models with Hugging Face Transformers and Optimum.
- bitsandbytes for performing efficient low-bit operations during inference (see the sketch after this paragraph).
- Tools for converting models between formats and loading them for inference.

You will gain familiarity with converting models to these formats and using the associated tooling to load and run them effectively.
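As a preview of the workflows covered in sections 5.5 and 5.6, the sketch below loads a causal language model in 4-bit precision through the Transformers integration with bitsandbytes. The model identifier is a placeholder, and the example assumes the transformers, accelerate, and bitsandbytes packages are installed alongside a CUDA-capable GPU.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder model; substitute any causal LM from the Hugging Face Hub.
model_id = "facebook/opt-1.3b"

# Quantize weights to 4-bit NF4 on the fly at load time via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Run a short generation to confirm the quantized model works end to end.
inputs = tokenizer("Quantized models can run on", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This path quantizes an existing full-precision checkpoint at load time, which is distinct from loading a checkpoint already saved in GGUF, GPTQ, or AWQ form; the later sections of this chapter cover both approaches.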
5.1 Overview of Common Quantized Model Formats
5.2 GGUF: Structure and Usage
5.3 GPTQ Format: Library Support and Implementation
5.4 AWQ Format Details
5.5 Working with Hugging Face Transformers and Optimum
5.6 Using bitsandbytes for Quantization
5.7 Tools for Model Conversion and Loading
5.8 Practice: Converting and Loading Quantized Formats