Quantizing a model changes its fundamental representation. Instead of storing weights and activations as standard 32-bit floating-point numbers (FP32), we now deal with lower-precision types like 8-bit integers (INT8) or even 4-bit integers (INT4). Simply saving these low-precision weights into a standard file isn't enough. We also need to store crucial metadata associated with the quantization process itself.
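To get a feel for the scale of this change, here is a rough back-of-the-envelope calculation for a hypothetical 7-billion-parameter model, counting weight storage only and ignoring scales and other metadata overhead:

```python
# Rough weight-only memory footprint for a hypothetical 7B-parameter model.
params = 7_000_000_000

bytes_fp32 = params * 4      # FP32: 4 bytes per weight
bytes_int8 = params * 1      # INT8: 1 byte per weight
bytes_int4 = params * 0.5    # INT4: half a byte per weight

print(f"FP32: {bytes_fp32 / 1e9:.1f} GB")  # ~28.0 GB
print(f"INT8: {bytes_int8 / 1e9:.1f} GB")  # ~7.0 GB
print(f"INT4: {bytes_int4 / 1e9:.1f} GB")  # ~3.5 GB
```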
Consider what happens during quantization: we map the original floating-point range onto a smaller integer range. This mapping requires parameters such as:

- Scale factors that relate one integer step to a span of floating-point values.
- Zero-points (offsets) that record which integer represents the floating-point value 0.0.
- Granularity information, such as whether a given scale and zero-point apply per tensor, per channel, or per group (block) of weights.
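As a concrete illustration, here is a minimal sketch of affine (asymmetric) INT8 quantization using NumPy. The function names are illustrative rather than taken from any particular library; the point is that the scale and zero-point it computes are exactly the metadata a quantized model file has to carry alongside the integer weights.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map an FP32 tensor onto the INT8 range, returning the metadata too."""
    w_min, w_max = float(weights.min()), float(weights.max())
    qmin, qmax = -128, 127

    scale = (w_max - w_min) / (qmax - qmin)          # FP32 units per integer step
    zero_point = int(round(qmin - w_min / scale))    # integer that represents 0.0

    q = np.clip(np.round(weights / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Reconstruct approximate FP32 values from INT8 data plus its metadata."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 8).astype(np.float32)
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)  # close to `weights`, but not exact
```

Without the scale and zero-point, the INT8 values on their own are meaningless; losing or mismatching this metadata corrupts the model even though the weights themselves are intact.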
Standard model serialization formats, often designed for FP32 or FP16 models, don't have dedicated structures to efficiently store and retrieve this metadata alongside the quantized weights. Trying to retrofit them can be cumbersome and inefficient.
This leads to the need for specialized quantized model formats. These formats are designed specifically to package low-precision weights together with all necessary quantization parameters, and sometimes model architecture details or configuration settings as well. The primary goals driving the development of these formats are compact storage of the low-precision weights, self-contained packaging of the associated quantization metadata, fast loading by inference engines, and a well-defined structure that different tools can agree on.
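As one illustration of this packaging idea, the safetensors library (assumed to be installed) lets you attach a string-to-string metadata header to a file of tensors, so quantization parameters can travel in the same file as the weights. The key names below are purely illustrative conventions, not part of any standard:

```python
import numpy as np
from safetensors.numpy import save_file

# Pretend these came out of a quantization step like the one sketched earlier.
q_weight = np.random.randint(-128, 128, size=(4, 8), dtype=np.int8)
scale, zero_point = 0.02, 3  # illustrative values

tensors = {"model.layers.0.weight": q_weight}
metadata = {                                        # keys are illustrative, not a standard
    "quant_method": "int8-affine",
    "model.layers.0.weight.scale": str(scale),
    "model.layers.0.weight.zero_point": str(zero_point),
}

# A single file now carries both the low-precision weights and the
# parameters needed to interpret them.
save_file(tensors, "quantized_layer.safetensors", metadata=metadata)
```

Dedicated quantized formats go further than this: GGUF, for example, defines typed key/value metadata and tensor naming conventions so that an inference engine can consume the file directly.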
In this chapter, we'll examine some of the most prevalent formats you'll encounter when working with quantized LLMs:
- GGUF: The format at the center of the llama.cpp ecosystem. It's designed to be extensible and store rich metadata, supporting various quantization types and model architectures. We will look at its structure and usage in detail later.
- GPTQ conventions: Models quantized with GPTQ are typically distributed as standard safetensors files, relying on associated configuration files for the quantization parameters. Libraries like AutoGPTQ or ExLlama know how to interpret these conventions (a sketch of inspecting such a checkpoint appears below).
- AWQ: The conventions used for models quantized with Activation-aware Weight Quantization, which also pair the stored weights with configuration metadata describing the quantization.

It's important to distinguish between the quantization algorithm (like PTQ, GPTQ, AWQ, QAT), which defines how the model weights and/or activations are converted to lower precision, and the file format, which defines how the resulting quantized model and its metadata are stored on disk. While some formats are closely tied to specific algorithms (like the conventions used with GPTQ), others, like GGUF, aim for broader applicability.
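To make the GPTQ conventions concrete, the sketch below inspects a checkpoint directory laid out the way GPTQ-quantized models are commonly distributed: a quantize_config.json next to a safetensors file. The directory name, file names, and JSON keys are assumptions based on common community practice, not a formal specification.

```python
import json
from pathlib import Path
from safetensors.numpy import load_file

checkpoint_dir = Path("llama-7b-gptq")  # hypothetical local checkpoint directory

# The quantization parameters usually live in a small JSON file beside the
# weights; key names such as "bits" and "group_size" are common but not universal.
quant_config = json.loads((checkpoint_dir / "quantize_config.json").read_text())
print("bits:", quant_config.get("bits"))
print("group_size:", quant_config.get("group_size"))

# The weights sit in an ordinary safetensors file. Tensors holding packed
# integer weights, zero-points, and scales only make sense to libraries
# that understand the GPTQ layout.
tensors = load_file(str(checkpoint_dir / "model.safetensors"))
for name, tensor in list(tensors.items())[:5]:  # peek at a few tensor names
    print(name, tensor.dtype, tensor.shape)
```

The format here is generic; it is the surrounding convention (file names, config keys, tensor layout) that makes the checkpoint a "GPTQ model" to the libraries that load it.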
The following diagram provides a conceptual overview of how these components relate:
Conceptual flow showing how quantization algorithms produce weights and parameters, which are then stored in specialized file formats for use by inference engines.
Understanding these formats is essential for practical deployment. They bridge the gap between the theoretical process of quantization and the efficient execution of the resulting compact models. The following sections will provide more detailed information on GGUF, the conventions associated with GPTQ, and the AWQ format, along with the tools used to create and interact with them.