After understanding the fundamental difference between floating-point and fixed-point representations, let's focus on the specific integer data types commonly used in model quantization. The core idea is to replace the high-precision floating-point numbers (like 32-bit floats, or FP32) used for weights and sometimes activations in the original LLM with lower-precision integers. This drastically reduces the model's memory footprint and can significantly speed up computations, especially on hardware optimized for integer arithmetic.
The most prevalent integer types in quantization are 8-bit integers (INT8) and 4-bit integers (INT4).
An 8-bit integer uses 8 bits to represent a number. This allows for 2⁸ = 256 distinct values. Depending on whether we need to represent negative numbers, INT8 can be signed, covering the range -128 to 127, or unsigned, covering the range 0 to 255.
Using INT8 reduces the data size by a factor of 4 compared to FP32 (since FP32 uses 32 bits). This directly translates to a 4x reduction in model size if only weights are quantized, and potentially significant memory savings for activations during inference. Many modern CPUs and GPUs have specialized instructions for performing INT8 computations quickly, leading to substantial speedups. INT8 quantization often provides a good balance between computational efficiency and maintaining model accuracy, making it a popular choice for many applications.
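As a concrete illustration, the short NumPy sketch below prints the two INT8 value ranges and applies a simple symmetric (absmax) mapping to a random weight tensor. The array size, the random values, and the choice of mapping scheme are illustrative assumptions; the actual mapping schemes are the subject of later sections.

```python
import numpy as np

# Value ranges available with 8 bits (2**8 = 256 distinct values).
print(np.iinfo(np.int8))    # signed:   -128 .. 127
print(np.iinfo(np.uint8))   # unsigned:    0 .. 255

# Simple symmetric (absmax) quantization of FP32 weights to signed INT8,
# used here purely for illustration.
weights_fp32 = np.random.randn(1024).astype(np.float32)

scale = np.abs(weights_fp32).max() / 127.0   # map the largest magnitude to 127
weights_int8 = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)
weights_dequant = weights_int8.astype(np.float32) * scale

print("storage FP32:", weights_fp32.nbytes, "bytes")  # 4096 bytes
print("storage INT8:", weights_int8.nbytes, "bytes")  # 1024 bytes (4x smaller)
print("max abs error:", np.abs(weights_fp32 - weights_dequant).max())
```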
For even greater compression and potential speed gains, 4-bit integers (INT4) can be used. A 4-bit integer allows for only 2⁴ = 16 distinct values.
Moving from FP32 to INT4 represents an 8x reduction in data size. This is particularly attractive for deploying very large models on resource-constrained devices or for fitting larger models into available GPU memory. However, representing the original range of floating-point values with only 16 distinct levels is much more challenging than with 256 levels (INT8). Consequently, INT4 quantization generally introduces more quantization error and can lead to a noticeable drop in model accuracy if not applied carefully. Advanced techniques, which we will cover later (like GPTQ and AWQ in Chapter 3), are often required to achieve good performance with INT4.
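To see how much coarser 16 levels are than 256, the sketch below quantizes the same random values with a simple symmetric uniform scheme at 8 and 4 bits and compares the average reconstruction error. The data and the specific scheme are illustrative assumptions, not a particular published method.

```python
import numpy as np

def uniform_quantize(x, num_bits):
    """Symmetric uniform quantization to a signed grid of 2**num_bits levels,
    returned in dequantized form so the error can be measured directly."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for INT8, 7 for INT4
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

x = np.random.randn(100_000).astype(np.float32)

for bits in (8, 4):
    err = np.abs(x - uniform_quantize(x, bits)).mean()
    print(f"INT{bits}: mean abs reconstruction error = {err:.5f}")
```

The 4-bit run shows a substantially larger error than the 8-bit run on the same data, which is the gap that methods like GPTQ and AWQ work to close.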
While INT8 and INT4 are the most common, research and specific applications sometimes explore other bit widths, such as 16-bit floating-point formats (FP16, BF16) for milder compression or more aggressive 2-bit and even 1-bit (binary) integer schemes.
The chart below visualizes the reduction in bit width for common data types used in quantization compared to the standard FP32.
Comparison of the number of bits used by different numerical data types relevant to model quantization. Lower bit widths lead to smaller model sizes and potentially faster inference.
The choice of integer data type involves a trade-off. Lower bit widths (like INT4) provide greater memory savings and potential speedups but increase the risk of accuracy loss due to the coarser representation. Higher bit widths (like INT8) retain more precision but offer less compression. The selection depends on the specific LLM, the target hardware, and the acceptable tolerance for performance degradation.
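A rough back-of-the-envelope calculation makes this trade-off tangible. The sketch below estimates weight storage for a hypothetical 7-billion-parameter model at each bit width; the parameter count is an assumption chosen for illustration, and real quantized models carry some extra overhead for scales, zero-points, and any layers kept in higher precision.

```python
# Weight storage for a hypothetical 7-billion-parameter model.
num_params = 7_000_000_000   # hypothetical parameter count

for name, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4)]:
    size_gb = num_params * bits / 8 / 1e9
    print(f"{name:5s}: {size_gb:6.1f} GB")

# FP32 :   28.0 GB
# INT8 :    7.0 GB   (4x smaller)
# INT4 :    3.5 GB   (8x smaller)
```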
Understanding these integer types is foundational. In the following sections, we'll explore how we map the original floating-point values onto the limited range offered by these integers using different schemes and granularities.