While standard 8-bit integer (INT8) quantization offers significant benefits, the sheer scale of modern Large Language Models (LLMs) often necessitates even more aggressive compression. Pushing below INT8, primarily into the 4-bit regime, unlocks substantial reductions in memory footprint and can accelerate computation, making it feasible to run larger models on resource-constrained hardware. However, this increased compression comes at the cost of potential accuracy degradation, requiring careful consideration of the techniques employed.
This section explores prominent low-bit quantization strategies used for LLMs, focusing on methods operating below INT8 precision.
The most direct extension of INT8 quantization is reducing the precision further to 4 bits (INT4). Using only 4 bits means we can represent $2^4 = 16$ distinct integer values. This immediately halves the memory requirement compared to INT8 and offers a fourfold reduction compared to 16-bit floating-point formats (like FP16 or BF16).
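For a rough sense of scale, here is a quick estimate of the weight-storage footprint of a hypothetical 7-billion-parameter model at different bit widths (the parameter count is illustrative, and the figures ignore the small overhead of scales and zero-points):

```python
# Approximate weight storage for a hypothetical 7B-parameter model.
params = 7e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 2**30   # bits -> bytes -> GiB
    print(f"{name}: ~{gib:.1f} GiB")

# FP16: ~13.0 GiB, INT8: ~6.5 GiB, INT4: ~3.3 GiB
```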
Similar to INT8, INT4 quantization maps the original floating-point values (weights or activations) to this limited set of integers. The mapping typically uses a scaling factor (s) and optionally a zero-point (z):

$$q = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right) + z,\; q_{\min},\; q_{\max}\right)$$

where $x$ is the original value and $q$ is its quantized integer representation.
The clipping bounds $q_{\min}$ and $q_{\max}$ depend on whether a signed or unsigned representation is chosen (e.g., [-8, 7] for signed INT4 or [0, 15] for unsigned). The scaling factor (s) and zero-point (z) can be determined per-tensor, per-channel, or even per-group of weights within a channel. Group-wise quantization (e.g., grouping 64 or 128 values within a tensor and calculating scales/zero-points for each group) has become common in LLM quantization because it provides finer granularity than per-channel scaling, helping to preserve accuracy at lower bit depths like INT4.
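To make the group-wise scheme concrete, here is a minimal NumPy sketch of asymmetric INT4 quantization with per-group scales and zero-points; the function names and the group size of 64 are illustrative choices, not taken from any particular library:

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=64):
    """Asymmetric INT4 quantization with one scale/zero-point per group."""
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / 15.0, 1e-8)   # 16 levels: 0..15
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(groups / scale) + zero_point, 0, 15).astype(np.uint8)
    return q, scale, zero_point

def dequantize_int4_groupwise(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4096).astype(np.float32)
q, s, z = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(q, s, z).reshape(-1)
print("mean absolute error:", np.abs(w - w_hat).mean())
```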
Challenges: Representing the wide dynamic range of LLM weights and activations with only 16 distinct levels is challenging. Outlier values, which can be crucial for model performance, are particularly susceptible to large quantization errors. Consequently, vanilla INT4 post-training quantization often leads to unacceptable accuracy loss. Advanced PTQ algorithms like GPTQ and AWQ (discussed later) were specifically developed to mitigate this degradation when targeting INT4. Furthermore, efficient hardware support for INT4 matrix multiplication is essential for realizing performance gains, although this is becoming more common in modern GPUs and accelerators.
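The effect of a single outlier on a coarse per-tensor INT4 scheme can be seen in a small experiment like the one below (a simplified symmetric quantizer with synthetic data):

```python
import numpy as np

def fake_quant_int4_symmetric(x):
    """Symmetric per-tensor INT4 quantize-dequantize (levels -8..7)."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=4096).astype(np.float32)
w_outlier = w.copy()
w_outlier[0] = 8.0                     # one large outlier stretches the scale

err_clean = np.abs(w - fake_quant_int4_symmetric(w)).mean()
err_outlier = np.abs(w_outlier - fake_quant_int4_symmetric(w_outlier)).mean()
print(f"mean error without outlier: {err_clean:.4f}")
print(f"mean error with outlier:    {err_outlier:.4f}")   # much larger
```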
Instead of integers, low-bit quantization can also utilize floating-point representations. Formats like FP4 (4-bit floating-point) and FP8 (8-bit floating-point) allocate bits differently than integers, typically reserving bits for a sign, an exponent, and a mantissa.
For example, a hypothetical FP4 format might use 1 bit for the sign, 2 bits for the exponent, and 1 bit for the mantissa. This structure allows low-bit floating-point numbers to represent a wider dynamic range compared to INT formats of the same bit width, albeit with potentially lower precision between representable numbers. The exponent bits enable the representation of both very small and very large numbers more effectively than a linear integer mapping. This can be advantageous for handling the outlier values often present in LLM weights and activations.
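The spacing of representable values in such a format can be enumerated directly. The sketch below decodes the hypothetical 1-sign/2-exponent/1-mantissa layout mentioned above, assuming an exponent bias of 1 and subnormal handling when the exponent field is zero; note how the levels cluster near zero and spread out toward the extremes:

```python
def fp4_e2m1_magnitudes(bias=1):
    """Enumerate the magnitudes of a 1-sign / 2-exponent / 1-mantissa format."""
    values = set()
    for e in range(4):          # 2 exponent bits -> fields 0..3
        for m in range(2):      # 1 mantissa bit  -> fields 0..1
            if e == 0:          # subnormal: no implicit leading 1
                values.add((m / 2) * 2.0 ** (1 - bias))
            else:               # normal: implicit leading 1
                values.add((1 + m / 2) * 2.0 ** (e - bias))
    return sorted(values)

print(fp4_e2m1_magnitudes())
# [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0] -- denser near zero, wider steps above
```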
FP8 has gained significant traction, particularly with native hardware support in newer architectures like NVIDIA's Hopper and Ada Lovelace GPUs. Two common FP8 variants are:
- E4M3: 1 sign bit, 4 exponent bits, and 3 mantissa bits, offering finer precision but a narrower dynamic range.
- E5M2: 1 sign bit, 5 exponent bits, and 2 mantissa bits, trading precision for a wider dynamic range.
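With PyTorch 2.1 or newer, which exposes these FP8 dtypes for storage and casting (full arithmetic still requires specialized kernels), the precision/range trade-off can be observed directly; the input values here are arbitrary:

```python
import torch

x = torch.tensor([0.0123, 1.2345, 345.6])

# Round-trip through each FP8 variant to see how values are represented.
e4m3 = x.to(torch.float8_e4m3fn).to(torch.float32)   # more mantissa bits
e5m2 = x.to(torch.float8_e5m2).to(torch.float32)     # more exponent bits

print("original:", x)
print("E4M3    :", e4m3)   # finer rounding, but smaller maximum value
print("E5M2    :", e5m2)   # coarser rounding, but wider dynamic range
```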
While FP4 is less standardized and hardware support is rarer than for FP8, the concept illustrates an alternative approach to low-bit representation. The choice between low-bit integer and floating-point formats involves trade-offs between dynamic range coverage, precision, and the availability of efficient computational kernels on the target hardware.
NF4 is a specialized 4-bit data type introduced in the QLoRA paper. It's designed based on the observation that weights in pre-trained neural networks often follow a zero-centered normal distribution. NF4 is an information-theoretically optimal format for data conforming to this distribution.
Instead of using uniformly spaced quantization levels like standard INT4, NF4 uses non-uniform levels determined by the quantiles of a standard normal distribution (N(0,1)). Specifically, the 16 representable values in NF4 are chosen such that each value represents an equal 1/16 portion of the probability mass (area under the curve) of the N(0,1) distribution. This means that NF4 allocates more precision to values near zero, where the bulk of the normally distributed weights reside, and less precision to the rarer, larger magnitude values in the tails.
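The quantile construction can be sketched as follows; this is a simplified illustration of the idea rather than the exact NF4 codebook used by bitsandbytes (which additionally pins an exact zero and the interval endpoints):

```python
import numpy as np
from scipy.stats import norm

def normal_quantile_levels(num_levels=16):
    """Levels at the centers of equal-probability slices of N(0, 1),
    rescaled so the largest magnitude maps to 1."""
    probs = (np.arange(num_levels) + 0.5) / num_levels   # bin midpoints
    levels = norm.ppf(probs)                             # inverse CDF of N(0, 1)
    return levels / np.abs(levels).max()

print(np.round(normal_quantile_levels(), 3))
# Non-uniform spacing: levels are dense near 0 and sparse in the tails.
```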
Key characteristics of NF4:
- Non-uniform quantization levels derived from the quantiles of a standard normal distribution, rather than evenly spaced steps.
- Information-theoretically optimal for zero-centered, normally distributed data, with more precision allocated near zero where most pre-trained weights lie.
- Typically used as a weight-only storage format: values are dequantized to a higher-precision type before matrix multiplication.
NF4 is often used in conjunction with techniques like double quantization (quantizing the quantization parameters themselves) to further reduce memory overhead. Libraries like bitsandbytes provide efficient implementations for NF4 quantization and computation.
The diagram above conceptually illustrates the difference between uniform quantization (like standard INT4) and non-uniform quantization (like NF4). Uniform quantization uses evenly spaced levels, while NF4 concentrates its representational power around zero, reflecting the typical distribution of LLM weights. Note that this is a simplified 1D representation; actual quantization involves scaling.
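In practice, loading a model with NF4 weights usually goes through the Hugging Face transformers integration with bitsandbytes, roughly as sketched below; exact parameter names can differ across library versions, and the model identifier is only a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight storage
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```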
Selecting the appropriate low-bit technique depends on several factors:
- Accuracy requirements: How much quality degradation can the application tolerate?
- Tensor statistics: Do the weights and activations contain outliers or a wide dynamic range that favors floating-point or non-uniform formats?
- Hardware and software support: Are efficient kernels available for the chosen format on your deployment stack (e.g., bitsandbytes for NF4, TensorRT-LLM for INT4/FP8)?

Moving below INT8 introduces complexities but is often a necessary step for deploying the largest LLMs efficiently. Understanding the properties and trade-offs of INT4, low-bit FP formats, and specialized types like NF4 is fundamental to applying these techniques effectively. The success of these methods, however, heavily relies on the calibration data and the sophistication of the post-training quantization algorithms used, which we will explore in subsequent sections.