Moving beyond the general principles of quantization, the effectiveness of any technique hinges critically on the specific numerical formats used to represent the compressed weights and, sometimes, activations. As we push into sub-8-bit regimes for LLMs, standard integer and floating-point types are often insufficient or suboptimal. Understanding the characteristics, advantages, and limitations of these low-bit data types is essential for implementing methods like GPTQ and AWQ and for leveraging hardware acceleration effectively.
This section examines the common and emerging data types used in LLM quantization, focusing on formats below INT8. We'll analyze their structure, representational capabilities, and the trade-offs they entail.
Integer formats are conceptually simpler, typically involving a scaling factor and sometimes a zero-point to map the quantized integers back to the approximate floating-point range.
While this course focuses on more advanced techniques, INT8 (8-bit integer) serves as a frequent baseline. Using 8 bits provides $2^8 = 256$ distinct levels. This often offers a good balance between compression and accuracy preservation, especially when using techniques like per-channel scaling. Hardware support for INT8 operations is widespread on both CPUs and GPUs, making it a well-established optimization target. Its representation follows the principles discussed earlier:

$$x \approx s \cdot (q - z)$$

where $x$ is the original floating-point value, $q$ is the INT8 value, $s$ is the scaling factor, and $z$ is the zero-point (an INT8 value representing the floating-point zero).
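As a concrete illustration, here is a minimal NumPy sketch of asymmetric per-tensor INT8 quantization and dequantization; the function names are illustrative rather than taken from any particular library.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Asymmetric per-tensor INT8 quantization: x ~ s * (q - z)."""
    qmin, qmax = -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    # The scale maps the observed floating-point range onto the 256 integer levels.
    s = (x_max - x_min) / (qmax - qmin)
    # The zero-point is the INT8 value that represents floating-point 0.0 exactly.
    z = int(round(qmin - x_min / s))
    q = np.clip(np.round(x / s) + z, qmin, qmax).astype(np.int8)
    return q, s, z

def dequantize_int8(q: np.ndarray, s: float, z: int) -> np.ndarray:
    """Map the INT8 values back to approximate floating-point values."""
    return s * (q.astype(np.float32) - z)

w = np.random.randn(4096).astype(np.float32)
q, s, z = quantize_int8(w)
w_hat = dequantize_int8(q, s, z)
print("max absolute error:", np.abs(w - w_hat).max())
```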
Reducing the bit width further, INT4 halves memory requirements compared to INT8 and enables significantly faster computation on compatible hardware. However, it provides only $2^4 = 16$ distinct representation levels. This extremely coarse quantization makes preserving accuracy much more challenging.
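In practice, INT4 weights are typically quantized in small groups (for example, 64 or 128 values sharing one scale) and packed two values per byte. The following sketch assumes symmetric per-group quantization with NumPy; the group size and function names are illustrative.

```python
import numpy as np

def quantize_int4_grouped(w: np.ndarray, group_size: int = 128):
    """Symmetric per-group INT4 quantization: each group shares one scale."""
    groups = w.reshape(-1, group_size)
    # Levels -8..7; the scale maps each group's largest magnitude onto 7.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack two signed 4-bit values into each byte (low nibble first)."""
    nibbles = (q.astype(np.uint8) & 0x0F).reshape(-1, 2)
    return (nibbles[:, 0] | (nibbles[:, 1] << 4)).astype(np.uint8)

w = np.random.randn(4096).astype(np.float32)
q, scales = quantize_int4_grouped(w)
packed = pack_int4(q)
print(f"{w.nbytes} bytes (FP32) -> {packed.nbytes} bytes packed, plus {scales.nbytes} bytes of scales")
```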
Research explores even lower bit widths such as INT3 ($2^3 = 8$ levels) and INT2 ($2^2 = 4$ levels). These offer maximum compression but suffer severe accuracy degradation. They are currently less common in practical deployments, usually require sophisticated techniques such as quantization-aware training or highly specialized algorithms to remain viable, and are frequently restricted to specific parts of the model.
Floating-point formats allocate bits differently, assigning some to an exponent and others to a mantissa (or significand). This allows them to represent a wider dynamic range compared to integers of the same bit width, albeit with varying precision across that range.
FP8 utilizes 8 bits, similar to INT8, but adopts a floating-point representation. Two main variants exist, defined by the number of exponent (E) and mantissa (M) bits, plus one sign bit:

- E4M3 (4 exponent bits, 3 mantissa bits): more mantissa bits provide finer precision over a narrower dynamic range, which generally suits inference weights and activations.
- E5M2 (5 exponent bits, 2 mantissa bits): more exponent bits provide a wider dynamic range at coarser precision.
FP8 formats strike a balance between range and precision: the exponent bits handle outlier values better than INT8's uniform grid, while still providing significant memory and computational savings over FP16/BF16. Native FP8 support is a feature of recent hardware such as NVIDIA H100/L40 GPUs, making it an increasingly important format for efficient LLM inference.
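To make the structure concrete, the helper below (an illustrative function, not a library API) decodes a sign/exponent/mantissa encoding into its floating-point value; NaN and Inf conventions, which differ between the FP8 variants, are omitted.

```python
def decode_minifloat(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode a small float encoding laid out as sign | exponent | mantissa.

    Handles normal and subnormal values only; NaN/Inf conventions differ
    between the FP8 variants and are omitted here for clarity.
    """
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    if exponent == 0:  # subnormal: no implicit leading 1
        return sign * (mantissa / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + mantissa / (1 << man_bits)) * 2.0 ** (exponent - bias)

# E4M3: more mantissa bits give finer steps but a smaller range (max finite value 448).
print(decode_minifloat(0b0_1111_110, exp_bits=4, man_bits=3))   # -> 448.0
# E5M2: more exponent bits give a much wider range at coarser precision (max 57344).
print(decode_minifloat(0b0_11110_11, exp_bits=5, man_bits=2))   # -> 57344.0
```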
FP4 pushes the envelope with only 4 bits in a floating-point structure. A common configuration is E2M1 (1 sign, 2 exponent, 1 mantissa).
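With a bias of 1 and no Inf/NaN encodings (the layout assumed here for illustration), every E2M1 bit pattern can be enumerated directly, which makes the coarseness obvious: only 15 distinct values remain once +0 and -0 coincide.

```python
# Enumerate every E2M1 bit pattern (1 sign, 2 exponent, 1 mantissa bit; bias = 1).
# Assumes no Inf/NaN encodings; the exact layout is an assumption for illustration.
values = set()
for bits in range(16):
    sign = -1.0 if bits & 0b1000 else 1.0
    exponent = (bits >> 1) & 0b11
    mantissa = bits & 0b1
    if exponent == 0:                       # subnormal: 0 or +/-0.5
        values.add(sign * mantissa * 0.5)
    else:                                   # normal: (1 + mantissa/2) * 2^(exponent - 1)
        values.add(sign * (1 + mantissa / 2) * 2.0 ** (exponent - 1))

print(sorted(values))
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```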
Recognizing the limitations of standard formats, researchers have developed specialized data types tailored to the observed distributions of values in LLMs.
NormalFloat 4-bit (NF4) is a non-uniform data type introduced with the QLoRA method. It is designed around the observation that weights in pre-trained LLMs typically follow a roughly normal distribution $\mathcal{N}(\mu, \sigma^2)$ centered around zero ($\mu = 0$). Its 16 levels are chosen from quantiles of this distribution, so representable values are packed densely where the weights are most concentrated, near zero, and spaced more sparsely toward the tails.
{"layout": {"xaxis": {"title": "Value", "range": [-1.1, 1.1], "zeroline": false}, "yaxis": {"showticklabels": false, "showgrid": false, "zeroline": false, "range": [-0.1, 1.1]}, "title": "Distribution of Representable Values (4-bit Formats, Scaled to [-1, 1])", "height": 250, "margin": {"t": 60, "b": 40, "l": 20, "r": 20}}, "data": [{"x": [-1.0, -0.857, -0.714, -0.571, -0.429, -0.286, -0.143, 0, 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1.0], "y": [1]*15, "mode": "markers", "name": "INT4 (Asymm, Zero=7)", "marker": {"color": "#339af0", "size": 8}}, {"x": [-1.0, -0.696, -0.525, -0.394, -0.28, -0.178, -0.082, 0.0, 0.082, 0.178, 0.28, 0.394, 0.525, 0.696, 1.0], "y": [0.5]*15, "mode": "markers", "name": "NF4 (NormalFloat)", "marker": {"color": "#20c997", "size": 8}}, {"x": [-1.0, -0.667, -0.5, -0.333, -0.25, -0.167, 0.0, 0.167, 0.25, 0.333, 0.5, 0.667, 1.0], "y": [0]*13, "mode": "markers", "name": "FP4 (E2M1, Hypothetical)", "marker": {"color": "#fd7e14", "size": 8}}]}
Comparison of representable values for INT4 (uniform spacing), NF4 (non-uniform, denser near zero), and a hypothetical FP4 E2M1 (non-uniform, limited values) when scaled to the range [-1, 1]. This illustrates the different precision distributions offered by each format.
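A minimal sketch of NF4-style block-wise quantization follows: each block is normalized by its absolute maximum, and every weight is snapped to the nearest entry of a small non-uniform codebook. The code values used here are the approximate, illustrative levels from the comparison above, not the exact 16-entry NF4 table from QLoRA, and the function names are not from any library.

```python
import numpy as np

# Approximate non-uniform code values, denser near zero (illustrative only;
# the exact QLoRA NF4 table has 16 asymmetric entries derived from normal quantiles).
NF4_CODES = np.array([-1.0, -0.696, -0.525, -0.394, -0.28, -0.178, -0.082, 0.0,
                      0.082, 0.178, 0.28, 0.394, 0.525, 0.696, 1.0])

def quantize_nf4_block(w: np.ndarray):
    """Block-wise NF4-style quantization: normalize by absmax, snap to nearest code."""
    absmax = float(np.abs(w).max())
    normalized = w / absmax
    # Index of the nearest codebook entry for each weight (stored in 4 bits).
    idx = np.abs(normalized[:, None] - NF4_CODES[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax

def dequantize_nf4_block(idx: np.ndarray, absmax: float) -> np.ndarray:
    return NF4_CODES[idx] * absmax

block = np.random.randn(64).astype(np.float32)   # QLoRA's default block size is 64
idx, absmax = quantize_nf4_block(block)
block_hat = dequantize_nf4_block(idx, absmax)
print("mean absolute error:", np.abs(block - block_hat).mean())
```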
The choice of data type involves navigating several trade-offs:
| Feature | INT8 | INT4 | FP8 (E4M3) | FP4 (E2M1) | NF4 |
|---|---|---|---|---|---|
| Bits | 8 | 4 | 8 | 4 | 4 |
| Distinct levels | 256 | 16 | ~253 | 15 | 16 |
| Type | Integer | Integer | Floating point | Floating point | Non-uniform codebook |
| Uniform spacing | Yes | Yes | No | No | No |
| Precision | Moderate | Low | Moderate (varies with magnitude) | Very low (varies with magnitude) | High near zero |
| Dynamic range | Limited (set by scale) | Very limited | Wider | Limited | Limited (set by scale) |
| Hardware support | Common | Emerging / custom kernels | Emerging (recent GPUs) | Research / custom kernels | Custom kernels |
| Common use | Weights and activations | Weights (PTQ) | Weights and activations | Weights (research) | Weights (QLoRA) |
Key Factors in Choosing a Format:

- Accuracy tolerance: fewer representation levels (INT4, FP4) demand stronger calibration methods such as GPTQ or AWQ and more careful evaluation.
- Hardware and kernel support: INT8 is broadly supported, FP8 requires recent GPUs such as NVIDIA H100/L40, and INT4, FP4, and NF4 rely on specialized kernels.
- Memory and latency targets: 4-bit formats halve storage relative to 8-bit formats and reduce memory bandwidth pressure accordingly.
- What is being quantized: weights-only quantization tolerates lower bit widths than quantizing activations as well.
- Value distribution: tensors with significant outliers favor floating-point or non-uniform formats (FP8, NF4) over a uniform integer grid.
In summary, the data type is not just a detail but a fundamental choice in LLM quantization. Formats like INT4, FP8, and NF4 enable the significant memory and computational savings required for deploying large models efficiently. However, they must be used thoughtfully, often in conjunction with sophisticated quantization algorithms and careful evaluation, to balance performance gains against potential accuracy impacts. Understanding these formats provides the basis for selecting and implementing the advanced techniques discussed in the following chapters.