To understand how quantization achieves model compression and acceleration, we first need to look at how numbers are represented digitally within a computer. Large Language Models, like most deep learning models, typically perform calculations using floating-point numbers. Quantization involves converting these numbers into formats that require less memory, often fixed-point or integer representations.
Floating-point numbers are the standard way computers represent real numbers (numbers with fractional parts). They offer a wide dynamic range, meaning they can represent both very small and very large numbers, along with reasonable precision. Common formats in machine learning include:

- FP32 (single precision): 1 sign bit, 8 exponent bits, 23 mantissa bits (32 bits total)
- FP16 (half precision): 1 sign bit, 5 exponent bits, 10 mantissa bits (16 bits total)
- BF16 (bfloat16): 1 sign bit, 8 exponent bits, 7 mantissa bits (16 bits total)

Structure of common floating-point formats used in deep learning.
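As a quick sense check of how these formats differ in range and precision, the short snippet below (assuming PyTorch is available) prints the bit width, the spacing between representable values near 1.0 (`eps`), and the largest finite value for each format.

```python
import torch

# Compare bit width, precision near 1.0, and dynamic range of common formats.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} bits={info.bits:2d}  eps={info.eps:.3e}  max={info.max:.3e}")

# torch.float32   bits=32  eps=1.192e-07  max=3.403e+38
# torch.float16   bits=16  eps=9.766e-04  max=6.550e+04
# torch.bfloat16  bits=16  eps=7.812e-03  max=3.390e+38
```

Note how BF16 keeps nearly the full FP32 range but with much coarser precision, while FP16 trades range for somewhat finer precision.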
While floating-point numbers provide flexibility, their storage and computational requirements can be substantial, especially for LLMs with billions of parameters. Operations on 32-bit floats are generally slower and more energy-intensive than operations on smaller integer types.
Fixed-point representation is an alternative way to represent real numbers. Unlike floating-point, where the position of the decimal (or binary) point can "float", in fixed-point representation, the position of the binary point is fixed.
A fixed-point number is essentially an integer that is implicitly scaled by a predetermined factor. For example, you might decide to use an 8-bit signed integer (ranging from -128 to 127) to represent numbers between -1.0 and +1.0. To do this, you would define a scaling factor. If you choose a scaling factor of $2^7 = 128$, then:

- the real value 0.75 is stored as the integer round(0.75 × 128) = 96, and
- the stored integer 96 is interpreted as 96 / 128 = 0.75.
The number of bits allocated to the integer part and the fractional part determines the range and precision.
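Here is a minimal Python sketch of that scheme; the function names and the $2^7$ scaling factor are illustrative choices, not a standard API.

```python
SCALE = 128                      # 2^7: seven fractional bits
INT_MIN, INT_MAX = -128, 127     # 8-bit signed integer range

def to_fixed(x: float) -> int:
    """Encode a real number near [-1.0, 1.0) as a scaled 8-bit integer."""
    q = round(x * SCALE)
    return max(INT_MIN, min(INT_MAX, q))   # clamp to the representable range

def to_float(q: int) -> float:
    """Decode the stored integer back into the real value it represents."""
    return q / SCALE

print(to_fixed(0.75))    # 96
print(to_float(96))      # 0.75
print(to_fixed(0.7501))  # also 96: values closer together than 1/128 collapse
```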
Compared to floating-point:

- Advantages: arithmetic reduces to plain integer operations, which are faster and more energy-efficient on most hardware, and each value occupies fewer bits.
- Drawbacks: the representable range is fixed in advance and much narrower, and the precision is uniform across that range, so values outside it must be clamped and fine detail can be lost.
Many quantization techniques convert FP32 weights and/or activations directly into low-precision integer types, such as:

- INT8: 8-bit integers, representing 256 distinct values (-128 to 127 signed, or 0 to 255 unsigned).
- INT4: 4-bit integers, representing only 16 distinct values (for example -8 to 7 when signed).
These integer types offer significant memory savings (e.g., INT8 uses 4x less memory than FP32) and allow for much faster computation on hardware equipped with specialized integer arithmetic units (often found in modern CPUs and GPUs/TPUs).
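The memory arithmetic is easy to sanity-check. As a hypothetical example, the snippet below estimates weight storage for a 7-billion-parameter model at different precisions (weights only, ignoring activations, optimizer state, and quantization metadata such as scales).

```python
# Rough weight-storage estimate for a hypothetical 7B-parameter model.
num_params = 7_000_000_000
bits_per_weight = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

for fmt, bits in bits_per_weight.items():
    gigabytes = num_params * bits / 8 / 1e9
    print(f"{fmt}: {gigabytes:.1f} GB")

# FP32: 28.0 GB
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```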
When we use INT8 or INT4 in quantization, we are effectively using a fixed-point representation. The conversion process from floating-point involves determining a scale factor (similar to the fixed-point example above) and often a zero-point (or offset). These parameters map the original floating-point range to the target integer range.
For instance, to map a floating-point range $[R_{min}, R_{max}]$ to an 8-bit unsigned integer range $[0, 255]$, the mapping might look something like this:

$$\text{Integer Value} = \text{round}\left(\frac{\text{Floating-Point Value} - Z}{S}\right)$$

Where:

- $S$ is the scale factor, for example $S = \frac{R_{max} - R_{min}}{255}$, which spreads the width of the floating-point range across the 256 available integer values.
- $Z$ is the zero-point (or offset); in this formulation it is the floating-point value that maps to integer 0, i.e. $Z = R_{min}$.
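A small NumPy sketch of this mapping is shown below. The helper names (`compute_params`, `quantize`, `dequantize`) are illustrative only; production libraries add details such as per-tensor versus per-channel scales, symmetric variants, and careful rounding.

```python
import numpy as np

def compute_params(r_min: float, r_max: float, num_levels: int = 256):
    """Return the scale S and zero-point Z mapping [r_min, r_max] to [0, num_levels - 1]."""
    scale = (r_max - r_min) / (num_levels - 1)
    zero_point = r_min                      # the float value that maps to integer 0
    return scale, zero_point

def quantize(x: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Map floating-point values to 8-bit integers in [0, 255]."""
    q = np.round((x - zero_point) / scale)
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Approximately recover the original floating-point values."""
    return q.astype(np.float32) * scale + zero_point
```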
This mapping allows us to perform computations using efficient integer arithmetic, converting back to floating-point only when necessary. The choice between floating-point and fixed-point/integer representations involves a trade-off between numerical range/precision and computational efficiency/memory usage. Quantization aims to leverage the efficiency of lower-precision formats while minimizing the impact on the model's accuracy. The next sections explore how this mapping is performed in practice.
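Continuing the sketch above, a quick round trip through the integer representation shows the kind of error quantization introduces for values inside the chosen range:

```python
x = np.array([-0.8, 0.0, 0.37, 1.85], dtype=np.float32)

scale, zero_point = compute_params(r_min=-1.0, r_max=2.0)
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)

print(q)                  # [ 17  85 116 242]
print(np.abs(x - x_hat))  # errors stay below about half the scale (~0.006 here)
```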