Model compression and acceleration in Large Language Models often rely on quantization. This technique converts the floating-point numbers typically used in deep learning calculations into formats that require less memory, such as fixed-point or integer representations. To grasp how quantization works, it helps to understand how numbers are represented digitally inside a computer.

## Floating-Point Representation

Floating-point numbers are the standard way computers represent real numbers (numbers with fractional parts). They offer a wide dynamic range, meaning they can represent both very small and very large values, with reasonable precision. Common formats in machine learning include:

- **FP32 (single precision):** Uses 32 bits per number. This is the default format for training and often for baseline inference in many frameworks. It consists of 1 sign bit, 8 exponent bits (determining the magnitude), and 23 bits for the mantissa or significand (determining the precision).
- **FP16 (half precision):** Uses 16 bits, halving memory usage compared to FP32. It has 1 sign bit, 5 exponent bits, and 10 mantissa bits. The smaller exponent range makes it more susceptible to underflow (values becoming zero) or overflow (values becoming infinity) when the values span a very wide range.
- **BF16 (BFloat16, Brain Floating-Point):** Also uses 16 bits. It keeps the same exponent range as FP32 (8 bits) but reduces the mantissa to 7 bits (plus 1 sign bit). This preserves the dynamic range of FP32, making it less prone to overflow/underflow issues than FP16, but it offers lower precision.

| Format | Sign bits | Exponent bits | Mantissa bits |
| ------ | --------- | ------------- | ------------- |
| FP32   | 1         | 8             | 23            |
| FP16   | 1         | 5             | 10            |
| BF16   | 1         | 8             | 7             |

*Structure of common floating-point formats used in deep learning.*

While floating-point numbers provide flexibility, their storage and computational requirements can be substantial, especially for LLMs with billions of parameters. Operations on 32-bit floats are generally slower and more energy-intensive than operations on smaller integer types.

## Fixed-Point Representation

Fixed-point representation is an alternative way to represent real numbers. Unlike floating-point, where the position of the decimal (or binary) point can "float", in fixed-point representation the position of the binary point is fixed.

A fixed-point number is essentially an integer that is implicitly scaled by a predetermined factor. For example, you might decide to use an 8-bit signed integer (ranging from -128 to 127) to represent numbers between -1.0 and +1.0. To do this, you would define a scaling factor. If you choose a scaling factor of $2^7 = 128$, then:

- The integer value 127 represents $127 / 128 \approx 0.992$.
- The integer value -128 represents $-128 / 128 = -1.0$.
- The integer value 0 represents $0 / 128 = 0.0$.

The number of bits allocated to the integer part and the fractional part determines the range and precision:

- More integer bits: wider range, less precision.
- More fractional bits: narrower range, more precision.
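To make the scaling concrete, here is a minimal sketch that encodes and decodes values with the 8-bit, scale-128 scheme from the example above. The function names (`to_fixed`, `from_fixed`) are illustrative and not taken from any particular library.

```python
import numpy as np

SCALE = 128  # 2**7: one sign bit, seven fractional bits

def to_fixed(x: np.ndarray) -> np.ndarray:
    """Encode floats in roughly [-1.0, 1.0) as 8-bit fixed-point integers."""
    return np.clip(np.round(x * SCALE), -128, 127).astype(np.int8)

def from_fixed(q: np.ndarray) -> np.ndarray:
    """Decode 8-bit fixed-point integers back to floats."""
    return q.astype(np.float32) / SCALE

values = np.array([-1.0, -0.5, 0.0, 0.25, 0.9921875])
encoded = to_fixed(values)        # [-128, -64, 0, 32, 127]
decoded = from_fixed(encoded)     # recovers the inputs up to rounding
print(encoded, decoded)
```

Because the binary point is fixed, values outside roughly $[-1, 1)$ are simply clipped, which previews the range limitation discussed next.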
Compared to floating-point:

- **Range:** Fixed-point has a much more limited dynamic range. Values outside the pre-defined range cannot be represented accurately.
- **Precision:** Precision is uniform across the representable range.
- **Hardware:** Arithmetic operations (addition, multiplication) on fixed-point numbers can often be implemented using simpler and faster integer logic units.

## Integer Representation in Quantization

Many quantization techniques convert FP32 weights and/or activations directly into low-precision integer types, such as:

- **INT8:** 8-bit integer, either signed (-128 to 127) or unsigned (0 to 255).
- **INT4:** 4-bit integer, typically signed (-8 to 7) or unsigned (0 to 15).

These integer types offer significant memory savings (e.g., INT8 uses 4x less memory than FP32) and allow much faster computation on hardware equipped with specialized integer arithmetic units, which are common in modern CPUs, GPUs, and TPUs.

When we use INT8 or INT4 in quantization, we are effectively using a fixed-point representation. The conversion from floating-point involves determining a scale factor (similar to the fixed-point example above) and often a zero-point (or offset). These parameters map the original floating-point range to the target integer range.

For instance, to map a floating-point range $[R_{min}, R_{max}]$ to an 8-bit integer range $[0, 255]$, the mapping might look like this:

$$ q = \text{round}\left( \frac{x}{S} \right) + Z $$

Where:

- $x$ is the original floating-point value and $q$ is its integer representation.
- $S$ is the scale factor (a float), e.g. $S = (R_{max} - R_{min}) / 255$ for this range.
- $Z$ is the zero-point: the integer in the target range that represents the floating-point value 0.0.

This mapping allows us to perform computations using efficient integer arithmetic, converting back to floating-point only when necessary. The choice between floating-point and fixed-point/integer representations involves a trade-off between numerical range/precision and computational efficiency/memory usage. Quantization aims to leverage the efficiency of lower-precision formats while minimizing the impact on the model's accuracy. The next sections explore how this mapping is performed in practice; the short sketch below previews the idea numerically.
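As a numerical preview of the affine mapping above, here is a minimal sketch assuming asymmetric, per-tensor quantization to unsigned 8-bit integers. The function and variable names (`compute_qparams`, `quantize`, `dequantize`) are illustrative rather than part of any specific framework.

```python
import numpy as np

def compute_qparams(x: np.ndarray, qmin: int = 0, qmax: int = 255):
    """Derive scale S and zero-point Z that map [x.min(), x.max()] onto [qmin, qmax]."""
    r_min, r_max = float(x.min()), float(x.max())
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)   # make sure 0.0 is representable
    scale = (r_max - r_min) / (qmax - qmin)
    zero_point = int(round(qmin - r_min / scale))     # integer that represents 0.0
    return scale, zero_point

def quantize(x: np.ndarray, scale: float, zero_point: int,
             qmin: int = 0, qmax: int = 255) -> np.ndarray:
    """q = clamp(round(x / S) + Z): the affine mapping from the text."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """x_hat = S * (q - Z): approximate recovery of the original floats."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.default_rng(0).normal(size=8).astype(np.float32)
s, z = compute_qparams(weights)
q = quantize(weights, s, z)
print(weights)
print(q, dequantize(q, s, z))  # dequantized values differ only by rounding error
```

Dividing by the scale and adding the zero-point is exactly the affine mapping above; dequantization inverts it, so the only information lost comes from rounding and clipping.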