Quantization maps floating-point values to a lower-precision integer range. This mapping requires two parameters: a scale factor ($S$) and a zero-point ($Z$). The scale defines the step size of the quantization, i.e., the change in real value represented by each increment of the quantized value. The zero-point ensures that the real number zero maps exactly to a quantized integer. For an affine (asymmetric) quantization scheme, the relationship is typically:
$$\text{real\_value} = S \times (\text{quantized\_value} - Z)$$
For symmetric quantization, the zero-point $Z$ is implicitly zero (or otherwise fixed). Handling these scale and zero-point parameters correctly throughout the compilation process is fundamental to maintaining model accuracy while unlocking the performance benefits of low-precision computation. The compiler bears the responsibility of managing, propagating, and optimizing calculations involving these parameters.
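To make the affine scheme concrete, here is a minimal NumPy sketch of quantize/dequantize helpers; the function names and the $[-1, 1]$ calibration range are illustrative, not a fixed API.

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization: q = clamp(round(x / S) + Z)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Affine dequantization: real = S * (q - Z)."""
    return scale * (q.astype(np.float32) - zero_point)

# Derive S and Z from an observed range [rmin, rmax] so that 0.0 maps
# exactly onto a quantized integer (the range here is illustrative).
rmin, rmax = -1.0, 1.0
scale = (rmax - rmin) / 255.0
zero_point = int(round(-128 - rmin / scale))

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
x_hat = dequantize(quantize(x, scale, zero_point), scale, zero_point)
print(np.max(np.abs(x - x_hat)))  # round-trip error is at most ~scale/2
```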
## Representing Quantization Parameters in IR
Compiler Intermediate Representations (IRs) need mechanisms to associate quantization parameters with tensors. Common approaches include:
- Tensor Type Attributes: Modern multi-level IRs like MLIR allow defining custom types. A quantized tensor type can directly embed the scale, zero-point, storage type (e.g., `i8`), and quantization scheme (affine/symmetric, per-tensor/per-channel) as type attributes. This makes the quantization information an integral part of the tensor's definition.
```mlir
// Example MLIR type for a per-tensor affine quantized tensor:
// syntax is <storage_type:expressed_type, scale:zero_point>
!quant.uniform<i8:f32, 0.0039:-128>
```
- Metadata: Simpler IRs might attach scale and zero-point information as metadata to tensor values or operations. This can be less structured and potentially harder to verify consistently.
- Explicit Quantize/Dequantize Operations: The IR can represent quantization parameters implicitly through dedicated `quantize` and `dequantize` operations that consume floating-point tensors and produce quantized tensors (or vice versa), embedding the parameters within the operation itself. Optimization passes then work to move, fuse, or eliminate these operations.
Using type attributes (as in MLIR) is generally preferred for advanced ML compilers, as it allows for stronger type checking and more principled propagation rules defined within the dialect's semantics.
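As a sketch of the type-attribute approach, a compiler written in Python might carry the same fields as MLIR's `!quant.uniform` in a small value type; this dataclass is illustrative, not MLIR's actual API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class QuantizedType:
    """Quantization parameters carried as part of a tensor's type."""
    storage_type: str           # e.g. "i8"
    expressed_type: str         # e.g. "f32"
    scale: float                # scale S
    zero_point: int             # zero-point Z
    channel_axis: Optional[int] = None  # set for per-channel schemes

# Structural equality gives the compiler a cheap, principled check, e.g.
# "concatenation inputs must have identical quantized types".
a = QuantizedType("i8", "f32", 0.0039, -128)
b = QuantizedType("i8", "f32", 0.0039, -128)
assert a == b
```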
## Propagation and Folding
As the compiler optimizes the computation graph, it must correctly propagate quantization parameters.
- Identity Operations: Operations like reshaping, transposing, or slicing typically preserve the quantization parameters of their input tensor. The compiler ensures the output tensor inherits the same scale and zero-point.
- Element-wise Operations (Same Parameters): For element-wise additions or subtractions where both inputs share identical scale and zero-point, the output can often retain the same parameters. However, overflow handling might necessitate adjustments or checks.
- Concatenation: When concatenating tensors along an axis, all input tensors must have the same scale and zero-point for the operation to be valid in the quantized domain without immediate requantization. The compiler must verify this or insert appropriate transformations.
- Constant Folding: Compilers can fold operations involving constants and quantization parameters. For instance, folding a `dequantize` operation applied to a constant tensor simply means computing the floating-point values directly at compile time. Sequences like `quantize(dequantize(x))` may simplify to `x` when the parameters match, subject to potential precision differences; a folding pass for this pattern is sketched after this list.
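A folding pass for the `quantize(dequantize(x))` pattern might look like the sketch below; the node fields (`name`, `input`, `params`) are assumptions about a hypothetical graph IR, not a specific framework's API.

```python
def fold_quant_dequant(op):
    """Fold quantize(dequantize(x)) -> x when parameters match exactly.

    `op` is a node in a hypothetical graph IR with `.name` (op kind),
    `.input` (producer node), and `.params` (a QuantizedType).
    """
    if op.name == "quantize" and op.input.name == "dequantize":
        if op.params == op.input.params:  # identical scale and zero-point
            return op.input.input         # exact round trip: safe to fold
    return op
```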
## Requantization: Handling Mismatched Parameters
Many operations, particularly element-wise additions or multiplications, produce results whose effective scale and zero-point differ from those of the inputs, or they combine inputs with different quantization parameters. Consider adding two tensors $T_a$ and $T_b$ with parameters $(S_a, Z_a)$ and $(S_b, Z_b)$ respectively, aiming for an output $T_c$ with parameters $(S_c, Z_c)$.
The ideal floating-point addition is:
$$\text{real}_c = \text{real}_a + \text{real}_b = S_a(q_a - Z_a) + S_b(q_b - Z_b)$$
We want to represent this result using the quantized value $q_c$:
$$S_c(q_c - Z_c) = S_a(q_a - Z_a) + S_b(q_b - Z_b)$$
Solving for $q_c$ yields:
$$q_c = Z_c + \frac{S_a}{S_c}(q_a - Z_a) + \frac{S_b}{S_c}(q_b - Z_b)$$
This calculation involves floating-point ratios ($S_a/S_c$, $S_b/S_c$) and cannot be executed directly using only integer arithmetic on typical low-precision hardware. The process of computing $q_c$ using primarily integer operations is called requantization.
Compilers implement requantization by approximating the scale ratios ($S_a/S_c$, $S_b/S_c$) using fixed-point arithmetic. This typically involves the following steps, sketched in code after the list:
- Representing each scale ratio as an integer multiplier $M$ and a right bit-shift $s$, such that $M/2^s \approx S_{\text{in}}/S_{\text{out}}$.
- Performing the calculation using wider integer accumulators (e.g., 32-bit integers for 8-bit inputs).
- Applying the integer multiplier $M$.
- Applying the right bit-shift $s$ (an efficient division by a power of two).
- Adding the output zero-point $Z_c$.
- Clamping the result to the valid range of the target quantized type (e.g., $[-128, 127]$ for INT8).
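Putting these steps together for the addition example, here is an integer-only sketch of a quantized add; it assumes the multipliers `Ma`, `Mb` and shifts `sa`, `sb` approximating $S_a/S_c$ and $S_b/S_c$ were precomputed (see the next code sketch).

```python
import numpy as np

def quantized_add(qa, za, Ma, sa, qb, zb, Mb, sb, zc, qmin=-128, qmax=127):
    """q_c = Z_c + (S_a/S_c)(q_a - Z_a) + (S_b/S_c)(q_b - Z_b), with each
    scale ratio pre-approximated as M / 2**s (fixed-point arithmetic)."""
    # Widen to int64, apply the multiplier, add half for round-to-nearest,
    # then arithmetic right shift (assumes s >= 1, true in practice for
    # the normalized multipliers computed below).
    ta = ((qa.astype(np.int64) - za) * Ma + (1 << (sa - 1))) >> sa
    tb = ((qb.astype(np.int64) - zb) * Mb + (1 << (sb - 1))) >> sb
    # Add the output zero-point and clamp to the int8 range.
    return np.clip(ta + tb + zc, qmin, qmax).astype(np.int8)
```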
The compiler's role is to:
- Determine the appropriate output scale and zero-point ($S_c$, $Z_c$), often chosen based on the activation range observed during calibration.
- Pre-compute the integer multipliers ($M$) and shifts ($s$) for the requantization steps.
- Insert the necessary integer arithmetic instructions (multiplications, shifts, additions, clamping) into the computation graph or kernel code.
Different strategies exist for calculating $M$ and $s$, balancing accuracy against computational cost; the approach popularized by Google's gemmlowp library is a common reference point, and a sketch of the idea follows.
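The decomposition below normalizes the multiplier into a fixed range, in the spirit of gemmlowp's 31-bit normalized multiplier; it illustrates the idea and is not gemmlowp's literal API.

```python
import math

def quantize_multiplier(ratio: float, bits: int = 31):
    """Decompose a positive scale ratio into (M, s) with M / 2**s ≈ ratio,
    normalizing M into [2**(bits-1), 2**bits) to maximize precision."""
    assert ratio > 0.0
    frac, exp = math.frexp(ratio)      # ratio = frac * 2**exp, frac in [0.5, 1)
    M = round(frac * (1 << bits))      # fixed-point mantissa
    if M == (1 << bits):               # frac rounded all the way up to 1.0
        M //= 2
        exp += 1
    return M, bits - exp               # total right shift s = bits - exp

M, s = quantize_multiplier(0.0039 / 0.0125)    # e.g. S_a / S_c
assert abs(M / 2**s - 0.0039 / 0.0125) < 1e-8  # tight approximation
```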
## Fusing Scale/Zero-Point Computations
Explicitly inserting `dequantize`, `requantize`, and `quantize` operations introduces overhead. A significant optimization is to fuse these parameter-related calculations directly into the main computational kernels.
- Dequantization Fusion: Instead of dequantizing inputs to float and then performing an operation (like convolution), the kernel can be generated to handle the scale and zero-point implicitly during the main computation. For example, a convolution $Y = \text{Conv}(X, W)$ might be implemented as:
$$S_y(q_y - Z_y) \approx \sum \left( S_x(q_x - Z_x) \times S_w(q_w - Z_w) \right)$$
The kernel uses integer arithmetic (e.g., INT8 dot products) accumulating into wider integers (e.g., INT32). The combined scaling factor ($S_x S_w / S_y$) and the zero-point adjustments are applied only once after the accumulation, often using the fixed-point requantization techniques described above.
- Activation Function Fusion: Activation functions (ReLU, sigmoid, etc.) applied after quantized operations can often be fused. For ReLU, the clamping inherent in requantization can sometimes implement the `max(0, x)` part by raising the lower clamp bound to $Z_c$. More complex functions might use lookup tables operating directly on quantized values.
- Bias Addition Fusion: Adding a floating-point bias requires scaling it appropriately to match the accumulator's scale before adding and requantizing. This scaling can be pre-computed and fused into the final stage of the kernel.
Fusion combines dequantization, computation, and requantization/quantization into a single, efficient low-precision kernel.
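The sketch below illustrates this fusion for a quantized matrix multiply: int8 inputs, int32 accumulation, bias folded into the accumulator scale, and a single rescale at the end. The final rescale uses floating point for clarity; a production kernel would use the integer $(M, s)$ form shown earlier. All names are illustrative.

```python
import numpy as np

def fused_qmatmul(qx, zx, qw, zw, bias_fp, sx, sw, sy, zy):
    """Fused dequantize + matmul + requantize, following
    S_y(q_y - Z_y) ≈ Σ S_x(q_x - Z_x) · S_w(q_w - Z_w)."""
    # Integer dot products accumulate into INT32; no float inputs anywhere.
    acc = (qx.astype(np.int32) - zx) @ (qw.astype(np.int32) - zw)
    # Bias is pre-scaled to the accumulator's scale (S_x * S_w) and added.
    acc = acc + np.round(bias_fp / (sx * sw)).astype(np.int32)
    # Single rescale by S_x*S_w/S_y, then re-center on Z_y and clamp.
    qy = np.round(acc * (sx * sw / sy)) + zy
    return np.clip(qy, -128, 127).astype(np.int8)
```

Fusing a following ReLU costs nothing extra here: it amounts to raising the lower clamp bound from $-128$ to $Z_y$.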
## Hardware Influence
The target hardware significantly influences how scales and zero points are handled.
- Specialized Instructions: CPUs (AVX-VNNI, ARM Dot Product) and GPUs (Tensor Cores, Matrix Cores) often have instructions that perform fused multiply-accumulate operations directly on low-precision integers (e.g., INT8 multiplication accumulating into INT32). Compilers must target these instructions for optimal performance.
- Power-of-Two Scales: Some hardware or software libraries might prefer or require scales to be powers of two. This simplifies the requantization multiplication by $S_a/S_c$ into an efficient bit-shift. This constraint might be enforced during the quantization process itself or handled by the compiler during lowering.
- Per-Channel vs. Per-Tensor Support: Hardware capabilities might dictate whether per-channel quantization (different scales/zero points for each output channel of a convolution filter) can be efficiently supported. Per-channel quantization often yields better accuracy but requires more complex parameter management and potentially specialized hardware support.
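A per-channel scheme changes only the final rescale: there is one multiplier per output channel, so the accumulator is rescaled column-wise, as in this illustrative sketch.

```python
import numpy as np

def requant_per_channel(acc, sx, sw_ch, sy, zy, qmin=-128, qmax=127):
    """acc: int32 accumulator of shape [rows, out_channels]; sw_ch holds
    one weight scale per output channel, so the effective multiplier
    varies per column (broadcast over rows)."""
    qy = np.round(acc * (sx * np.asarray(sw_ch) / sy)) + zy
    return np.clip(qy, qmin, qmax).astype(np.int8)
```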
## Code Generation
Ultimately, the compiler translates the high-level operations and associated quantization parameters into executable code. This involves:
- Generating sequences of integer arithmetic (multiply, add, shift) for requantization steps.
- Emitting hardware-specific low-precision instructions where available.
- Managing the storage and loading of scale and zero-point constants, potentially embedding them directly as immediate values in instructions if the architecture allows.
- Generating lookup tables for complex functions operating on quantized values.
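For the lookup-table case, the table can be built entirely at compile time by evaluating the function at all 256 possible int8 codes; the sketch below does this for sigmoid with illustrative parameters.

```python
import numpy as np

def build_sigmoid_lut(s_in, z_in, s_out, z_out):
    """Precompute sigmoid over every int8 input code; at runtime the
    activation becomes a single table lookup per element."""
    q = np.arange(-128, 128, dtype=np.int32)
    real = s_in * (q - z_in)                 # dequantize each code
    y = 1.0 / (1.0 + np.exp(-real))          # evaluate in float, once
    return np.clip(np.round(y / s_out) + z_out, -128, 127).astype(np.int8)

# Output range [0, 1) represented with scale 1/256 and zero-point -128.
lut = build_sigmoid_lut(0.1, 0, 1.0 / 256.0, -128)
qx = np.array([-128, 0, 127], dtype=np.int8)
qy = lut[qx.astype(np.int32) + 128]          # index by shifted int8 code
```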
Effectively managing scales and zero points is a complex but essential task for compilers targeting low-precision execution. It requires careful representation in the IR, sophisticated propagation and transformation rules, fusion techniques, and awareness of target hardware capabilities to balance the goals of performance improvement and accuracy preservation.