Effectively optimizing low-precision models requires the compiler's intermediate representation (IR) to explicitly capture the nuances of quantization. Simply using standard integer types like i8 is insufficient, because it omits critical information about how these integers map back to the real-number domain they approximate. The IR must encode the quantization scheme, its parameters, and the operations that convert between the floating-point and quantized domains.
A robust IR representation acts as the contract between high-level model descriptions (potentially annotated with quantization information) and the low-level optimization and code generation passes. Without this explicit representation, the compiler cannot reason about the numerical properties of quantized operations or target specialized low-precision hardware instructions effectively.
The core challenge is representing the affine mapping between the real numbers (r) and the quantized integers (q):

r = s × (q − Z)

Here, s is the scale (a positive float) and Z is the zero-point (an integer within the storage range of q). Both s and Z must be associated with the quantized tensor data within the IR. Several strategies exist:
Dedicated Quantized Types: This is arguably the cleanest approach, often seen in multi-level IRs like MLIR. New types are defined that directly bundle the storage type (e.g., i8, u8) with the quantization parameters and, potentially, the "expressed" floating-point type it represents. For example:

tensor<1x256x256x3x!quant.uniform<i8:f32, 0.015:128>>

This defines a tensor with dimensions 1x256x256x3, where each element is stored as an i8. Each stored i8 represents an f32 value using a uniform affine quantization scheme (!quant.uniform) with a scale of 0.015 and a zero-point of 128.
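For instance, with these parameters a stored value of q = 100 represents r = 0.015 × (100 − 128) = −0.42 in the floating-point domain.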
Type Attributes or Metadata: An alternative is to use standard integer types (e.g., tensor<1x256x256x3xi8>) but attach the quantization parameters (scale, zero_point, and axis for per-channel) as attributes or metadata on the tensor value or on the operations producing and consuming it. For example:
// Quantize the floating-point input with per-tensor parameters.
%input_quant = MyDialect.Quantize(%input_fp32) {scale=0.015, zero_point=128, storage_type=i8} : (tensor<1x...xf32>) -> tensor<1x...xi8>
// Weights may additionally carry an axis attribute for per-channel quantization.
%weight_quant = GetQuantizedWeight() {scale=0.008, zero_point=0, storage_type=i8, axis=0} : () -> tensor<64x...xi8>
// The convolution records the parameters of its own quantized output.
%conv_output_quant = MyDialect.Conv2D(%input_quant, %weight_quant) {output_scale=0.1, output_zero_point=110, storage_type=i8} : (tensor<...xi8>, tensor<...xi8>) -> tensor<...xi8>
Models often use different quantization parameters for different slices of a tensor, most commonly per output channel of convolution or matrix-multiply weights (per-channel or per-axis quantization). The IR must support this finer granularity:
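With dedicated quantized types, per-axis parameters are typically expressed by naming the quantized dimension and listing one scale/zero-point pair per slice along it. The following MLIR-style type is an illustrative sketch (the shape and parameter values are made up) of a small weight tensor quantized along its output-channel axis (axis 0):

tensor<4x3x3x3x!quant.uniform<i8:f32:0, {0.008:0, 0.012:0, 0.010:0, 0.009:0}>>

In the attribute-based approach, the same information is carried as a list of scales and zero-points plus an axis attribute, as hinted by the axis=0 attribute on the weight in the earlier example.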
Explicit operations are needed in the IR to represent the transitions between the floating-point and quantized domains:

Quantize (quant): Takes a floating-point tensor and quantization parameters (s, Z) as input, and produces a quantized integer tensor.

q = round(r / s) + Z

The IR operation node references the input tensor and the scale/zero-point values (or the target quantized type, which implies them).

Dequantize (dequant): Takes a quantized integer tensor and its associated parameters (s, Z) as input, and produces a floating-point tensor.

r = s × (q − Z)

Similarly, the IR node references the quantized input and its parameters.

Requantize (requant): Takes a quantized tensor (often an intermediate result with higher precision, e.g., an i32 accumulator) together with parameters for both the input and the desired output types. It performs the scale adjustment and any down-casting entirely in the quantized domain, avoiding a costly round-trip through floating-point.

q_out = round((s_in / s_out) × (q_in − Z_in)) + Z_out

This operation is fundamental for optimizing sequences of quantized computations, especially convolutions and matrix multiplications where intermediate accumulations occur. Below is a conceptual diagram showing how these operations might appear in a graph IR fragment:
A conceptual flow showing quantization of input, a quantized convolution producing a higher-precision accumulator, requantization back to INT8, and final dequantization to FP32.
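To make the arithmetic of these three operations concrete, the sketch below implements them with NumPy. The function names, the int8 storage type, and the clamping behavior are illustrative choices rather than the interface of any particular compiler; production kernels also commonly replace the floating-point scale ratio in requantization with an equivalent fixed-point multiply. In the usage lines at the end, the accumulator's scale is taken as the product of the input and weight scales from the earlier example, a common convention for integer convolution accumulators.

import numpy as np

def quantize(r, scale, zero_point, dtype=np.int8):
    # q = round(r / s) + Z, clamped to the storage type's representable range.
    info = np.iinfo(dtype)
    q = np.round(r / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize(q, scale, zero_point):
    # r = s * (q - Z), mapping stored integers back to the expressed float domain.
    return scale * (q.astype(np.float32) - zero_point)

def requantize(q_in, s_in, z_in, s_out, z_out, dtype=np.int8):
    # q_out = round((s_in / s_out) * (q_in - Z_in)) + Z_out: the scale ratio is
    # applied directly to the integer values instead of dequantizing and then
    # quantizing in two separate steps.
    info = np.iinfo(dtype)
    q_out = np.round((s_in / s_out) * (q_in.astype(np.int64) - z_in)) + z_out
    return np.clip(q_out, info.min, info.max).astype(dtype)

# Illustrative use: rescale an int32 accumulator to the int8 output parameters above.
acc = np.array([12345, -6789], dtype=np.int32)
out = requantize(acc, s_in=0.015 * 0.008, z_in=0, s_out=0.1, z_out=110)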
Having this explicit representation allows the compiler to: fuse quantize -> op -> dequantize sequences into dedicated quantized kernel calls; fuse requantize operations with preceding computations; and map quantized operations (such as MyDialect.Conv2D operating on quantized types) to specific low-precision hardware instructions (e.g., INT8 dot products) or optimized library calls (e.g., cuDNN, MIOpen, oneDNN).

In summary, representing quantized operations and their associated parameters directly and explicitly within the compiler's IR is fundamental. Whether through dedicated types or attribute systems, this representation provides the necessary information for sophisticated optimization passes to analyze, transform, and generate highly efficient low-precision code tailored for modern hardware. It bridges the gap between the high-level intent of using quantization and the low-level execution details required for performance.