Reading and understanding MLIR (Multi-Level Intermediate Representation) is a fundamental skill for anyone working on optimizing ML workloads at the compiler level. As we've discussed, MLIR's structure, with its dialects and operations, provides a framework for representing computations at various abstraction levels. This practical section will guide you through analyzing some common MLIR patterns, helping you connect the theoretical concepts to concrete code.
We assume you have access to tools that can display or generate MLIR. Many ML compiler frameworks (like TensorFlow with XLA enabled, IREE, or projects using LLVM/MLIR directly) can dump their MLIR representations at various stages.
Let's start with a basic tensor operation: adding two 2D tensors of floating-point numbers. In a dialect like linalg (often used for linear algebra operations on tensors), this might look something like this:
#map0 = affine_map<(d0, d1) -> (d0, d1)>
module {
  func.func @add_tensors(%arg0: tensor<128x256xf32>, %arg1: tensor<128x256xf32>) -> tensor<128x256xf32> {
    // Allocate memory for the result (simplified representation)
    %init_result = linalg.init_tensor [128, 256] : tensor<128x256xf32>
    // Perform element-wise addition using linalg.generic
    %result = linalg.generic {
        indexing_maps = [#map0, #map0, #map0],    // Input A, Input B, Output
        iterator_types = ["parallel", "parallel"] // Both dimensions are parallel
      }
      ins(%arg0, %arg1 : tensor<128x256xf32>, tensor<128x256xf32>)
      outs(%init_result : tensor<128x256xf32>) {
    ^bb0(%in0: f32, %in1: f32, %out_init: f32):   // Basic block for the inner computation
      %sum = arith.addf %in0, %in1 : f32
      linalg.yield %sum : f32                     // Yield the result for this element
    } -> tensor<128x256xf32>
    return %result : tensor<128x256xf32>
  }
}
Analysis Points:

- The element-wise addition is expressed with linalg.generic. The linalg dialect is designed for structured operations on tensors, often acting as a mid-level abstraction before lowering to loops or specific hardware intrinsics. The inner computation uses the arith.addf operation from the arith dialect.
- Both operands and the result have type tensor<128x256xf32>. This immediately tells us the shape (128x256) and the element type (32-bit float). This static shape information is significant for optimization.
- func.func defines a function. linalg.init_tensor conceptually allocates space for the result (in practice, memory management is more complex). linalg.generic is a powerful operation defining a computation over tensor elements: its body (^bb0) specifies the scalar computation performed for each element, arith.addf performs the floating-point addition, and linalg.yield returns the computed value for the element.
- Operations take SSA values as operands (e.g., %arg0, %arg1) and produce SSA values as results (e.g., %result).
- Affine maps (#map0) define how loop iterators map to tensor indices, essential for understanding data access patterns. iterator_types specify the nature of the loops (here, "parallel" indicates no cross-iteration dependencies along those dimensions). The sketch below shows how changing the indexing maps changes the access pattern.
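To make the role of the indexing maps concrete, here is a minimal sketch (not taken from the example above; the function name @add_bias and the shapes are assumptions) that adds a 1D bias vector to every row of a 2D tensor. Only the map attached to the bias operand changes, yet it turns an element-wise access into a broadcast:

#map_2d   = affine_map<(d0, d1) -> (d0, d1)>
#map_bias = affine_map<(d0, d1) -> (d1)>      // Bias is indexed by the column only
module {
  func.func @add_bias(%x: tensor<128x256xf32>, %bias: tensor<256xf32>) -> tensor<128x256xf32> {
    %init = linalg.init_tensor [128, 256] : tensor<128x256xf32>
    // The second operand uses #map_bias, so the same bias element is reused for every row
    %y = linalg.generic {
        indexing_maps = [#map_2d, #map_bias, #map_2d],
        iterator_types = ["parallel", "parallel"]
      }
      ins(%x, %bias : tensor<128x256xf32>, tensor<256xf32>)
      outs(%init : tensor<128x256xf32>) {
    ^bb0(%xe: f32, %be: f32, %out: f32):
      %sum = arith.addf %xe, %be : f32
      linalg.yield %sum : f32
    } -> tensor<128x256xf32>
    return %y : tensor<128x256xf32>
  }
}

Reading the indexing maps side by side is often the quickest way to tell an element-wise operation apart from a broadcast, transpose, or reduction.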
Matrix multiplication is a cornerstone of many ML models. A high-level representation, perhaps in a domain-specific dialect or using linalg.matmul, might appear as:
module {
  func.func @matmul(%A: tensor<32x64xf32>, %B: tensor<64x128xf32>) -> tensor<32x128xf32> {
    // Create and zero-fill the output tensor that the matmul accumulates into
    %zero = arith.constant 0.0 : f32
    %init = linalg.init_tensor [32, 128] : tensor<32x128xf32>
    %C_init = linalg.fill ins(%zero : f32) outs(%init : tensor<32x128xf32>) -> tensor<32x128xf32>
    %C = linalg.matmul ins(%A, %B : tensor<32x64xf32>, tensor<64x128xf32>)
                       outs(%C_init : tensor<32x128xf32>) -> tensor<32x128xf32>
    return %C : tensor<32x128xf32>
  }
}
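Named ops like linalg.matmul are shorthand for a linalg.generic with a reduction iterator. As a hedged sketch (the function name @matmul_generic is an assumption, and the output is taken as an already zero-filled operand), the same contraction written explicitly looks like this; note how the maps mirror #map_A, #map_B, and #map_C in the lowered example further below:

#map_a = affine_map<(d0, d1, d2) -> (d0, d2)>  // A[i, k]
#map_b = affine_map<(d0, d1, d2) -> (d2, d1)>  // B[k, j]
#map_c = affine_map<(d0, d1, d2) -> (d0, d1)>  // C[i, j]
module {
  func.func @matmul_generic(%A: tensor<32x64xf32>, %B: tensor<64x128xf32>,
                            %C_init: tensor<32x128xf32>) -> tensor<32x128xf32> {
    // "reduction" on the third iterator (k) marks it as a summation dimension
    %C = linalg.generic {
        indexing_maps = [#map_a, #map_b, #map_c],
        iterator_types = ["parallel", "parallel", "reduction"]
      }
      ins(%A, %B : tensor<32x64xf32>, tensor<64x128xf32>)
      outs(%C_init : tensor<32x128xf32>) {
    ^bb0(%a: f32, %b: f32, %c: f32):
      %prod = arith.mulf %a, %b : f32
      %sum = arith.addf %c, %prod : f32
      linalg.yield %sum : f32
    } -> tensor<32x128xf32>
    return %C : tensor<32x128xf32>
  }
}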
If we were to look at this after lowering to dialects like affine and scf (Structured Control Flow), preparing for CPU execution, we might see something conceptually like this (highly simplified):
// Affine maps documenting how the (i, j, k) loop indices map to element accesses
#map_A = affine_map<(d0, d1, d2) -> (d0, d2)> // Access A[i, k]
#map_B = affine_map<(d0, d1, d2) -> (d2, d1)> // Access B[k, j]
#map_C = affine_map<(d0, d1, d2) -> (d0, d1)> // Access C[i, j]
module {
  func.func @matmul_lowered(%argA: memref<32x64xf32>, %argB: memref<64x128xf32>, %argC: memref<32x128xf32>) {
    // Outer loops for result dimensions (i, j)
    affine.for %i = 0 to 32 {
      affine.for %j = 0 to 128 {
        // Inner reduction loop (k), carrying the accumulator as an iter_arg
        %init_acc = arith.constant 0.0 : f32
        %acc = affine.for %k = 0 to 64 iter_args(%iter_acc = %init_acc) -> (f32) {
          // Load elements from the input buffers
          %a_val = affine.load %argA[%i, %k] : memref<32x64xf32>
          %b_val = affine.load %argB[%k, %j] : memref<64x128xf32>
          // Compute product and accumulate
          %prod = arith.mulf %a_val, %b_val : f32
          %new_acc = arith.addf %iter_acc, %prod : f32
          affine.yield %new_acc : f32
        }
        // Store the final accumulated value
        affine.store %acc, %argC[%i, %j] : memref<32x128xf32>
      }
    }
    return
  }
}
Analysis Points (Lowered Example):

- The high-level linalg.matmul operation is gone. We now see explicit loops (affine.for here; scf.for is the other common choice), affine.load/affine.store for memory access (using memref types instead of tensor), and arith operations for the core computation. This represents a lowering step.
- The function now operates on memref (memory reference) values instead of tensor values. A memref typically implies allocated memory with a specific layout, whereas a tensor is more abstract; the sketch after this list makes that difference concrete. The affine maps (though simplified here) become critical for analyzing memory access patterns for cache optimization.
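As a small, hedged illustration of that difference (the function name and shape are assumptions, not part of the lowering above), a buffer-level function makes allocation and deallocation explicit in the IR:

module {
  func.func @buffer_lifetime_example() {
    // memref values name explicit storage: allocation (and eventual
    // deallocation) is visible in the IR, unlike abstract tensor values
    %buf = memref.alloc() : memref<32x128xf32>
    // ... loops writing results into %buf would appear here ...
    memref.dealloc %buf : memref<32x128xf32>
    return
  }
}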
When examining MLIR code generated by different tools or at different compilation stages, consider these questions:
- Which dialects are present, and what abstraction level do they indicate? (e.g., mhlo for high-level graph ops, linalg for structured tensor ops, vector for SIMD, affine/scf for loops/memory, gpu for GPU specifics, llvm for final lowering). A small vector-dialect sketch follows this list.
- Which operations do the actual work, and do they show signs of earlier optimization passes? (e.g., a linalg.generic with a complex body often results from fusion).
- How has the computation been restructured between stages? For example, how did matmul
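To illustrate the vector/SIMD question, here is a hedged sketch of what the inner loop of the earlier element-wise addition might look like after vectorization (the 8-wide vectors, the function name, and the one-row-at-a-time structure are assumptions):

module {
  func.func @add_row_vectorized(%A: memref<128x256xf32>, %B: memref<128x256xf32>,
                                %C: memref<128x256xf32>, %i: index) {
    %c0 = arith.constant 0 : index
    %c8 = arith.constant 8 : index
    %c256 = arith.constant 256 : index
    // Process one row in chunks of 8 contiguous f32 elements
    scf.for %j = %c0 to %c256 step %c8 {
      %va = vector.load %A[%i, %j] : memref<128x256xf32>, vector<8xf32>
      %vb = vector.load %B[%i, %j] : memref<128x256xf32>, vector<8xf32>
      %vsum = arith.addf %va, %vb : vector<8xf32>
      vector.store %vsum, %C[%i, %j] : memref<128x256xf32>, vector<8xf32>
    }
    return
  }
}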
become nested loops? How might this map to hardware (e.g., parallel loops to GPU threads/blocks)?

This hands-on analysis is vital. By reading and interpreting MLIR, you gain insight into how ML compilers structure computations and apply optimizations. It allows you to understand the impact of different compilation strategies and pinpoint areas for performance improvement, moving beyond the black-box view of ML frameworks.