Reading and understanding MLIR (Multi-Level Intermediate Representation) is a fundamental skill for anyone optimizing ML workloads at the compiler level. MLIR's structure, with its dialects and operations, provides a framework for representing computations at various abstraction levels. An analysis of some common MLIR patterns provides practical understanding of their structure and function.
We assume you have access to tools that can display or generate MLIR. Many ML compiler frameworks (like TensorFlow with XLA enabled, IREE, or projects using LLVM/MLIR directly) can dump their MLIR representations at various stages.
Let's start with a basic tensor operation: adding two 2D tensors of floating-point numbers. In a dialect like linalg (often used for structured linear algebra operations on tensors), it might look like the following:
#map0 = affine_map<(d0, d1) -> (d0, d1)>
module {
  func.func @add_tensors(%arg0: tensor<128x256xf32>, %arg1: tensor<128x256xf32>) -> tensor<128x256xf32> {
    // Create the output tensor value (actual memory allocation happens later, during bufferization)
    %init_result = linalg.init_tensor [128, 256] : tensor<128x256xf32>
    // Perform element-wise addition using linalg.generic
    %result = linalg.generic {
        indexing_maps = [#map0, #map0, #map0], // Input A, Input B, Output
        iterator_types = ["parallel", "parallel"] // Both dimensions are parallel
      }
      ins(%arg0, %arg1 : tensor<128x256xf32>, tensor<128x256xf32>)
      outs(%init_result : tensor<128x256xf32>) {
      ^bb0(%in0: f32, %in1: f32, %out_init: f32): // Basic block for the per-element computation
        %sum = arith.addf %in0, %in1 : f32
        linalg.yield %sum : f32 // Yield the result for this element
    } -> tensor<128x256xf32>
    return %result : tensor<128x256xf32>
  }
}
Analysis Points:
- Dialects and operations: the central operation here is linalg.generic. The linalg dialect is designed for structured operations on tensors, often acting as a mid-level abstraction before lowering to loops or specific hardware intrinsics. The inner computation uses the arith.addf operation from the arith dialect.
- Types: tensor<128x256xf32> immediately tells us the shape (128x256) and element type (32-bit float). This static shape information is significant for optimization.
- Structure: func.func defines a function. linalg.init_tensor creates the output tensor value (in practice, memory management happens later and is more involved). linalg.generic is a powerful operation defining a computation over tensor elements. Its body (^bb0) specifies the scalar computation performed for each element: arith.addf performs the floating-point addition, and linalg.yield returns the computed value for that element.
- SSA values and maps: operations consume SSA values as operands (e.g., %arg0, %arg1) and produce SSA values as results (e.g., %result). Affine maps (#map0) define how loop iterators map to tensor indices, which is essential for understanding data access patterns. iterator_types specify the nature of the loops; here, "parallel" indicates no cross-iteration dependencies along either dimension. A contrasting case, where one dimension is a reduction, is sketched just after this list.
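For contrast with the all-parallel case, consider a row-wise sum, where the second dimension accumulates values and therefore carries a cross-iteration dependency. The sketch below is written in the same style as the example above; the function name @row_sum is made up, and the output %init is assumed to be zero-filled by the caller:
#map_in = affine_map<(d0, d1) -> (d0, d1)>
#map_out = affine_map<(d0, d1) -> (d0)>
module {
  func.func @row_sum(%arg0: tensor<128x256xf32>, %init: tensor<128xf32>) -> tensor<128xf32> {
    // Sum each 256-element row of the input into one element of the 1-D output
    %result = linalg.generic {
        indexing_maps = [#map_in, #map_out],
        iterator_types = ["parallel", "reduction"] // d1 accumulates, so it is not parallel
      }
      ins(%arg0 : tensor<128x256xf32>)
      outs(%init : tensor<128xf32>) {
      ^bb0(%in: f32, %acc: f32):
        // Add the current element to the running row total
        %sum = arith.addf %acc, %in : f32
        linalg.yield %sum : f32
    } -> tensor<128xf32>
    return %result : tensor<128xf32>
  }
}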
Matrix multiplication is a foundation of many ML models. A high-level representation, perhaps in a domain-specific dialect or using linalg.matmul, might appear as:
module {
  func.func @matmul(%A: tensor<32x64xf32>, %B: tensor<64x128xf32>) -> tensor<32x128xf32> {
    // Create and zero-fill the output tensor that the matmul accumulates into
    %zero = arith.constant 0.0 : f32
    %init = linalg.init_tensor [32, 128] : tensor<32x128xf32>
    %C_init = linalg.fill ins(%zero : f32) outs(%init : tensor<32x128xf32>) -> tensor<32x128xf32>
    %C = linalg.matmul ins(%A, %B : tensor<32x64xf32>, tensor<64x128xf32>)
                       outs(%C_init : tensor<32x128xf32>) -> tensor<32x128xf32>
    return %C : tensor<32x128xf32>
  }
}
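To connect the named operation with the generic form shown earlier, linalg.matmul can also be expressed as a linalg.generic over a three-dimensional iteration space (i, j, k). The sketch below is illustrative rather than actual compiler output: the function name @matmul_generic is made up, and %C_init is assumed to be zero-filled by the caller:
#map_A = affine_map<(d0, d1, d2) -> (d0, d2)> // Access A[i, k]
#map_B = affine_map<(d0, d1, d2) -> (d2, d1)> // Access B[k, j]
#map_C = affine_map<(d0, d1, d2) -> (d0, d1)> // Access C[i, j]
module {
  func.func @matmul_generic(%A: tensor<32x64xf32>, %B: tensor<64x128xf32>, %C_init: tensor<32x128xf32>) -> tensor<32x128xf32> {
    %C = linalg.generic {
        indexing_maps = [#map_A, #map_B, #map_C],
        iterator_types = ["parallel", "parallel", "reduction"] // i and j are parallel; k is a reduction
      }
      ins(%A, %B : tensor<32x64xf32>, tensor<64x128xf32>)
      outs(%C_init : tensor<32x128xf32>) {
      ^bb0(%a: f32, %b: f32, %c: f32):
        // Multiply one element from A and B, then accumulate into the running value of C[i, j]
        %prod = arith.mulf %a, %b : f32
        %sum = arith.addf %c, %prod : f32
        linalg.yield %sum : f32
    } -> tensor<32x128xf32>
    return %C : tensor<32x128xf32>
  }
}
The same #map_A, #map_B, and #map_C access patterns reappear in the lowered form below.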
If we looked at this matmul after it has been lowered toward CPU execution, with the loops made explicit (shown here with the affine dialect; scf, the Structured Control Flow dialect, is another common target), we might see something like the following, still simplified:
#map_A = affine_map<(d0, d1, d2) -> (d0, d2)> // Access A[i, k]
#map_B = affine_map<(d0, d1, d2) -> (d2, d1)> // Access B[k, j]
#map_C = affine_map<(d0, d1, d2) -> (d0, d1)> // Access C[i, j]
module {
  func.func @matmul_lowered(%argA: memref<32x64xf32>, %argB: memref<64x128xf32>, %argC: memref<32x128xf32>) {
    // Outer loops over the result dimensions (i, j)
    affine.for %i = 0 to 32 {
      affine.for %j = 0 to 128 {
        // Inner reduction loop (k), carrying the accumulator as a loop-iteration argument
        %init_acc = arith.constant 0.0 : f32
        %acc = affine.for %k = 0 to 64 iter_args(%iter_acc = %init_acc) -> (f32) {
          // Load one element from each input
          %a_val = affine.load %argA[%i, %k] : memref<32x64xf32>
          %b_val = affine.load %argB[%k, %j] : memref<64x128xf32>
          // Compute the product and accumulate
          %prod = arith.mulf %a_val, %b_val : f32
          %new_acc = arith.addf %iter_acc, %prod : f32
          affine.yield %new_acc : f32
        }
        // Store the final accumulated value
        affine.store %acc, %argC[%i, %j] : memref<32x128xf32>
      }
    }
    return
  }
}
Analysis Points (Lowered Example):
- Lowering: the high-level linalg.matmul operation is gone. We now see affine.for loops, affine.load/affine.store for memory access, and arith operations for the core computation. This represents a lowering step toward executable code.
- Memory: the types are now memref (memory reference) instead of tensor. A memref typically implies allocated memory with a specific layout, whereas a tensor is a more abstract value. The affine maps (though simplified here) become critical for analyzing memory access patterns for cache optimization. For comparison, the element-wise addition from the first example is sketched at this same level just after this list.
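To make the tensor-to-memref shift concrete, here is a sketch of what that element-wise addition might look like once it has been bufferized and lowered to loops. The function name and the convention that the caller supplies the output buffer are illustrative assumptions, not the output of a specific pipeline:
module {
  func.func @add_tensors_lowered(%argA: memref<128x256xf32>, %argB: memref<128x256xf32>, %argOut: memref<128x256xf32>) {
    // One loop per tensor dimension; both loops are parallel, matching the earlier iterator_types
    affine.for %i = 0 to 128 {
      affine.for %j = 0 to 256 {
        %a = affine.load %argA[%i, %j] : memref<128x256xf32>
        %b = affine.load %argB[%i, %j] : memref<128x256xf32>
        %sum = arith.addf %a, %b : f32
        affine.store %sum, %argOut[%i, %j] : memref<128x256xf32>
      }
    }
    return
  }
}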
When examining MLIR code generated by different tools or at different compilation stages, consider these questions:
- Which dialects are present? (e.g., mhlo for high-level graph ops, linalg for structured tensor ops, vector for SIMD, affine/scf for loops and memory, gpu for GPU specifics, llvm for the final lowering).
- What transformations have already been applied? (e.g., a single linalg.generic often results from fusion).
- How is the computation structured? Has a high-level op like matmul become nested loops? How might this structure map to hardware (e.g., parallel loops to GPU threads/blocks)?
This hands-on analysis is key. By reading and interpreting MLIR, you gain insight into how ML compilers structure computations and apply optimizations. It allows you to understand the impact of different compilation strategies and pinpoint areas for performance improvement, looking past the black-box view of ML frameworks.