Reading and understanding MLIR (Multi-Level Intermediate Representation) is a fundamental skill for anyone working on optimizing ML workloads at the compiler level. As we've discussed, MLIR's structure, with its dialects and operations, provides a framework for representing computations at various abstraction levels. This practical section will guide you through analyzing some common MLIR patterns, helping you connect the theoretical concepts to concrete code.
We assume you have access to tools that can display or generate MLIR. Many ML compiler frameworks (like TensorFlow with XLA enabled, IREE, or projects using LLVM/MLIR directly) can dump their MLIR representations at various stages.
Let's start with a basic tensor operation: adding two 2D tensors of floating-point numbers. In a dialect like linalg (often used for linear algebra operations on tensors), this might look something like this:
#map0 = affine_map<(d0, d1) -> (d0, d1)>
module {
  func.func @add_tensors(%arg0: tensor<128x256xf32>, %arg1: tensor<128x256xf32>) -> tensor<128x256xf32> {
    // Allocate memory for the result (simplified representation)
    %init_result = linalg.init_tensor [128, 256] : tensor<128x256xf32>
    // Perform element-wise addition using linalg.generic
    %result = linalg.generic {
        indexing_maps = [#map0, #map0, #map0],    // Input A, Input B, Output
        iterator_types = ["parallel", "parallel"] // Both dimensions are parallel
      }
      ins(%arg0, %arg1 : tensor<128x256xf32>, tensor<128x256xf32>)
      outs(%init_result : tensor<128x256xf32>) {
    ^bb0(%in0: f32, %in1: f32, %out_init: f32):   // Basic block for the inner computation
      %sum = arith.addf %in0, %in1 : f32
      linalg.yield %sum : f32                     // Yield the result for this element
    } -> tensor<128x256xf32>
    return %result : tensor<128x256xf32>
  }
}
Analysis Points:

- The element-wise addition is expressed with linalg.generic. The linalg dialect is designed for structured operations on tensors, often acting as a mid-level abstraction before lowering to loops or specific hardware intrinsics. The inner computation uses the arith.addf operation from the arith dialect.
- Both operands and the result have type tensor<128x256xf32>. This immediately tells us the shape (128x256) and the element type (32-bit float). This static shape information is significant for optimization.
- func.func defines a function. linalg.init_tensor conceptually allocates space for the result (in practice, memory management is more complex). linalg.generic is a powerful operation defining a computation over tensor elements: its body (^bb0) specifies the scalar computation performed for each element, arith.addf performs the floating-point addition, and linalg.yield returns the computed value for the element.
- Operations take SSA values as operands (e.g., %arg0, %arg1) and produce SSA values as results (e.g., %result).
- Affine maps (#map0) define how loop iterators map to tensor indices, essential for understanding data access patterns. iterator_types specify the nature of the loops (here, "parallel" indicates no cross-iteration dependencies along those dimensions). The sketch below shows how changing the indexing maps changes the access pattern.
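To make the role of the indexing maps concrete, here is a minimal sketch (not taken from the example above; the function name @add_bias and the shapes are assumptions) that adds a 1D bias vector to every row of a 2D tensor. Only the map attached to the bias operand changes, yet it turns an element-wise access into a broadcast:

#map_2d   = affine_map<(d0, d1) -> (d0, d1)>
#map_bias = affine_map<(d0, d1) -> (d1)>      // Bias is indexed by the column only
module {
  func.func @add_bias(%x: tensor<128x256xf32>, %bias: tensor<256xf32>) -> tensor<128x256xf32> {
    %init = linalg.init_tensor [128, 256] : tensor<128x256xf32>
    // The second operand uses #map_bias, so the same bias element is reused for every row
    %y = linalg.generic {
        indexing_maps = [#map_2d, #map_bias, #map_2d],
        iterator_types = ["parallel", "parallel"]
      }
      ins(%x, %bias : tensor<128x256xf32>, tensor<256xf32>)
      outs(%init : tensor<128x256xf32>) {
    ^bb0(%xe: f32, %be: f32, %out: f32):
      %sum = arith.addf %xe, %be : f32
      linalg.yield %sum : f32
    } -> tensor<128x256xf32>
    return %y : tensor<128x256xf32>
  }
}

Reading the indexing maps side by side is often the quickest way to tell an element-wise operation apart from a broadcast, transpose, or reduction.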
Matrix multiplication is a cornerstone of many ML models. A high-level representation, perhaps in a domain-specific dialect or using linalg.matmul, might appear as:
module {
  func.func @matmul(%A: tensor<32x64xf32>, %B: tensor<64x128xf32>) -> tensor<32x128xf32> {
    // Create and zero-fill the output tensor that the matmul accumulates into
    %zero = arith.constant 0.0 : f32
    %init = linalg.init_tensor [32, 128] : tensor<32x128xf32>
    %C_init = linalg.fill ins(%zero : f32) outs(%init : tensor<32x128xf32>) -> tensor<32x128xf32>
    %C = linalg.matmul ins(%A, %B : tensor<32x64xf32>, tensor<64x128xf32>)
                       outs(%C_init : tensor<32x128xf32>) -> tensor<32x128xf32>
    return %C : tensor<32x128xf32>
  }
}
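Named ops like linalg.matmul are shorthand for a linalg.generic with a reduction iterator. As a hedged sketch (the function name @matmul_generic is an assumption, and the output is taken as an already zero-filled operand), the same contraction written explicitly looks like this; note how the maps mirror #map_A, #map_B, and #map_C in the lowered example further below:

#map_a = affine_map<(d0, d1, d2) -> (d0, d2)>  // A[i, k]
#map_b = affine_map<(d0, d1, d2) -> (d2, d1)>  // B[k, j]
#map_c = affine_map<(d0, d1, d2) -> (d0, d1)>  // C[i, j]
module {
  func.func @matmul_generic(%A: tensor<32x64xf32>, %B: tensor<64x128xf32>,
                            %C_init: tensor<32x128xf32>) -> tensor<32x128xf32> {
    // "reduction" on the third iterator (k) marks it as a summation dimension
    %C = linalg.generic {
        indexing_maps = [#map_a, #map_b, #map_c],
        iterator_types = ["parallel", "parallel", "reduction"]
      }
      ins(%A, %B : tensor<32x64xf32>, tensor<64x128xf32>)
      outs(%C_init : tensor<32x128xf32>) {
    ^bb0(%a: f32, %b: f32, %c: f32):
      %prod = arith.mulf %a, %b : f32
      %sum = arith.addf %c, %prod : f32
      linalg.yield %sum : f32
    } -> tensor<32x128xf32>
    return %C : tensor<32x128xf32>
  }
}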
If we were to look at this after lowering to dialects like affine and scf (Structured Control Flow), preparing for CPU execution, we might see something conceptually like this (highly simplified):
// Affine maps documenting how the (i, j, k) loop indices map to element accesses
#map_A = affine_map<(d0, d1, d2) -> (d0, d2)> // Access A[i, k]
#map_B = affine_map<(d0, d1, d2) -> (d2, d1)> // Access B[k, j]
#map_C = affine_map<(d0, d1, d2) -> (d0, d1)> // Access C[i, j]
module {
  func.func @matmul_lowered(%argA: memref<32x64xf32>, %argB: memref<64x128xf32>, %argC: memref<32x128xf32>) {
    // Outer loops for result dimensions (i, j)
    affine.for %i = 0 to 32 {
      affine.for %j = 0 to 128 {
        // Inner reduction loop (k), carrying the accumulator as an iter_arg
        %init_acc = arith.constant 0.0 : f32
        %acc = affine.for %k = 0 to 64 iter_args(%iter_acc = %init_acc) -> (f32) {
          // Load elements from the input buffers
          %a_val = affine.load %argA[%i, %k] : memref<32x64xf32>
          %b_val = affine.load %argB[%k, %j] : memref<64x128xf32>
          // Compute product and accumulate
          %prod = arith.mulf %a_val, %b_val : f32
          %new_acc = arith.addf %iter_acc, %prod : f32
          affine.yield %new_acc : f32
        }
        // Store the final accumulated value
        affine.store %acc, %argC[%i, %j] : memref<32x128xf32>
      }
    }
    return
  }
}
Analysis Points (Lowered Example):

- The high-level linalg.matmul operation is gone. We now see explicit loops (affine.for here; scf.for is the other common choice), affine.load/affine.store for memory access (using memref types instead of tensor), and arith operations for the core computation. This represents a lowering step.
- The function now operates on memref (memory reference) values instead of tensor values. A memref typically implies allocated memory with a specific layout, whereas a tensor is more abstract; the sketch after this list makes that difference concrete. The affine maps (though simplified here) become critical for analyzing memory access patterns for cache optimization.
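As a small, hedged illustration of that difference (the function name and shape are assumptions, not part of the lowering above), a buffer-level function makes allocation and deallocation explicit in the IR:

module {
  func.func @buffer_lifetime_example() {
    // memref values name explicit storage: allocation (and eventual
    // deallocation) is visible in the IR, unlike abstract tensor values
    %buf = memref.alloc() : memref<32x128xf32>
    // ... loops writing results into %buf would appear here ...
    memref.dealloc %buf : memref<32x128xf32>
    return
  }
}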
When examining MLIR code generated by different tools or at different compilation stages, consider these questions:
- Which dialects are present, and what abstraction level do they indicate? (e.g., mhlo for high-level graph ops, linalg for structured tensor ops, vector for SIMD, affine/scf for loops/memory, gpu for GPU specifics, llvm for final lowering). A small vector-dialect sketch follows this list.
- Which operations do the actual work, and do they show signs of earlier optimization passes? (e.g., a linalg.generic with a complex body often results from fusion).
- How has the computation been restructured between stages? For example, how did matmul
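To illustrate the vector/SIMD question, here is a hedged sketch of what the inner loop of the earlier element-wise addition might look like after vectorization (the 8-wide vectors, the function name, and the one-row-at-a-time structure are assumptions):

module {
  func.func @add_row_vectorized(%A: memref<128x256xf32>, %B: memref<128x256xf32>,
                                %C: memref<128x256xf32>, %i: index) {
    %c0 = arith.constant 0 : index
    %c8 = arith.constant 8 : index
    %c256 = arith.constant 256 : index
    // Process one row in chunks of 8 contiguous f32 elements
    scf.for %j = %c0 to %c256 step %c8 {
      %va = vector.load %A[%i, %j] : memref<128x256xf32>, vector<8xf32>
      %vb = vector.load %B[%i, %j] : memref<128x256xf32>, vector<8xf32>
      %vsum = arith.addf %va, %vb : vector<8xf32>
      vector.store %vsum, %C[%i, %j] : memref<128x256xf32>, vector<8xf32>
    }
    return
  }
}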
become nested loops? How might this map to hardware (e.g., parallel loops to GPU threads/blocks)?

This hands-on analysis is vital. By reading and interpreting MLIR, you gain insight into how ML compilers structure computations and apply optimizations. It allows you to understand the impact of different compilation strategies and pinpoint areas for performance improvement, moving beyond the black-box view of ML frameworks.