Linear algebra serves as the mathematical engine for deep learning, yet representing these operations in a compiler is surprisingly difficult. Traditional intermediate representations (IRs) often lower matrix operations into loops too early, obscuring the geometric structure required for advanced optimizations like tiling and vectorization. The MLIR linalg dialect solves this by treating linear algebra operations as first-class citizens. It provides a structured abstraction that separates the control flow of loops from the actual mathematical computation. This separation allows the compiler to reason about iteration spaces and data layouts mathematically before generating concrete code. In this section, we examine the `linalg.generic` operation, the system of affine maps that defines data access, and how this infrastructure enables high-performance code generation.

## The Philosophy of Structured Operations

The core design principle of the linalg dialect is the "structured operation." Unlike a standard function call or a loop nest, a structured operation explicitly declares its input and output requirements, its iteration space, and the algebraic properties of its loop dimensions.

In standard LLVM IR, a matrix multiplication is just a sequence of loads, stores, and arithmetic instructions buried inside three nested loops. In linalg, it is a single operation that declares:

- **Inputs and Outputs:** which tensors or memory buffers are read and written.
- **Indexing Maps:** how the loop indices map to the dimensions of the data.
- **Iterator Types:** which loops are parallelizable and which perform reductions.

This higher level of abstraction allows the compiler to perform aggressive transformations, such as fusing a ReLU activation into a matrix multiplication, without complex dependence analysis. The compiler knows exactly which memory regions are accessed and how, making transformations valid by construction.

```dot
digraph LinalgStructure {
  rankdir=LR;
  node [shape=box, style=filled, fontname="Helvetica", color="#dee2e6", fillcolor="#f8f9fa"];
  edge [fontname="Helvetica", color="#adb5bd"];

  Inputs [label="Input Tensors\n(A, B)", fillcolor="#a5d8ff"];
  Maps [label="Indexing Maps\n(Affine Transforms)", fillcolor="#b197fc"];
  Iterators [label="Iterator Types\n(Parallel, Reduction)", fillcolor="#ffc9c9"];
  Generic [label="linalg.generic\nOperation", fillcolor="#69db7c", penwidth=2];
  Payload [label="Payload Region\n(Scalar Math)", shape=note, fillcolor="#ffec99"];
  Output [label="Output Tensor\n(C)", fillcolor="#a5d8ff"];

  Inputs -> Generic;
  Maps -> Generic;
  Iterators -> Generic;
  Generic -> Output;
  Generic -> Payload [style=dashed, label="yields"];
}
```

*Structure of a `linalg.generic` operation, highlighting the separation between data definitions and the computation payload.*

## The linalg.generic Operation

The `linalg.generic` op is the general form from which all other named operations (like `linalg.matmul` or `linalg.conv_2d`) are derived. Understanding `generic` is essential because the compiler canonicalizes named operations into this form during optimization.
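For contrast with the fully explicit form developed below, here is the same computation written as a named op. This is a minimal sketch (the function name is ours for illustration); `linalg.matmul` and the `mlir-opt` pass `--linalg-generalize-named-ops`, which rewrites named ops into their `linalg.generic` equivalent, are standard.

```mlir
// Named form: one line of IR. The indexing maps and iterator types
// are implied by the op definition rather than spelled out.
func.func @matmul_named(%A: tensor<128x128xf32>, %B: tensor<128x128xf32>,
                        %C_init: tensor<128x128xf32>) -> tensor<128x128xf32> {
  %result = linalg.matmul
      ins(%A, %B : tensor<128x128xf32>, tensor<128x128xf32>)
      outs(%C_init : tensor<128x128xf32>) -> tensor<128x128xf32>
  return %result : tensor<128x128xf32>
}
```

Running `mlir-opt --linalg-generalize-named-ops` over this function yields the explicit `generic` form we construct by hand below.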
A `linalg.generic` operation defines a computation over a loop nest. It requires specific attributes to guide the compiler.

### Indexing Maps

Indexing maps are affine maps that define how loop iteration indices translate to tensor coordinates. They are written in the form $(d_0, d_1, \dots) \to (e_0, e_1, \dots)$, mapping loop indices on the left to tensor coordinates on the right.

Consider a matrix multiplication $$C_{ij} = \sum_k A_{ik} B_{kj}.$$ There are three iteration variables, $i$, $j$, and $k$:

- $A$ is accessed as $A[i, k]$.
- $B$ is accessed as $B[k, j]$.
- $C$ is accessed as $C[i, j]$.

In MLIR, we define three affine maps to represent these accesses relative to the iteration domain $(i, j, k)$:

- Map for $A$: $(i, j, k) \to (i, k)$
- Map for $B$: $(i, j, k) \to (k, j)$
- Map for $C$: $(i, j, k) \to (i, j)$

### Iterator Types

The dialect distinguishes between dimensions that can be executed in parallel and those that enforce an ordering (reductions). For matrix multiplication, $i$ and $j$ are parallel iterators because each element of $C$ can be computed independently. The $k$ dimension is a reduction iterator because it accumulates values into the same output element.

### The Payload Region

The body of the operation (its region) contains the scalar implementation: it describes what happens to a single element. For matmul, this is a multiply-accumulate operation.

Here is how a matrix multiplication looks in `linalg.generic` form:

```mlir
#map_a = affine_map<(i, j, k) -> (i, k)>
#map_b = affine_map<(i, j, k) -> (k, j)>
#map_c = affine_map<(i, j, k) -> (i, j)>

func.func @matmul_generic(%A: tensor<128x128xf32>, %B: tensor<128x128xf32>,
                          %C_init: tensor<128x128xf32>) -> tensor<128x128xf32> {
  %result = linalg.generic {
    indexing_maps = [#map_a, #map_b, #map_c],
    iterator_types = ["parallel", "parallel", "reduction"]
  } ins(%A, %B : tensor<128x128xf32>, tensor<128x128xf32>)
    outs(%C_init : tensor<128x128xf32>) {
  ^bb0(%a_elem: f32, %b_elem: f32, %c_elem: f32):
    // The scalar computation payload: multiply-accumulate.
    %prod = arith.mulf %a_elem, %b_elem : f32
    %sum = arith.addf %c_elem, %prod : f32
    linalg.yield %sum : f32
  } -> tensor<128x128xf32>
  return %result : tensor<128x128xf32>
}
```

In this snippet, `ins` and `outs` define the data operands. The region `^bb0` receives scalar elements selected by the affine maps. The `linalg.yield` takes the computed value and updates the output tensor. Because the op operates on tensors, it declares its result type (`-> tensor<128x128xf32>`) after the region and produces a new tensor value rather than mutating `%C_init`.

## Tiling and Fusion Strategy

One of the primary advantages of linalg is that tiling is implemented as a transformation on the IR itself, rather than as one-off loop-generation logic. Tiling a linalg op produces a loop nest (often using the scf, or Structured Control Flow, dialect) in which the inner body is a smaller linalg op representing the tile.

This recursive definition preserves the semantics of the operation at every level of the loop nest. If you tile a matrix multiplication, the inner kernel is still a matrix multiplication, just on smaller views of the data; the sketch after the figure below shows the resulting IR shape.

### Visualizing Tiling Interactions

When we tile an operation, we essentially introduce new loops that iterate over blocks of data. The linalg infrastructure calculates the necessary sub-views (slices) of the input tensors automatically, based on the affine maps provided in the definition.

```json
{
  "layout": {
    "title": "Memory Access Pattern for Tiled MatMul (Block 32x32)",
    "xaxis": {"title": "Column Index (N)", "showgrid": false, "zeroline": false},
    "yaxis": {"title": "Row Index (M)", "showgrid": false, "zeroline": false, "autorange": "reversed"},
    "plot_bgcolor": "#ffffff",
    "width": 600,
    "height": 500
  },
  "data": [
    {
      "z": [
        [1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 2, 2, 0, 0, 0, 0],
        [0, 0, 2, 2, 0, 0, 0, 0],
        [0, 0, 0, 0, 3, 3, 0, 0],
        [0, 0, 0, 0, 3, 3, 0, 0],
        [0, 0, 0, 0, 0, 0, 4, 4],
        [0, 0, 0, 0, 0, 0, 4, 4]
      ],
      "type": "heatmap",
      "colorscale": [
        [0, "#e9ecef"],
        [0.25, "#74c0fc"],
        [0.5, "#63e6be"],
        [0.75, "#ffd43b"],
        [1, "#ff8787"]
      ],
      "showscale": false
    }
  ]
}
```

*Tiling creates distinct access blocks. The heatmap illustrates four distinct tiles being processed, showing how the global operation is decomposed into local structured operations.*
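To make this concrete, here is roughly what the IR looks like after tiling the matmul with a 32×32×32 tile size. This is a hand-written sketch of the shape that linalg tiling produces (real compiler output differs in names and attributes): the outer `scf.for` loops walk over tiles, `tensor.extract_slice` materializes the sub-views derived from the indexing maps, and the inner op is still a matmul.

```mlir
func.func @matmul_tiled(%A: tensor<128x128xf32>, %B: tensor<128x128xf32>,
                        %C_init: tensor<128x128xf32>) -> tensor<128x128xf32> {
  %c0 = arith.constant 0 : index
  %c32 = arith.constant 32 : index
  %c128 = arith.constant 128 : index
  // Outer loops walk over 32x32 tiles of C; the k loop sweeps 32-wide
  // panels of A and B. The running result is threaded through iter_args.
  %res = scf.for %i = %c0 to %c128 step %c32
      iter_args(%acc_i = %C_init) -> (tensor<128x128xf32>) {
    %res_j = scf.for %j = %c0 to %c128 step %c32
        iter_args(%acc_j = %acc_i) -> (tensor<128x128xf32>) {
      %res_k = scf.for %k = %c0 to %c128 step %c32
          iter_args(%acc_k = %acc_j) -> (tensor<128x128xf32>) {
        // Slices follow the indexing maps: A[i:i+32, k:k+32], B[k:k+32, j:j+32].
        %a = tensor.extract_slice %A[%i, %k] [32, 32] [1, 1]
            : tensor<128x128xf32> to tensor<32x32xf32>
        %b = tensor.extract_slice %B[%k, %j] [32, 32] [1, 1]
            : tensor<128x128xf32> to tensor<32x32xf32>
        %c = tensor.extract_slice %acc_k[%i, %j] [32, 32] [1, 1]
            : tensor<128x128xf32> to tensor<32x32xf32>
        // The inner kernel is still a matmul, just on smaller views.
        %t = linalg.matmul ins(%a, %b : tensor<32x32xf32>, tensor<32x32xf32>)
                           outs(%c : tensor<32x32xf32>) -> tensor<32x32xf32>
        %u = tensor.insert_slice %t into %acc_k[%i, %j] [32, 32] [1, 1]
            : tensor<32x32xf32> into tensor<128x128xf32>
        scf.yield %u : tensor<128x128xf32>
      }
      scf.yield %res_k : tensor<128x128xf32>
    }
    scf.yield %res_j : tensor<128x128xf32>
  }
  return %res : tensor<128x128xf32>
}
```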
## Operator Fusion

Linalg excels at operator fusion. Because the `generic` op exposes the element-wise access pattern, the compiler can fuse a producer operation (like a `linalg.add`) directly into the consumer operation (like a `linalg.matmul`).

Fusion in linalg is generally achieved by "tiling the consumer and fusing the producer." The compiler tiles the consumer operation first; then, for each tile of the consumer, it computes only the slice of the producer required for that tile. This improves cache locality by computing values immediately before they are consumed, keeping data in registers or the L1 cache.

## Bufferization: From Tensors to Memrefs

MLIR allows operations to work on `tensor` types (immutable values, useful for mathematical reasoning) or `memref` types (mutable memory buffers, representing physical hardware memory). High-level optimization usually happens on tensors; hardware, however, executes instructions on memory. The process of converting tensor-based linalg operations to buffer-based operations is called bufferization.

- **One-Shot Bufferization:** Modern MLIR uses a "one-shot" analysis that processes the entire module. It determines where in-place updates are safe (reusing input memory for the output) and where copies are necessary to preserve value semantics.
- **Destination-Passing Style:** Linalg operations on tensors take an "out" operand (like `%C_init` in the example above). In tensor space, this acts as an initialization value; in buffer space, it becomes the memory buffer where the result is written.

Once an operation is bufferized, it can be lowered to loops (`scf.for`) and eventually to the LLVM dialect for machine-code generation. The linalg dialect effectively acts as the bridge between the abstract definition of a tensor program and the concrete reality of pointer arithmetic.
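To illustrate the end state, here is the earlier matmul rewritten on memrefs, hand-written to mirror what one-shot bufferization (`mlir-opt --one-shot-bufferize`) produces when its analysis determines that the output buffer can be updated in place:

```mlir
#map_a = affine_map<(i, j, k) -> (i, k)>
#map_b = affine_map<(i, j, k) -> (k, j)>
#map_c = affine_map<(i, j, k) -> (i, j)>

// Same computation as @matmul_generic, but on mutable buffers. The op
// no longer yields a tensor result: the "out" operand %C is now the
// destination buffer, and the accumulation happens in place.
func.func @matmul_bufferized(%A: memref<128x128xf32>, %B: memref<128x128xf32>,
                             %C: memref<128x128xf32>) {
  linalg.generic {
    indexing_maps = [#map_a, #map_b, #map_c],
    iterator_types = ["parallel", "parallel", "reduction"]
  } ins(%A, %B : memref<128x128xf32>, memref<128x128xf32>)
    outs(%C : memref<128x128xf32>) {
  ^bb0(%a: f32, %b: f32, %c: f32):
    %prod = arith.mulf %a, %b : f32
    %sum = arith.addf %c, %prod : f32
    linalg.yield %sum : f32
  }
  return
}
```

From here, a pass such as `--convert-linalg-to-loops` lowers the op to explicit `scf.for` loops over scalar loads and stores, ready for translation to the LLVM dialect.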