Linear algebra serves as the mathematical engine for deep learning, yet representing these operations in a compiler is surprisingly difficult. Traditional intermediate representations (IRs) often lower matrix operations into loops too early, obscuring the geometric structure required for advanced optimizations like tiling and vectorization. The MLIR linalg dialect solves this by treating linear algebra operations as first-class citizens. It provides a structured abstraction that separates the control flow of loops from the actual mathematical computation.
This separation allows the compiler to reason about iteration spaces and data layouts mathematically before generating concrete code. In this section, we examine the linalg.generic operation, the system of affine maps that define data access, and how this infrastructure enables high-performance code generation.
The core design principle of the linalg dialect is the "Structured Operation." Unlike a standard function call or a loop nest, a structured operation explicitly declares its input and output requirements, its iteration space, and the algebraic properties of its loop dimensions.
In standard LLVM IR, a matrix multiplication is just a sequence of loads, stores, and arithmetic instructions buried inside three nested loops. In linalg, it is a single operation that declares which elements it reads and writes (through indexing maps), how its loop dimensions may be reordered or parallelized (through iterator types), and what scalar computation it performs (through a payload region).
This higher level of abstraction allows the compiler to perform aggressive transformations, such as fusing a ReLU activation into a matrix multiplication, without complex dependence analysis. The compiler knows exactly which memory regions are accessed and how, making transformations valid by construction.
Structure of a linalg.generic operation, highlighting the separation between data definitions and the computation payload.
The linalg.generic op is the general form from which all other named operations (like linalg.matmul or linalg.conv_2d) are derived. Understanding generic is essential because the compiler can generalize named operations into this form during optimization.
A linalg.generic operation defines a computation over a loop nest. It requires specific attributes to guide the compiler.
Indexing maps are affine maps that define how loop iteration indices translate to tensor coordinates. They are written in the form (d0, d1, ..., dN) -> (expr0, expr1, ...), where the dimensions on the left enumerate the loops of the iteration space and the expressions on the right compute the coordinates used to access an operand.
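This notation expresses access patterns beyond simple identity indexing. As a brief illustration (these two maps are examples, not part of the matmul below), a transpose and a broadcast look like this:

// Transpose: at loop indices (i, j), read element (j, i) of the operand.
#transpose = affine_map<(i, j) -> (j, i)>

// Broadcast: the operand is indexed only by j, so the same 1-D value
// is reused for every iteration of i.
#broadcast = affine_map<(i, j) -> (j)>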
Consider a matrix multiplication C = A * B, defined element-wise as C[i, j] = sum over k of A[i, k] * B[k, j]. There are three iteration variables: i, j, and k.
In MLIR, we define three affine maps to represent these accesses relative to the iteration domain (i, j, k):
The dialect distinguishes between dimensions that can be executed in parallel and those that enforce an ordering (reduction). For matrix multiplication, i and j are parallel iterators because each element of C can be computed independently. The k dimension is a reduction iterator because it accumulates values into the same output element.
The body of the operation (the region) contains the scalar implementation. It describes what happens to a single element. For matmul, this is a multiply-accumulate operation.
Here is how a matrix multiplication looks in the linalg.generic form:
#map_a = affine_map<(i, j, k) -> (i, k)>
#map_b = affine_map<(i, j, k) -> (k, j)>
#map_c = affine_map<(i, j, k) -> (i, j)>

func.func @matmul_generic(%A: tensor<128x128xf32>,
                          %B: tensor<128x128xf32>,
                          %C_init: tensor<128x128xf32>) -> tensor<128x128xf32> {
  %result = linalg.generic {
    indexing_maps = [#map_a, #map_b, #map_c],
    iterator_types = ["parallel", "parallel", "reduction"]
  } ins(%A, %B : tensor<128x128xf32>, tensor<128x128xf32>)
    outs(%C_init : tensor<128x128xf32>) {
  ^bb0(%a_elem: f32, %b_elem: f32, %c_elem: f32):
    // The scalar computation payload: multiply-accumulate.
    %prod = arith.mulf %a_elem, %b_elem : f32
    %sum = arith.addf %c_elem, %prod : f32
    linalg.yield %sum : f32
  } -> tensor<128x128xf32>
  return %result : tensor<128x128xf32>
}
In this snippet, ins and outs define the data operands. The region ^bb0 receives the scalar elements selected by the affine maps, and linalg.yield returns the computed value that updates the corresponding element of the output tensor.
One of the primary advantages of Linalg is that tiling is implemented as a transformation on the IR itself, rather than just loop generation logic. Tiling a linalg op produces a loop nest (often using the scf, or Structured Control Flow, dialect) where the inner body is a smaller linalg op representing the tile.
This recursive definition preserves the semantics of the operation at every level of the loop nest. If you tile a matrix multiplication, the inner kernel is still a matrix multiplication, just on smaller views of the data.
When we tile an operation, we essentially introduce new loops that iterate over blocks of data. The linalg infrastructure calculates the necessary sub-views (slices) of the input tensors automatically based on the affine maps provided in the definition.
Tiling partitions the iteration space into distinct access blocks. The heatmap illustrates four tiles being processed, showing how the global operation is decomposed into local structured operations.
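To make this concrete, the following is a simplified sketch of the IR that tiling could produce for the matmul above with 32x32 output tiles. A real tiling transform also inserts affine.min operations to handle boundary tiles when the sizes do not divide evenly; that is omitted here since 32 divides 128.

%c0 = arith.constant 0 : index
%c32 = arith.constant 32 : index
%c128 = arith.constant 128 : index

// Outer loops iterate over tiles; iter_args threads the accumulating result.
%tiled = scf.for %i = %c0 to %c128 step %c32
    iter_args(%acc_i = %C_init) -> (tensor<128x128xf32>) {
  %inner = scf.for %j = %c0 to %c128 step %c32
      iter_args(%acc_j = %acc_i) -> (tensor<128x128xf32>) {
    // Sub-views are derived from the indexing maps of the original op.
    %a_tile = tensor.extract_slice %A[%i, 0] [32, 128] [1, 1]
        : tensor<128x128xf32> to tensor<32x128xf32>
    %b_tile = tensor.extract_slice %B[0, %j] [128, 32] [1, 1]
        : tensor<128x128xf32> to tensor<128x32xf32>
    %c_tile = tensor.extract_slice %acc_j[%i, %j] [32, 32] [1, 1]
        : tensor<128x128xf32> to tensor<32x32xf32>
    // The tile body is still a matmul, just on smaller views.
    %tile_res = linalg.matmul
        ins(%a_tile, %b_tile : tensor<32x128xf32>, tensor<128x32xf32>)
        outs(%c_tile : tensor<32x32xf32>) -> tensor<32x32xf32>
    %updated = tensor.insert_slice %tile_res into %acc_j[%i, %j] [32, 32] [1, 1]
        : tensor<32x32xf32> into tensor<128x128xf32>
    scf.yield %updated : tensor<128x128xf32>
  }
  scf.yield %inner : tensor<128x128xf32>
}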
Linalg excels at operator fusion. Because the generic op exposes the element-wise access pattern, the compiler can fuse a producer operation (like a linalg.add) directly into the consumer operation (like a linalg.matmul).
Fusion in Linalg is generally achieved by "tiling the consumer and fusing the producer." The compiler tiles the consumer operation first. Then, for each tile of the consumer, it computes only the slice of the producer required for that tile. This improves cache locality by computing values immediately before they are consumed, keeping data in registers or L1 cache.
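Alongside tile-and-fuse, Linalg also supports a simpler elementwise fusion pattern when producer and consumer share compatible parallel indexing maps. As a sketch, consider a bias add feeding a ReLU applied to a matmul result; the value names %matmul_out, %bias, and %init are assumed here, and exact op spellings such as arith.maximumf vary across MLIR versions. Fusion rewrites the two elementwise generics into one whose payload chains both computations, so the intermediate tensor never materializes:

#id = affine_map<(i, j) -> (i, j)>

// Single fused generic: bias add and ReLU in one payload.
%fused = linalg.generic {
  indexing_maps = [#id, #id, #id],
  iterator_types = ["parallel", "parallel"]
} ins(%matmul_out, %bias : tensor<128x128xf32>, tensor<128x128xf32>)
  outs(%init : tensor<128x128xf32>) {
^bb0(%x: f32, %b: f32, %out: f32):
  %zero = arith.constant 0.0 : f32
  %sum = arith.addf %x, %b : f32            // bias add (former producer)
  %relu = arith.maximumf %sum, %zero : f32  // ReLU (former consumer)
  linalg.yield %relu : f32
} -> tensor<128x128xf32>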
MLIR allows operations to work on tensor types (immutable values, useful for mathematical reasoning) or memref types (mutable memory buffers, representing physical hardware memory). High-level optimization usually happens on tensors. However, hardware executes instructions on memory.
The process of converting tensor-based Linalg operations to buffer-based operations is called Bufferization.
A central concept here is destination-passing style: a Linalg operation on tensors takes an explicit output operand (%C_init in the example above). In tensor space, this acts as an initialization value. In buffer space, it becomes the memory buffer where the result is written.

Once an operation is bufferized, it can be lowered to loops (scf.for) and eventually to the LLVM dialect for machine code generation. The Linalg dialect effectively acts as the bridge between the abstract definition of a tensor program and the concrete reality of pointer arithmetic.
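For illustration, here is roughly what the earlier matmul looks like after bufferization, reusing the same indexing maps defined above. The operation now reads and updates the %C buffer in place and produces no SSA result:

func.func @matmul_bufferized(%A: memref<128x128xf32>,
                             %B: memref<128x128xf32>,
                             %C: memref<128x128xf32>) {
  linalg.generic {
    indexing_maps = [#map_a, #map_b, #map_c],
    iterator_types = ["parallel", "parallel", "reduction"]
  } ins(%A, %B : memref<128x128xf32>, memref<128x128xf32>)
    outs(%C : memref<128x128xf32>) {
  ^bb0(%a_elem: f32, %b_elem: f32, %c_elem: f32):
    // Same payload as before; %c_elem is read from and the yielded
    // value is stored back to the %C buffer.
    %prod = arith.mulf %a_elem, %b_elem : f32
    %sum = arith.addf %c_elem, %prod : f32
    linalg.yield %sum : f32
  }
  return
}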