Implementing a custom dialect shifts your role from simply using MLIR's existing functionality to actively building new compiler components. While predefined dialects like linalg and affine cover standard tensor operations, bespoke hardware accelerators or domain-specific frameworks often require operations that do not map cleanly to existing IR structures. By defining a custom dialect, you establish a contract between the high-level logic of your machine learning model and the optimization passes that follow.
This practical session focuses on the Operation Definition Specification (ODS) framework using TableGen. We will construct a minimal dialect named tmath (Tensor Math) designed to handle simplified matrix operations. We will then implement a lowering pass that transforms these high-level operations into the affine dialect, enabling the polyhedral optimizations discussed in previous chapters.
Writing C++ boilerplate for every compiler operation is error-prone and difficult to maintain. MLIR solves this by using TableGen, a record-based domain-specific language, to define the structure, constraints, and verification logic of operations. The build system then generates the corresponding C++ classes (headers and implementations) automatically.
The workflow involves three primary stages: defining the dialect, defining the operations, and linking the generated artifacts into the main compiler binary.
During the build, TableGen translates these declarative specifications into C++ source code, which is subsequently compiled into the final binary.
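In a CMake-based MLIR project, this generation step is typically wired up with the add_mlir_dialect macro, which invokes mlir-tablegen on the .td files to produce the .h.inc and .cpp.inc files that the handwritten C++ then includes.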
The dialect definition serves as the namespace and registration point for your operations. In a file named TMathDialect.td, we inherit from the Dialect class provided by ODS.
// TMathDialect.td
include "mlir/IR/OpBase.td"

def TMath_Dialect : Dialect {
  let name = "tmath";
  let summary = "A minimal tensor math dialect for demonstration";
  let description = [{
    The tmath dialect provides high-level matrix operations
    designed to be lowered to affine loops.
  }];
  let cppNamespace = "::mlir::tmath";
}
This definition generates a C++ class TMathDialect in the specified namespace. The name field determines the prefix for the IR, so operations will appear as tmath.op_name in the text format.
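The generated class also declares an initialize hook that we define by hand to register the operations we are about to specify. A minimal sketch, assuming the generated include files are named TMathDialect.cpp.inc and TMathOps.cpp.inc (the exact names depend on your build configuration), looks like this:

// TMathDialect.cpp (sketch; generated file names depend on the build setup)
#include "TMathDialect.h"
#include "TMathOps.h"

// Pull in the TableGen-generated dialect definitions.
#include "TMathDialect.cpp.inc"

void mlir::tmath::TMathDialect::initialize() {
  // Register every operation declared in TMathOps.td with this dialect.
  addOperations<
#define GET_OP_LIST
#include "TMathOps.cpp.inc"
      >();
}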
With the dialect established, we define the operations. An operation in MLIR is characterized by its name, arguments (operands and attributes), results, and traits. Traits allow the compiler to reason about the behavior of the operation, such as whether it has side effects or if the result type depends on the input type.
We will define a MatMulOp. Unlike the generic linalg.matmul, our operation will enforce strictly ranked 2D tensors to simplify the lowering logic.
// TMathOps.td
include "TMathDialect.td"
include "mlir/Interfaces/SideEffectInterfaces.td"

def MatMulOp : Op<TMath_Dialect, "matmul", [Pure]> {
  let summary = "Performs matrix multiplication";
  let description = [{
    Computes the product of two 2D tensors.
    Returns a new tensor with dimensions [M, N] given inputs [M, K] and [K, N].
  }];

  let arguments = (ins
    F32Tensor:$lhs,
    F32Tensor:$rhs
  );

  let results = (outs F32Tensor:$result);

  let assemblyFormat = "$lhs `,` $rhs attr-dict `:` type($lhs) `*` type($rhs) `->` type($result)";

  let hasVerifier = 1;
}
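Given this assembly format, an instance of the operation prints in textual IR along these lines (the tensor shapes are just an example):

%0 = tmath.matmul %a, %b : tensor<4x8xf32> * tensor<8x16xf32> -> tensor<4x16xf32>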
In this specification:
- The [Pure] trait indicates that the operation does not access memory or global state, which allows Dead Code Elimination (DCE) to remove it when its result is unused.
- The F32Tensor constraint on the arguments and result causes ODS to generate type-checking code automatically, ensuring only 32-bit floating-point tensors are accepted.
- hasVerifier = 1 tells TableGen that we will provide a C++ implementation to validate additional constraints, such as ensuring the inner dimensions of the matrices match (the shared K dimension in [M, K] and [K, N]).
TableGen generates the verifier declaration, but we must implement the logic in the corresponding .cpp file. This verification step is important for catching shape mismatches early in the compilation pipeline.
// TMathOps.cpp
llvm::LogicalResult MatMulOp::verify() {
  // F32Tensor also admits unranked tensors, so check the rank explicitly.
  auto lhsType = llvm::dyn_cast<RankedTensorType>(getLhs().getType());
  auto rhsType = llvm::dyn_cast<RankedTensorType>(getRhs().getType());
  if (!lhsType || !rhsType || lhsType.getRank() != 2 || rhsType.getRank() != 2)
    return emitOpError("operands must be 2D tensors");

  if (lhsType.getDimSize(1) != rhsType.getDimSize(0)) {
    return emitOpError("dimension mismatch: lhs column size ")
           << lhsType.getDimSize(1) << " must match rhs row size "
           << rhsType.getDimSize(0);
  }
  return success();
}
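For example, attempting to multiply a tensor<4x8xf32> by a tensor<16x4xf32> is rejected at verification time with a message along the lines of "dimension mismatch: lhs column size 8 must match rhs row size 16".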
A dialect is only useful if it can be translated into executable code. We will implement a rewrite pattern that lowers tmath.matmul into the affine dialect. This transformation replaces the high-level matrix multiplication with three nested loops, explicit loads, and stores.
The pattern rewriting infrastructure in MLIR revolves around the OpRewritePattern class. We override the matchAndRewrite method.
struct MatMulLowering : public OpRewritePattern<tmath::MatMulOp> {
  using OpRewritePattern<tmath::MatMulOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(tmath::MatMulOp op,
                                PatternRewriter &rewriter) const override {
    auto loc = op.getLoc();
    Value lhs = op.getLhs();
    Value rhs = op.getRhs();

    // Get the static shapes; the verifier guarantees ranked 2D tensors.
    auto lhsType = llvm::cast<RankedTensorType>(lhs.getType());
    auto rhsType = llvm::cast<RankedTensorType>(rhs.getType());
    int64_t M = lhsType.getDimSize(0);
    int64_t K = lhsType.getDimSize(1);
    int64_t N = rhsType.getDimSize(1);

    // Allocate a buffer for the result using memref.
    auto resultMemRefType = MemRefType::get({M, N}, rewriter.getF32Type());
    Value resultAlloc = rewriter.create<memref::AllocOp>(loc, resultMemRefType);

    // Zero-initialize the accumulator buffer before the reduction loops.
    Value zero = rewriter.create<arith::ConstantOp>(loc, rewriter.getF32FloatAttr(0.0f));
    affine::buildAffineLoopNest(
        rewriter, loc, /*lbs=*/{0, 0}, /*ubs=*/{M, N}, /*steps=*/{1, 1},
        [&](OpBuilder &b, Location loc, ValueRange ivs) {
          b.create<affine::AffineStoreOp>(loc, zero, resultAlloc, ivs);
        });

    // Build the loop nest:
    //   for i in [0, M), for j in [0, N), for k in [0, K)
    affine::buildAffineLoopNest(
        rewriter, loc, /*lbs=*/{0, 0, 0}, /*ubs=*/{M, N, K}, /*steps=*/{1, 1, 1},
        [&](OpBuilder &b, Location loc, ValueRange ivs) {
          Value i = ivs[0];
          Value j = ivs[1];
          Value k = ivs[2];
          // Load A[i, k]
          Value aVal = b.create<affine::AffineLoadOp>(loc, lhs, ValueRange{i, k});
          // Load B[k, j]
          Value bVal = b.create<affine::AffineLoadOp>(loc, rhs, ValueRange{k, j});
          // Compute the product
          Value product = b.create<arith::MulFOp>(loc, aVal, bVal);
          // Load the current accumulator value C[i, j]
          Value cVal = b.create<affine::AffineLoadOp>(loc, resultAlloc, ValueRange{i, j});
          // Accumulate
          Value sum = b.create<arith::AddFOp>(loc, cVal, product);
          // Store the result back
          b.create<affine::AffineStoreOp>(loc, sum, resultAlloc, ValueRange{i, j});
        });

    // Since we moved from tensor to memref, a bufferization pass is normally
    // required so that lhs and rhs are memrefs by the time this pattern runs.
    // For this snippet, we assume the inputs were already bufferized.
    rewriter.replaceOp(op, resultAlloc);
    return success();
  }
};
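A rewrite pattern does nothing on its own; a pass has to drive it. The sketch below, with illustrative names such as TMathToAffineLoweringPass, wraps MatMulLowering in a function pass and applies it greedily; the exact pass boilerplate varies between MLIR versions.

// Sketch of a pass that applies the lowering pattern (names are illustrative).
struct TMathToAffineLoweringPass
    : public PassWrapper<TMathToAffineLoweringPass, OperationPass<func::FuncOp>> {
  MLIR_DEFINE_EXPLICIT_INTERNAL_INLINE_TYPE_ID(TMathToAffineLoweringPass)

  // Declare the dialects that the lowered IR will use.
  void getDependentDialects(DialectRegistry &registry) const override {
    registry.insert<affine::AffineDialect, memref::MemRefDialect,
                    arith::ArithDialect>();
  }

  void runOnOperation() override {
    RewritePatternSet patterns(&getContext());
    patterns.add<MatMulLowering>(&getContext());
    // Apply the pattern until a fixed point is reached.
    if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))
      signalPassFailure();
  }
};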
To use the dialect in a tool like mlir-opt, it must be registered in the MLIRContext. The lowering pass is then added to a PassManager.
int main(int argc, char **argv) {
  mlir::DialectRegistry registry;

  // Register our custom dialect.
  registry.insert<mlir::tmath::TMathDialect>();

  // Register the standard dialects we lower to.
  registry.insert<mlir::affine::AffineDialect, mlir::memref::MemRefDialect>();

  mlir::MLIRContext context(registry);
  // Load a module, run the pass manager...
  // ...
}
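The body elided by the comments above would, in a minimal driver, parse a module and run the lowering roughly as follows (file name and pass name are illustrative, and error handling is kept to a minimum):

// Sketch: parse a module and run the lowering pass.
mlir::OwningOpRef<mlir::ModuleOp> module =
    mlir::parseSourceFile<mlir::ModuleOp>("input.mlir", &context);

mlir::PassManager pm(&context);
pm.addNestedPass<mlir::func::FuncOp>(std::make_unique<TMathToAffineLoweringPass>());
if (mlir::failed(pm.run(*module)))
  return 1;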
By completing this workflow, you have successfully extended the compiler's intermediate representation. The tmath dialect can now be used as a target from a frontend (like a Python parser) and can be optimized or lowered into standard dialects that eventually map to LLVM IR for CPU execution or SPIR-V for GPU execution. This extensibility is the mechanism that allows MLIR to support diverse hardware architectures without rewriting the entire compilation stack.