General-purpose compilers have relied on Intermediate Representations (IRs) like LLVM IR, Java Bytecode, or GCC's GIMPLE for decades. These IRs are masterfully designed for optimizing traditional software written in languages like C++, Java, or Fortran. They excel at representing scalar operations, pointer arithmetic, complex control flow, and function calls. However, when confronted with the unique characteristics of machine learning workloads, these well-established IRs reveal significant limitations. The core issue lies in a fundamental mismatch between the abstractions needed for ML optimization and the level at which traditional IRs operate.
Let's examine the specific shortcomings:
ML models are typically expressed as computation graphs where nodes represent high-level operations like convolution, matrix multiplication, pooling, or activation functions, operating on multi-dimensional tensors. Traditional compiler IRs operate at a much lower level, typically representing computations as a control-flow graph (CFG) of basic blocks containing scalar instructions or, at best, vector instructions.
Consider a simple sequence like Convolution -> ReLU -> Pooling. In an ML framework's graph representation, this is explicit.
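As a rough, hypothetical sketch (the node and attribute names below are illustrative, not tied to any particular framework), such a graph can be captured as a small list of operator nodes acting on whole tensors:

```python
# Illustrative sketch of a framework-level computation graph.
# Node and attribute names are hypothetical, not from any specific framework.

conv_relu_pool_graph = [
    {"op": "Conv2D",  "inputs": ["input", "weights"], "output": "conv_out",
     "attrs": {"strides": (1, 1), "padding": "SAME"}},
    {"op": "ReLU",    "inputs": ["conv_out"],         "output": "relu_out"},
    {"op": "MaxPool", "inputs": ["relu_out"],         "output": "pool_out",
     "attrs": {"window": (2, 2), "strides": (2, 2)}},
]

# Each node names an entire tensor operation; the compiler can reason about
# "Conv2D followed by ReLU" directly, without inspecting any loops.
```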
A high-level representation of a common ML operator sequence.
When lowered prematurely into a traditional IR like LLVM IR, this structure dissolves. The Conv2D operation becomes a complex nest of loops implementing the convolution algorithm, the ReLU might become a loop with conditional assignments, and the MaxPool yet another set of loops. The IR primarily sees sequences of loads, stores, floating-point multiplications, additions, and comparisons.
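The sketch below uses plain Python loops to stand in for those scalar instructions; a real lowering would produce LLVM IR rather than Python, and the shapes are purely illustrative:

```python
# What a traditional IR is left with after lowering: only loop nests over scalars.
H, W, C_in, C_out, K = 8, 8, 3, 4, 3

inp  = [[[0.1] * C_in for _ in range(W)] for _ in range(H)]
wgt  = [[[[0.01] * C_in for _ in range(K)] for _ in range(K)] for _ in range(C_out)]
conv = [[[0.0] * C_out for _ in range(W - K + 1)] for _ in range(H - K + 1)]

# Loop nest that used to be the single Conv2D node.
for oc in range(C_out):
    for y in range(H - K + 1):
        for x in range(W - K + 1):
            acc = 0.0
            for ky in range(K):
                for kx in range(K):
                    for ic in range(C_in):
                        acc += inp[y + ky][x + kx][ic] * wgt[oc][ky][kx][ic]
            conv[y][x][oc] = acc

# Separate loop nest that used to be the ReLU node.
for y in range(len(conv)):
    for x in range(len(conv[0])):
        for oc in range(C_out):
            if conv[y][x][oc] < 0.0:
                conv[y][x][oc] = 0.0

# MaxPool would be yet another loop nest. Nothing here says "convolution" or
# "activation" anymore; it is all loads, multiplies, adds, and comparisons.
```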
A highly simplified view of how the high-level operations might appear after lowering into a traditional, scalar-focused IR. The original graph structure and operator semantics are obscured.
This immediate loss of high-level semantics is detrimental. Optimizations like operator fusion (combining Conv2D and ReLU into a single efficient kernel) become incredibly difficult. The compiler would need sophisticated loop analysis and pattern matching to try to reconstruct the original high-level intent from the sea of low-level instructions, a task that is complex and often fails.
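At the graph level, by contrast, fusion is a straightforward structural rewrite. A minimal sketch in the same toy node-list style as above (operator names are hypothetical):

```python
# Minimal sketch of operator fusion as a graph rewrite. Real compilers use
# richer IR data structures; this only shows how simple the match is when the
# operators are still visible as nodes.

def fuse_conv_relu(graph):
    """Replace adjacent Conv2D -> ReLU pairs with a single fused node."""
    fused, i = [], 0
    while i < len(graph):
        node = graph[i]
        nxt = graph[i + 1] if i + 1 < len(graph) else None
        if (node["op"] == "Conv2D" and nxt is not None
                and nxt["op"] == "ReLU"
                and nxt["inputs"] == [node["output"]]):
            fused.append({"op": "Conv2DReLU",
                          "inputs": node["inputs"],
                          "output": nxt["output"],
                          "attrs": node.get("attrs", {})})
            i += 2  # consume both nodes
        else:
            fused.append(node)
            i += 1
    return fused

graph = [
    {"op": "Conv2D",  "inputs": ["x", "w"], "output": "c"},
    {"op": "ReLU",    "inputs": ["c"],      "output": "r"},
    {"op": "MaxPool", "inputs": ["r"],      "output": "p"},
]
print([n["op"] for n in fuse_conv_relu(graph)])
# ['Conv2DReLU', 'MaxPool'] -- a few lines of pattern matching, versus
# heavyweight loop analysis on the lowered form.
```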
ML revolves around operations on tensors, which are multi-dimensional arrays. Traditional IRs typically lack first-class support for tensor types. Representing a tensor often requires managing pointers and explicit dimension information separately. More importantly, operations on these tensors (like matmul or conv2d) are not primitive operations in the IR. They must be immediately decomposed into loops and scalar arithmetic. This prevents the compiler from reasoning about the properties of the tensor operation itself. For example, knowing an operation is a matrix multiplication allows applying specialized library calls (like BLAS) or hardware instructions (like Tensor Core operations), but this information is lost if the IR only sees loops.
The physical layout of tensor data in memory (e.g., NCHW vs. NHWC for images) significantly impacts performance, especially on hardware like GPUs or specialized accelerators. Optimizing layouts often involves transforming the data between operations or choosing layouts based on hardware characteristics. Traditional IRs generally don't provide mechanisms to represent or manipulate data layouts as a first-class concept tied to the tensor type or operation. Layout transformations, if possible at all, often require complex rewrites of the low-level loop structures, making automatic layout optimization extremely challenging.
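The impact of layout is visible even in how a single element is addressed. The sketch below (shapes illustrative) shows the linear offset of the same logical element (n, c, h, w) under the two common image layouts; every loop nest inherits its memory access pattern, and hence its cache and vectorization behavior, from one of these formulas:

```python
# How the same logical element (n, c, h, w) maps to different linear memory
# offsets under NCHW versus NHWC. Shapes are illustrative.
N, C, H, W = 2, 3, 4, 5

def offset_nchw(n, c, h, w):
    # Channels vary slowly: all pixels of one channel are contiguous.
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w):
    # Channels vary fastest: all channels of one pixel are contiguous.
    return ((n * H + h) * W + w) * C + c

# The same element lands at very different addresses, so the "best" loop
# order and vectorization strategy differ per layout and per hardware target.
print(offset_nchw(0, 1, 2, 3), offset_nhwc(0, 1, 2, 3))
```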
Modern ML workloads run on a diverse set of hardware: multi-core CPUs with vector units, GPUs with complex memory hierarchies and parallel execution models, and specialized accelerators like TPUs, NPUs, or FPGAs, each with unique instruction sets and memory architectures. While traditional IRs like LLVM have backends for CPUs and GPUs, the IR itself might not be the most suitable abstraction layer. Representing computations for a systolic array on a TPU, for instance, requires different abstractions than representing code for a SIMT architecture on a GPU. A traditional, lower-level IR might force decisions too early or lack the constructs needed to effectively map computations to these diverse hardware targets without losing optimization potential.
ML compiler optimizations often involve techniques specific to the domain, such as algebraic simplifications based on mathematical properties of operators (e.g., simplifying transpose(transpose(A)) to A) or exploiting properties like sparsity. Traditional IRs are designed for general-purpose code and lack the structure or semantics to easily represent or apply these domain-specific rules at the appropriate level. Implementing such optimizations requires complex pattern matching on low-level code sequences, hindering the development and effectiveness of ML-specific compiler passes.
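When the IR still contains explicit transpose nodes, the rule above is a short pattern match. A minimal sketch in the same toy node-list style (operator names hypothetical; a real pass would also check that the two transposes use inverse permutations):

```python
# Sketch of a domain-specific rewrite: transpose(transpose(A)) -> A.

def eliminate_double_transpose(graph):
    producers = {n["output"]: n for n in graph}
    rewritten = []
    for node in graph:
        if node["op"] == "Transpose":
            src = producers.get(node["inputs"][0])
            if src is not None and src["op"] == "Transpose":
                # Forward the original tensor instead of transposing twice.
                rewritten.append({"op": "Identity",
                                  "inputs": src["inputs"],
                                  "output": node["output"]})
                continue
        rewritten.append(node)
    return rewritten

graph = [
    {"op": "Transpose", "inputs": ["A"],       "output": "t1"},
    {"op": "Transpose", "inputs": ["t1"],      "output": "t2"},
    {"op": "MatMul",    "inputs": ["t2", "B"], "output": "y"},
]
print([n["op"] for n in eliminate_double_transpose(graph)])
# ['Transpose', 'Identity', 'MatMul'] -- a later dead-code pass drops the now
# unused first Transpose. On lowered loop nests, none of this is visible.
```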
Optimizing an entire ML model requires a global view of the computation graph. This allows for optimizations like static memory planning (allocating memory for the entire graph execution upfront) or identifying large-scale parallelization opportunities. When the model is immediately lowered to a traditional IR's CFG representation, this global graph structure is fragmented, making such high-level, graph-wide analyses and transformations much harder to perform.
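Static memory planning illustrates the point: with every tensor's size and lifetime visible up front, buffer offsets can be assigned before the model ever runs, reusing memory across tensors whose lifetimes do not overlap. Below is a greedy sketch under simplified assumptions (byte sizes and step-indexed lifetimes are given; alignment and in-place reuse are ignored):

```python
# Greedy sketch of static memory planning over a whole graph. Each tensor is
# described by (size_in_bytes, first_use_step, last_use_step).

def plan_memory(tensors):
    placements = {}                      # name -> (offset, size)
    assigned = []                        # (offset, size, first_use, last_use)
    # Place large tensors first, a common greedy heuristic.
    for name, (size, first, last) in sorted(tensors.items(),
                                            key=lambda kv: -kv[1][0]):
        offset = 0
        moved = True
        while moved:                     # re-scan until the slot is conflict-free
            moved = False
            for o, s, f, l in assigned:
                lifetimes_overlap = not (last < f or l < first)
                space_overlaps = offset < o + s and o < offset + size
                if lifetimes_overlap and space_overlaps:
                    offset = o + s       # bump past the conflicting buffer
                    moved = True
        assigned.append((offset, size, first, last))
        placements[name] = (offset, size)
    return placements

# Toy lifetimes for the Conv -> ReLU -> Pool example: conv_out dies once ReLU
# has consumed it, so pool_out can reuse its space.
tensors = {
    "conv_out": (1024, 0, 1),
    "relu_out": (1024, 1, 2),
    "pool_out": (256,  2, 3),
}
print(plan_memory(tensors))
# {'conv_out': (0, 1024), 'relu_out': (1024, 1024), 'pool_out': (0, 256)}
```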
These limitations collectively demonstrate that while traditional compiler IRs are powerful tools for general software, they present an impedance mismatch for the demands of optimizing complex, high-level ML models. The need to preserve high-level semantics, explicitly handle tensor operations and layouts, target diverse hardware effectively, and facilitate domain-specific graph optimizations drives the development and adoption of specialized, multi-level intermediate representations, which we will explore next, starting with MLIR.