An ML compiler is not a single monolithic program. Instead, it functions as a modular pipeline designed to progressively lower high-level mathematical descriptions into low-level machine code. While standard compilers like GCC or Clang transform C++ into assembly, ML compilers transform computation graphs (such as a ResNet or Transformer model) into optimized kernel binaries.

To manage the complexity of mapping linear algebra to diverse hardware backends, ranging from CPUs and GPUs to TPUs and FPGAs, ML compilers adopt a layered architecture. This design separates the problem of model representation from the problem of hardware execution.

We generally categorize the compiler stack into three primary stages: the Frontend, the Middle-End (Optimizer), and the Backend.

```dot
digraph G {
  rankdir=TB;
  node [fontname="Helvetica", shape=box, style=filled, color="#dee2e6"];
  edge [fontname="Helvetica", color="#868e96"];

  subgraph cluster_0 {
    label = "ML Framework";
    style=dashed;
    color="#adb5bd";
    Framework [label="PyTorch / TensorFlow\nDefinition", fillcolor="#e9ecef"];
  }

  subgraph cluster_1 {
    label = "ML Compiler Stack";
    color="#ced4da";
    style=solid;
    Frontend [label="Frontend\n(Importer & Parser)", fillcolor="#74c0fc"];
    HighIR [label="High-Level IR\n(Computation Graph)", fillcolor="#bac8ff"];
    Optimizer [label="Optimizer\n(Passes & Tiling)", fillcolor="#74c0fc"];
    LowIR [label="Low-Level IR\n(Loops & Indices)", fillcolor="#bac8ff"];
    Backend [label="Backend\n(CodeGen)", fillcolor="#74c0fc"];
  }

  subgraph cluster_2 {
    label = "Hardware";
    style=dashed;
    color="#adb5bd";
    Binary [label="Executable Binary\n(ELF / CUBIN)", fillcolor="#e9ecef"];
  }

  Framework -> Frontend;
  Frontend -> HighIR;
  HighIR -> Optimizer;
  Optimizer -> LowIR;
  LowIR -> Backend;
  Backend -> Binary;
}
```

*Data flow through the primary components of a machine learning compiler stack*

## The Frontend: Ingestion and Translation

The frontend acts as the interface between the user-facing deep learning framework and the compiler internals. Its primary responsibility is to ingest the model definition and translate it into a format the compiler can manipulate. This process is often called "lowering."

When you run a PyTorch model, the framework maintains a dynamic graph of operations. The frontend captures this graph using one of two methods:

- **Tracing:** Running the model with dummy data and recording the operations executed.
- **Scripting/Parsing:** Analyzing the Python Abstract Syntax Tree (AST) to build a static representation of the control flow.

Once captured, the frontend maps framework-specific operators (like `torch.nn.Conv2d`) to the compiler's own High-Level Intermediate Representation (IR). This IR is usually a Directed Acyclic Graph (DAG) where nodes represent mathematical tensor operations and edges represent data dependencies.

## The Middle-End: The Optimization Engine

The middle-end is where the compiler applies transformations to improve performance without changing the mathematical result. This stage is platform-independent, or only weakly platform-dependent, and it operates on the High-Level IR.

Optimizations here fall into two categories:

### Graph-Level Optimizations

These modifications change the structure of the computation graph. A classic example is operator fusion. If the graph contains a matrix multiplication followed immediately by a ReLU activation:

$$Y = \text{ReLU}(X \times W + b)$$

The compiler fuses these into a single kernel, avoiding the need to write the intermediate result of $(X \times W + b)$ back to main memory before reading it again for the ReLU. We will examine this extensively in Chapter 3.
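To see what fusion buys, here is a minimal NumPy sketch, not the output of any real compiler: `unfused` materializes the full intermediate tensor in main memory, while `fused` produces one tile at a time and applies the ReLU while that tile is still cache-resident. The function names and the tile size of 64 are illustrative assumptions; a production compiler fuses at the level of generated loops and registers rather than NumPy calls, but the memory-traffic argument is the same.

```python
import numpy as np

def unfused(X, W, b):
    # Two separate "kernels": the full intermediate (X @ W + b) is written
    # out to memory, then read back again just to apply the ReLU.
    T = X @ W + b
    return np.maximum(T, 0)

def fused(X, W, b, tile=64):
    # One "kernel": each tile of the intermediate is produced and consumed
    # immediately, so the full intermediate tensor never round-trips
    # through main memory.
    Y = np.empty((X.shape[0], W.shape[1]), dtype=X.dtype)
    for i in range(0, X.shape[0], tile):
        block = X[i:i + tile] @ W + b         # small tile, stays cache-resident
        Y[i:i + tile] = np.maximum(block, 0)  # ReLU applied on the spot
    return Y

X = np.random.rand(512, 256).astype(np.float32)
W = np.random.rand(256, 128).astype(np.float32)
b = np.random.rand(128).astype(np.float32)

# Loose tolerance: float32 accumulation order may differ slightly.
assert np.allclose(unfused(X, W, b), fused(X, W, b), rtol=1e-4)
```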
### Tensor-Level Optimizations

Once graph optimizations are complete, the compiler "lowers" the graph further into a Tensor or Loop IR. At this stage, high-level operations like "Convolution" are broken down into explicit nested loops. The optimizer then applies techniques such as:

- **Loop Tiling:** Breaking large loops into smaller blocks so that the data they touch fits in the cache.
- **Vectorization:** Converting scalar operations into SIMD (Single Instruction, Multiple Data) instructions.

## The Backend: Code Generation

The backend is responsible for mapping the optimized IR to the specific instruction set architecture (ISA) of the target hardware. This is the only stage that must be heavily customized for the specific device (e.g., an NVIDIA H100 GPU versus an ARM Cortex CPU).

The backend performs two critical tasks:

- **Resource Allocation:** Assigning specific hardware registers and shared memory banks to the variables in the IR.
- **Code Emission:** Translating the IR into the final machine code.

Many modern ML compilers, such as TVM or XLA, do not generate binary machine code directly. Instead, they generate code for a lower-level compiler framework like LLVM (for CPUs) or NVVM (for NVIDIA GPUs). This allows the ML compiler to focus on tensor-specific optimizations while relying on established tools like LLVM to handle instruction scheduling and register allocation.

## The Role of Multi-Level IR

Standard software compilers often use a single primary IR (like LLVM IR). However, the abstraction gap in machine learning is too wide to be bridged by a single format. A representation that is good for graph fusion is often terrible for loop scheduling.

Therefore, the anatomy of a modern ML compiler is defined by Multi-Level Intermediate Representation. The code flows through a cascade of dialects, starting with high-level abstractions (Tensors, Graphs) and ending with low-level abstractions (Pointers, Memory Offsets); a concrete sketch of this lowering appears at the end of the section.

```json
{"layout": {"title": "Abstraction Levels in ML Compilation",
            "xaxis": {"title": "Abstraction Level", "showticklabels": false},
            "yaxis": {"title": "Information Density", "showticklabels": false},
            "height": 400, "width": 600, "showlegend": true},
 "data": [{"x": [1, 2, 3], "y": [90, 50, 10], "mode": "lines+markers",
           "name": "Domain Specific Info (e.g. Convolutions)",
           "line": {"color": "#74c0fc", "width": 4}},
          {"x": [1, 2, 3], "y": [10, 50, 90], "mode": "lines+markers",
           "name": "Hardware Specific Info (e.g. Registers)",
           "line": {"color": "#fa5252", "width": 4}}]}
```

*As the compiler lowers the code (moving right on the X-axis), domain-specific context decreases while hardware-specific details increase.*

In the next section, we will contrast how this compilation pipeline is triggered, specifically comparing Ahead-of-Time (AOT) compilation against Just-in-Time (JIT) execution.
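Before that, it is worth making the lowering cascade concrete. The sketch below, written in plain NumPy rather than the IR of any particular compiler, expresses the same matrix multiplication at two levels: as a single graph-level tensor operation, and as the tiled loop nest a loop-level IR might describe just before code generation. The function names and the tile size of 32 are illustrative assumptions.

```python
import numpy as np

TILE = 32  # illustrative block size; a real compiler tunes this per target

def matmul_graph_level(X, W):
    # High-level IR view: one opaque tensor operation, no memory layout,
    # no loops -- convenient for graph rewrites such as fusion.
    return X @ W

def matmul_loop_level(X, W):
    # Loop-level IR view after lowering: explicit, tiled loops over output
    # blocks. Each block multiply touches data small enough to stay in
    # cache; a backend would further vectorize the innermost computation
    # into SIMD instructions and assign registers.
    M, K = X.shape
    _, N = W.shape
    C = np.zeros((M, N), dtype=X.dtype)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            for k in range(0, K, TILE):
                C[i:i + TILE, j:j + TILE] += (
                    X[i:i + TILE, k:k + TILE] @ W[k:k + TILE, j:j + TILE]
                )
    return C

X = np.random.rand(128, 96).astype(np.float32)
W = np.random.rand(96, 160).astype(np.float32)

# Loose tolerance: float32 accumulation order differs between the two forms.
assert np.allclose(matmul_graph_level(X, W), matmul_loop_level(X, W), rtol=1e-4)
```

Notice what each level is good at: the graph-level form is easy to rewrite and fuse, while the loop-level form exposes the blocking (and, in a real backend, vectorization) decisions that map onto hardware, which is exactly the trade-off plotted above.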