A compiler's Intermediate Representation (IR) describes exactly how to execute a computation, encoding decisions about loop tiling, vectorization widths, and memory access patterns. However, this representation is still internal to the compiler framework. To run this logic on physical hardware, the compiler must translate these internal structures into a format that the target processor understands. This process is known as code generation, or "codegen."

The code generation phase acts as the bridge between the abstract mathematical operations of a neural network and the concrete instruction sets of CPUs, GPUs, or accelerators. Rather than writing assembly code manually for every architecture, modern ML compilers leverage established backend frameworks to handle this final step of translation.

### From Schedules to Low-Level IR

Before generating binary code, the compiler performs a "lowering" phase. The high-level graph, which deals with tensors and operators, is lowered into a pointer-based representation often called Low-Level IR. In this state, the concept of a "Tensor" disappears. Instead, the compiler sees flat memory buffers, explicit loop indices, and primitive arithmetic operations.

For example, a high-level matrix multiplication $C = A \times B$ is transformed into a triply nested loop structure. At this stage, the compiler injects the optimization decisions made during the auto-tuning phase. If the auto-tuner decided to unroll the inner loop by a factor of 4, the Low-Level IR explicitly reflects this by repeating the arithmetic instructions four times within the loop body.

This separation allows the ML compiler to focus on loop transformations without worrying about register allocation or instruction scheduling, tasks which are delegated to downstream backends.
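To make the lowered form concrete, the following sketch mimics what the matrix multiplication above looks like after lowering and unrolling. It is written in Python purely for illustration: the flat lists stand in for raw memory buffers, and a real Low-Level IR would express the same structure with pointers and primitive arithmetic. The function name and the assumption that `K` is divisible by 4 are illustrative choices, not part of any particular compiler.

```python
def matmul_lowered(A, B, C, M, N, K):
    """Lowered matmul: flat buffers, explicit indices, inner loop unrolled by 4.

    A, B, C are flat lists (stand-ins for raw memory buffers) holding
    row-major M x K, K x N, and M x N matrices. Assumes K % 4 == 0, the kind
    of constraint a compiler would enforce or handle with an epilogue loop.
    """
    for i in range(M):
        for j in range(N):
            acc = 0.0
            # The reduction loop steps by 4 and the multiply-accumulate is
            # repeated four times: the unrolling decision is now explicit.
            for k in range(0, K, 4):
                acc += A[i * K + k + 0] * B[(k + 0) * N + j]
                acc += A[i * K + k + 1] * B[(k + 1) * N + j]
                acc += A[i * K + k + 2] * B[(k + 2) * N + j]
                acc += A[i * K + k + 3] * B[(k + 3) * N + j]
            C[i * N + j] = acc
```

Note that nothing in this form refers to a tensor: only buffers, indices, and multiply-accumulate operations remain, which is exactly what the downstream backend needs to see.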
```dot
digraph G {
  rankdir=TB;
  node [shape=box, style=filled, fontname="Arial", fontsize=12, margin=0.2];
  edge [fontname="Arial", fontsize=10, color="#868e96"];

  TunedIR [label="Tuned Schedule (High-Level)", fillcolor="#eebefa", color="#be4bdb"];
  lowering [label="Lowering Pass", fillcolor="#e9ecef", color="#adb5bd", shape=ellipse];
  LowLevelIR [label="Low-Level IR\n(Loops, Pointers, Primitives)", fillcolor="#a5d8ff", color="#228be6"];

  subgraph cluster_backends {
    label="Target Backends";
    fontname="Arial";
    fontsize=12;
    style=dashed;
    color="#ced4da";
    LLVM [label="LLVM IR Builder\n(for CPU)", fillcolor="#96f2d7", color="#12b886"];
    SourceGen [label="Source Code Generator\n(CUDA/OpenCL for GPU)", fillcolor="#ffc9c9", color="#fa5252"];
  }

  Binary [label="Machine Code / Binary", fillcolor="#d0bfff", color="#7950f2"];

  TunedIR -> lowering;
  lowering -> LowLevelIR;
  LowLevelIR -> LLVM [label=" CPU Target"];
  LowLevelIR -> SourceGen [label=" GPU Target"];
  LLVM -> Binary;
  SourceGen -> Binary;
}
```

*Flow of data from optimized schedule to executable machine code, distinguishing between LLVM-based and source-based generation paths.*

### The LLVM Path for CPUs

For CPU targets (x86, ARM, RISC-V), most ML compilers utilize LLVM (originally "Low Level Virtual Machine"). LLVM provides a standardized intermediate representation and a powerful suite of tools to compile that representation into efficient assembly.

The ML compiler traverses its own Low-Level IR and constructs a corresponding LLVM module. This involves mapping primitives to LLVM constructs:

- **Types:** A 32-bit float in the ML compiler maps to `float` in LLVM.
- **Control Flow:** Loops are converted into basic blocks and branch instructions.
- **Intrinsics:** Vectorized operations are mapped to LLVM vector intrinsics, which the LLVM backend later translates to AVX-512 (Intel) or NEON (ARM) instructions.

By targeting LLVM, the ML compiler gains access to decades of engineering work in general-purpose compiler optimization. LLVM handles register allocation, instruction scheduling, and dead-store elimination. The ML compiler simply needs to express the intent of the program correctly in LLVM IR.

Consider a simplified example of how a vector addition might appear when lowered to LLVM IR. The syntax is verbose and typed, resembling assembly but with an unlimited supply of virtual registers:

```llvm
%res = fadd <4 x float> %vecA, %vecB
```

Here, the operation works on a vector of 4 floats simultaneously. The ML compiler generates this generic instruction, and LLVM decides which specific CPU instruction (e.g., `vaddps`) implements it best for the target processor.
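In practice, compilers construct such IR programmatically through an IR builder API rather than by emitting text. The sketch below uses `llvmlite`, a Python binding to LLVM's IR builder, to build a module containing the vector addition above. It is a minimal illustration of the builder pattern under the assumption that `llvmlite` is installed, not the code path of any specific ML compiler.

```python
from llvmlite import ir

# Build an LLVM module containing a single function:
#   <4 x float> vadd4(<4 x float> %vecA, <4 x float> %vecB)
module = ir.Module(name="codegen_demo")

f32 = ir.FloatType()
vec4f = ir.VectorType(f32, 4)                    # the <4 x float> type
fn_type = ir.FunctionType(vec4f, [vec4f, vec4f])
fn = ir.Function(module, fn_type, name="vadd4")

vec_a, vec_b = fn.args
vec_a.name, vec_b.name = "vecA", "vecB"

# Control flow is expressed as basic blocks; a single block suffices here.
entry = fn.append_basic_block(name="entry")
builder = ir.IRBuilder(entry)

# Emit the generic vector add. LLVM later selects the concrete CPU
# instruction (e.g. vaddps on AVX-capable x86) during instruction selection.
result = builder.fadd(vec_a, vec_b, name="res")
builder.ret(result)

# Printing the module shows textual LLVM IR, including the fadd instruction.
print(module)
```

From here, LLVM's own machinery takes over: the module can be handed to an LLVM toolchain for optimization, register allocation, and emission of machine code for the target CPU.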
### Source Generation for GPUs

Generating code for GPUs often follows a different path. While backends like NVVM (based on LLVM) exist for NVIDIA GPUs, many ML compilers use a technique called source-to-source compilation. Instead of generating a binary directly, the compiler generates a string of C-like source code.

For NVIDIA GPUs, the compiler generates CUDA C code. For AMD GPUs or mobile processors, it might generate OpenCL or Vulkan compute shaders. This approach is pragmatic because GPU drivers invoke their own aggressive compilers that are highly tuned for the specific generation of the graphics card.

The process generally involves three steps:

1. **Kernel Extraction:** Identifying the loop nests that will run on the GPU.
2. **Syntax Translation:** Converting the internal IR expressions into strings. For example, an internal `Select(condition, true_val, false_val)` node might be translated to the string `condition ? true_val : false_val`, or to a specialized intrinsic such as `fminf` when the select implements a minimum.
3. **Runtime Compilation:** Passing the generated source string to a runtime compiler (such as NVRTC for CUDA) during program execution, which compiles the string into PTX (Parallel Thread Execution) code or a binary cubin file.

This method allows the ML compiler to inspect the generated source code easily, which is useful for debugging performance issues. If the auto-tuner selects a block size of 128, you will clearly see `__launch_bounds__(128)` or similar directives in the generated C code.

### Handling the Host-Device Split

Code generation is rarely about producing a single block of code. It usually requires generating two distinct parts: the host code and the device code.

The device code is the compute-heavy kernel (e.g., the matrix multiplication) that runs on the accelerator. The host code runs on the CPU and is responsible for:

- Allocating memory on the device.
- Moving data from CPU RAM to device memory.
- Configuring the function arguments (pointers, shapes).
- Launching the kernel.

When the ML compiler generates code, it packages these two components together. The host code is typically compiled into a shared library (such as a .so or .dll), while the device code is embedded within that library as a binary blob or a string.

```json
{
  "layout": {
    "title": {"text": "Execution Time Breakdown (Illustrative)", "font": {"size": 16}},
    "xaxis": {"showgrid": false, "zeroline": false, "visible": false},
    "yaxis": {"showgrid": false, "title": "Stages"},
    "barmode": "stack",
    "height": 250,
    "margin": {"l": 100, "r": 20, "t": 40, "b": 20},
    "showlegend": true
  },
  "data": [
    {"y": ["Execution"], "x": [15], "name": "Argument Packing", "orientation": "h", "type": "bar", "marker": {"color": "#adb5bd"}},
    {"y": ["Execution"], "x": [10], "name": "Kernel Launch Overhead", "orientation": "h", "type": "bar", "marker": {"color": "#ffc9c9"}},
    {"y": ["Execution"], "x": [75], "name": "Device Execution (Generated Code)", "orientation": "h", "type": "bar", "marker": {"color": "#4dabf7"}}
  ]
}
```

*A breakdown of where time is spent during a function call. The code generator must optimize the device execution, but also minimize the overhead of argument packing and kernel launching managed by the host code.*

### JIT vs. AOT Finalization

The final output of the code generation backend depends on the deployment strategy.

In an Ahead-of-Time (AOT) scenario, the backend writes the generated machine code to disk as a shared object file. This library can then be linked into a C++ application or a mobile app, allowing the model to run without the ML compiler framework being present. This is ideal for edge devices where memory and storage are limited.

In a Just-in-Time (JIT) scenario, common in Python experimentation, the backend generates the code in memory. It allocates an executable memory page, writes the machine instructions directly into it, and returns a function pointer. Python then invokes this function pointer via `ctypes` or a similar foreign function interface. This allows for immediate feedback during the auto-tuning process, as the compiler can generate, run, and measure a schedule in a tight loop.

By mastering these backends, you gain the ability to see exactly what the hardware executes. Instead of treating the `matmul` function as a black box, you can inspect the LLVM IR or CUDA source to verify that vectorization is active and memory access patterns are streamlined.
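As a concrete illustration of the calling convention described above, here is a minimal `ctypes` sketch. The library name `libmodel_kernels.so` and the exported symbol `vector_add` are hypothetical stand-ins for whatever the backend actually emitted; the same pattern applies whether the machine code was written to disk ahead of time or JIT-compiled into memory and exposed as a function pointer.

```python
import ctypes

# Hypothetical AOT artifact: a shared library produced by the compiler
# backend, exporting a C symbol vector_add(float* a, float* b, float* out, int n).
lib = ctypes.CDLL("./libmodel_kernels.so")

lib.vector_add.argtypes = [
    ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_float),
    ctypes.c_int,
]
lib.vector_add.restype = None

n = 8
a = (ctypes.c_float * n)(*range(n))      # flat input buffers, as the host
b = (ctypes.c_float * n)(*range(n))      # code would allocate and fill them
out = (ctypes.c_float * n)()

# Invoke the generated kernel through the foreign function interface. A JIT
# backend does the same thing, except the machine code lives in an executable
# memory page rather than in a .so on disk.
lib.vector_add(a, b, out, n)
print(list(out))
```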