A compiler's Intermediate Representation (IR) describes exactly how to execute a computation, encoding specific decisions about loop tiling, vectorization widths, and memory access patterns. However, this representation is still internal to the compiler framework. To run this logic on physical hardware, the compiler must translate these internal structures into a format that the target processor understands. This process is known as code generation, or "codegen."
The code generation phase acts as the bridge between the abstract mathematical operations of a neural network and the concrete instruction sets of CPUs, GPUs, or accelerators. Rather than writing assembly code manually for every architecture, modern ML compilers leverage established backend frameworks to handle the final step of translation.
Before generating binary code, the compiler performs a "lowering" phase. The high-level graph, which deals with tensors and operators, is lowered into a pointer-based representation often called Low-Level IR. In this state, the concept of a "Tensor" disappears. Instead, the compiler sees flat memory buffers, explicit loop indices, and primitive arithmetic operations.
For example, a high-level matrix multiplication operation C = A × B is transformed into a triply nested loop structure. At this stage, the compiler injects the optimization decisions made during the auto-tuning phase. If the auto-tuner decided to unroll the inner loop by a factor of 4, the Low-Level IR explicitly reflects this by repeating the arithmetic instructions four times within the loop body.
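As a rough illustration, printing the lowered form back as C-style source might yield something like the sketch below. The function name and row-major buffer layout are illustrative; the actual Low-Level IR is an in-memory data structure, not C source.

```c
// Hypothetical C-style rendering of the lowered matmul C = A x B on flat
// buffers. Tensors have disappeared: only pointers, indices, and primitive
// arithmetic remain.
void matmul_lowered(const float* A, const float* B, float* C,
                    int M, int N, int K) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            // Inner loop unrolled by a factor of 4, as chosen by the
            // auto-tuner (assumes K is a multiple of 4).
            for (int k = 0; k < K; k += 4) {
                acc += A[i * K + k + 0] * B[(k + 0) * N + j];
                acc += A[i * K + k + 1] * B[(k + 1) * N + j];
                acc += A[i * K + k + 2] * B[(k + 2) * N + j];
                acc += A[i * K + k + 3] * B[(k + 3) * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```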
This separation allows the ML compiler to focus on loop transformations without worrying about register allocation or instruction scheduling, tasks which are delegated to downstream backends.
Flow of data from optimized schedule to executable machine code, distinguishing between LLVM-based and Source-based generation paths.
For CPU targets (x86, ARM, RISC-V), most ML compilers utilize LLVM (Low Level Virtual Machine). LLVM provides a standardized intermediate representation and a powerful suite of tools to compile that representation into efficient assembly.
The ML compiler traverses its own Low-Level IR and constructs a corresponding LLVM module. This involves mapping primitives to LLVM instructions: loads and stores on flat buffers become LLVM load and store instructions, loop structures become basic blocks connected by branches, and data types map directly, so a 32-bit floating-point value becomes float in LLVM.

By targeting LLVM, the ML compiler gains access to decades of engineering work in general-purpose compiler optimization. LLVM handles register allocation, instruction scheduling, and dead-store elimination. The ML compiler simply needs to express the intent of the program correctly in LLVM IR.
Consider a simplified example of how a vector addition might appear when lowered to LLVM IR. The syntax is verbose and typed, resembling assembly but with infinite virtual registers:
```llvm
%res = fadd <4 x float> %vecA, %vecB
```

Here, the operation works on a vector of 4 floats simultaneously. The ML compiler generates this generic instruction, and LLVM decides which specific CPU instruction (e.g., vaddps) implements it best for the target processor.
Generating code for GPUs often follows a different path. While backends like NVVM (based on LLVM) exist for NVIDIA GPUs, many ML compilers use a technique called source-to-source compilation. Instead of generating a binary directly, the compiler generates a string of C-like source code.
For NVIDIA GPUs, the compiler generates CUDA C code. For AMD GPUs or mobile processors, it might generate OpenCL or Vulkan compute shaders. This approach is pragmatic because GPU drivers invoke their own aggressive compilers that are highly tuned for the specific generation of the graphics card.
The process generally involves walking the Low-Level IR and emitting equivalent C syntax for each node. A Select(condition, true_val, false_val) node, for example, might be translated to the string condition ? true_val : false_val or to a specialized intrinsic like fminf.

This method allows you to inspect the generated source code easily, which is useful for debugging performance issues. If the auto-tuner selects a block size of 128, you will clearly see __launch_bounds__(128) or similar directives in the generated C code.
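As a rough sketch, the emitted source for a simple element-wise addition might look like the following. The kernel name and the launch-bounds value are illustrative, not the output of any particular compiler.

```cuda
// Sketch of the kind of CUDA C a source-to-source backend might emit for an
// element-wise addition.
extern "C" __global__ void __launch_bounds__(128)
vector_add_kernel(const float* __restrict__ a,
                  const float* __restrict__ b,
                  float* __restrict__ out,
                  int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (idx < n) {
        out[idx] = a[idx] + b[idx];
    }
}
```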
Code generation is rarely about generating a single block of code. It usually requires generating two distinct parts: the host code and the device code.
The Device Code is the compute-heavy kernel (e.g., the matrix multiplication) that runs on the accelerator. The Host Code runs on the CPU and is responsible for allocating device memory, copying input data to the accelerator, packing the kernel arguments, launching the kernel with the tuned grid and block configuration, and copying results back.
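The following sketch spells out those responsibilities for the illustrative vector_add_kernel above, assuming both live in the same translation unit. Error handling is omitted; real generated host code is typically hidden behind the framework's runtime API.

```cuda
#include <cuda_runtime.h>

// Illustrative host-side wrapper for the vector_add_kernel sketch above.
void vector_add_host(const float* h_a, const float* h_b, float* h_out, int n) {
    float *d_a, *d_b, *d_out;
    size_t bytes = (size_t)n * sizeof(float);

    // Allocate device buffers and copy the inputs over.
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch with the tuned block size and a grid that covers all n elements.
    int block = 128;
    int grid = (n + block - 1) / block;
    vector_add_kernel<<<grid, block>>>(d_a, d_b, d_out, n);

    // Copy the result back and release device memory.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
}
```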
When the ML compiler generates code, it packages these two components together. The host code is typically compiled into a shared library (like a .so or .dll), while the device code is embedded within that library as a binary blob or a string.
A breakdown of where time is spent during a function call. The code generator must optimize the device execution, but also minimize the overhead of argument packing and launching managed by the host code.
The final output of the code generation backend depends on the deployment strategy.
In an Ahead-of-Time (AOT) scenario, the backend writes the generated machine code to disk as a shared object file. This library can then be linked into a C++ application or a mobile app, allowing the model to run without the ML compiler framework being present. This is ideal for edge devices where memory and storage are limited.
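As a sketch of how such a library might be consumed, a host application could load it at runtime and call its entry point. The library path, symbol name, and signature below are assumptions, since each framework defines its own entry-point convention.

```c
#include <dlfcn.h>
#include <stdio.h>

// Illustrative AOT usage: load an ahead-of-time compiled model library and
// call its entry point. Names, paths, and buffer sizes are hypothetical.
typedef void (*model_fn)(const float* input, float* output);

int main(void) {
    void* lib = dlopen("./libmodel.so", RTLD_NOW);
    if (!lib) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    model_fn run_model = (model_fn)dlsym(lib, "run_model");
    if (!run_model) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(lib);
        return 1;
    }

    static float input[1024];   // sizes are illustrative
    static float output[1024];
    run_model(input, output);   // execute the compiled model

    dlclose(lib);
    return 0;
}
```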
In a Just-in-Time (JIT) scenario, common in Python experimentation, the backend generates the code in memory. It allocates an executable memory page, writes the machine instructions directly into it, and returns a function pointer. Python then invokes this function pointer via ctypes or a similar foreign function interface. This allows for immediate feedback during the auto-tuning process, as the compiler can generate, run, and measure a schedule in a tight loop.
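The following minimal sketch (Linux/x86-64 only, with hard-coded machine code standing in for the backend's output) shows the underlying mechanism: fill an executable page with instructions and call it through a function pointer.

```c
#include <string.h>
#include <sys/mman.h>

// Minimal JIT sketch (Linux/x86-64 only). A real backend emits the bytes from
// its code generator; here they are hard-coded for illustration.
typedef int (*jit_fn)(void);

int main(void) {
    // x86-64 machine code for: mov eax, 42; ret
    unsigned char code[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};

    // Allocate a writable, executable page and copy the instructions into it.
    void* page = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        return 1;
    }
    memcpy(page, code, sizeof(code));

    // Call the freshly generated code through a function pointer.
    jit_fn fn = (jit_fn)page;
    int result = fn();          // returns 42

    munmap(page, 4096);
    return result == 42 ? 0 : 1;
}
```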
By mastering these backends, you gain the ability to see exactly what the hardware executes. Instead of treating the matmul function as a black box, you can inspect the LLVM IR or CUDA source to verify that vectorization is active and memory access patterns are streamlined.