Manual optimization of loop nests requires significant engineering effort. A developer must select specific tile sizes, vector lengths, and unrolling factors to maximize hardware utilization. However, the optimal configuration changes depending on the target hardware architecture. A schedule that performs well on a server-grade GPU often yields poor results on a mobile CPU or an edge accelerator. To address this, modern machine learning compilers employ auto-tuning strategies that treat optimization as a search problem.
This chapter examines how to automate the selection of efficient loop schedules. We begin by defining the search space, which represents the set of all valid code transformations for a given operator. If an operator involves a loop of extent N and we want to split it into tiles of size T, the candidate values of T include every divisor of N. We then look at how cost models evaluate these candidates, estimating performance without executing every variation on physical hardware.
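As a concrete illustration, the sketch below enumerates a toy search space for a single loop of extent N: every divisor of N is a candidate tile size, and each tile size is paired with a small set of unroll factors and vector widths. The function names, knob values, and pruning rule are illustrative assumptions, not part of any particular compiler's API.

```python
from itertools import product

def divisors(n):
    """All tile sizes that evenly divide the loop extent n."""
    return [d for d in range(1, n + 1) if n % d == 0]

def enumerate_search_space(n, unroll_factors=(1, 2, 4), vector_widths=(4, 8, 16)):
    """Cross-product of tile size, unroll factor, and vector width.

    Each dictionary is one candidate schedule for a loop of extent n.
    Real auto-tuners prune this set further (for example, tiles must fit
    in cache), but the structure is the same: a finite space of discrete knobs.
    """
    return [
        {"tile": t, "unroll": u, "vector": v}
        for t, u, v in product(divisors(n), unroll_factors, vector_widths)
        if v <= t  # a vector lane count wider than the tile makes no sense
    ]

space = enumerate_search_space(1024)
print(f"{len(space)} candidate schedules, e.g. {space[0]}")
```

Even for this single loop the space grows quickly, which is why exhaustive measurement on hardware is rarely practical.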
The discussion then moves to the algorithms used to navigate these search spaces. We will examine how compilers use statistical methods and learned models to predict the performance of a schedule, guiding the search toward high-performing parameters more efficiently than random sampling.
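The sketch below shows the shape of a cost-model-guided search, reusing the `enumerate_search_space` helper from the previous sketch. A deliberately simplified analytical heuristic stands in for a learned model, and a `measure_on_hardware` stub stands in for an actual compile-and-run step; both are assumptions for illustration. Production tuners such as TVM's auto-scheduler instead train a statistical model on real measurements and refine it as the search proceeds.

```python
import random

def predicted_cost(candidate, cache_bytes=32 * 1024, elem_size=4):
    """Toy cost model: prefer tiles that fit in L1 cache and wide, unrolled inner loops.

    A real auto-tuner replaces this with a model (e.g., gradient-boosted trees
    or a small neural network) trained on measured run times.
    """
    tile_bytes = candidate["tile"] * elem_size
    cache_penalty = 10.0 if tile_bytes > cache_bytes else 1.0
    return cache_penalty / (candidate["vector"] * candidate["unroll"])

def measure_on_hardware(candidate):
    """Hypothetical stand-in for compiling and timing a candidate on the device."""
    return predicted_cost(candidate) * random.uniform(0.9, 1.1)

def guided_search(space, num_measurements=8):
    """Rank all candidates with the cheap cost model, then measure only the best few."""
    ranked = sorted(space, key=predicted_cost)
    trials = [(c, measure_on_hardware(c)) for c in ranked[:num_measurements]]
    return min(trials, key=lambda pair: pair[1])

best, cost = guided_search(enumerate_search_space(1024))
print("best schedule:", best, "measured cost:", round(cost, 4))
```

The key property is that the model is evaluated thousands of times for the price of a handful of real measurements, which is what makes large search spaces tractable.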
Finally, we cover the code generation phase. Once a schedule is selected, the compiler must translate the high-level Intermediate Representation (IR) into executable instructions. We will analyze how the backend lowers the IR into low-level formats, such as LLVM IR for CPU targets or CUDA source code for NVIDIA GPUs. By the end of this chapter, you will be able to configure an automated tuning session and inspect the output of the code generation pipeline.
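To make the lowering step concrete, the sketch below uses TVM's classic `te` schedule API (available in TVM 0.x releases; newer versions expose the same ideas through TensorIR) to split and vectorize a vector-add loop, print the lowered IR, and build it with the LLVM backend for a CPU target. Treat it as a version-dependent example rather than a canonical recipe.

```python
import tvm
from tvm import te

# C[i] = A[i] + B[i] over a fixed-size vector.
n = 1024
A = te.placeholder((n,), name="A", dtype="float32")
B = te.placeholder((n,), name="B", dtype="float32")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# One concrete schedule: split the loop into tiles of 8 and vectorize the inner part.
s = te.create_schedule(C.op)
outer, inner = s[C].split(C.op.axis[0], factor=8)
s[C].vectorize(inner)

# Inspect the lowered IR before machine code generation.
print(tvm.lower(s, [A, B, C], simple_mode=True))

# Lower through the LLVM backend for a CPU target and look at the result.
lib = tvm.build(s, [A, B, C], target="llvm")
print(lib.get_source())  # textual LLVM IR emitted for the compiled kernel
```

Swapping the target string (for example, to a GPU target with appropriate thread bindings in the schedule) routes the same lowered IR through a different backend, which is exactly the separation of concerns this chapter explores.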
5.1 Defining the Search Space
5.2 Cost Models in Auto-Tuning
5.3 Automated Schedule Search
5.4 Code Generation Backends
5.5 Running an Auto-Tuning Session