Graph-level transformations focus on reducing the overhead of launching operators and managing memory between them. However, the majority of execution time is spent within the operators themselves. These kernels are typically composed of dense nested loops performing linear algebra. A direct translation of mathematical logic into code often results in suboptimal performance due to poor cache utilization and unexploited instruction-level parallelism.
This chapter concentrates on the mechanics of loop optimization. We distinguish between the computation (what the program calculates) and the schedule (how the program executes). For example, a matrix multiplication might be defined mathematically as:

C[i, j] = Σ_k A[i, k] · B[k, j]

While the result remains constant, the order in which the processor traverses i, j, and k drastically affects throughput. You will learn to manipulate these loop nests to align with the underlying hardware architecture.
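To make the computation/schedule distinction concrete, the following sketch (plain Python, with illustrative function names not taken from this chapter) implements the same matrix product under two loop orders: the textbook i-j-k order, which strides down the columns of B, and the i-k-j order, which streams along contiguous rows of B. Both schedules compute the identical result; only the memory access pattern differs.

```python
def matmul_ijk(A, B):
    """Classic loop order: innermost k loop reads B column-wise."""
    n, kd, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(kd):
                C[i][j] += A[i][k] * B[k][j]  # strided access to B
    return C

def matmul_ikj(A, B):
    """Reordered schedule: innermost j loop reads B row-wise (contiguous)."""
    n, kd, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for k in range(kd):
            a = A[i][k]           # A[i][k] is invariant in the inner loop
            for j in range(m):
                C[i][j] += a * B[k][j]  # contiguous access to B and C
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert matmul_ijk(A, B) == matmul_ikj(A, B)  # same computation, different schedule
```

In a compiled language the reordered variant can run several times faster on large matrices, because the innermost loop walks memory sequentially; in Python the point is simply that the result is schedule-invariant.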
We will examine specific scheduling primitives including:

- Loop tiling (blocking) to improve cache reuse
- Vectorization to exploit SIMD execution units
- Loop reordering and unrolling to expose instruction-level parallelism
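As a preview of the first primitive, here is a hedged sketch of loop tiling (names are illustrative, not this chapter's API): the i, k, and j loops are each split into an outer loop over tiles of size T and an inner loop within a tile, so the working set touched by the inner loops can fit in cache. The transformation changes only the traversal order, so the result must match the untiled version.

```python
def matmul_naive(A, B):
    """Reference implementation: straightforward triple loop."""
    n, kd, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(kd):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, T=2):
    """Tiled schedule: outer loops walk over T x T blocks, inner loops stay inside one block."""
    n, kd, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, T):                       # tile loops
        for kk in range(0, kd, T):
            for jj in range(0, m, T):
                for i in range(ii, min(ii + T, n)):  # intra-tile loops
                    for k in range(kk, min(kk + T, kd)):
                        a = A[i][k]
                        for j in range(jj, min(jj + T, m)):
                            C[i][j] += a * B[k][j]
    return C

# 4x4 test matrices; min(...) bounds also handle sizes not divisible by T
A = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
B = [[float(j * 4 + i) for j in range(4)] for i in range(4)]
assert matmul_tiled(A, B) == matmul_naive(A, B)
```

The tile size T is a tuning parameter: on real hardware it is chosen so that one tile of A, B, and C together fits in a target cache level, a choice the practical section revisits.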
By applying these techniques, you will be able to transform a generic, slow implementation into a high-performance kernel. This process forms the engineering basis for how compilers map abstract tensor operations to concrete machine code.
3.1 The Loop Optimization Space
3.2 Loop Tiling and Blocking
3.3 Vectorization and SIMD
3.4 Loop Reordering and Unrolling
3.5 Hands-on Practical: Optimizing Matrix Multiplication