Graph-level transformations focus on reducing the overhead of launching operators and managing memory between them. However, the majority of execution time is spent within the operators themselves. These kernels are typically composed of dense nested loops performing linear algebra. A direct translation of mathematical logic into code often results in suboptimal performance due to poor cache utilization and unexploited instruction-level parallelism.
This chapter concentrates on the mechanics of loop optimization. We distinguish between the computation (what the program calculates) and the schedule (how the program executes). For example, a matrix multiplication might be defined mathematically as:

C[i, j] = Σ_k A[i, k] · B[k, j]

While the result remains constant, the order in which the processor traverses i, j, and k drastically affects throughput. You will learn to manipulate these loop nests to align with the underlying hardware architecture.
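To make the computation/schedule distinction concrete, the following sketch (plain Python, with illustrative function names not taken from this chapter) implements the same matrix product under two loop orders: the textbook i-j-k order, which strides down the columns of B, and the i-k-j order, which streams along contiguous rows of B. Both schedules compute the identical result; only the memory access pattern differs.

```python
def matmul_ijk(A, B):
    """Classic loop order: innermost k loop reads B column-wise."""
    n, kd, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(kd):
                C[i][j] += A[i][k] * B[k][j]  # strided access to B
    return C

def matmul_ikj(A, B):
    """Reordered schedule: innermost j loop reads B row-wise (contiguous)."""
    n, kd, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for k in range(kd):
            a = A[i][k]           # A[i][k] is invariant in the inner loop
            for j in range(m):
                C[i][j] += a * B[k][j]  # contiguous access to B and C
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert matmul_ijk(A, B) == matmul_ikj(A, B)  # same computation, different schedule
```

In a compiled language the reordered variant can run several times faster on large matrices, because the innermost loop walks memory sequentially; in Python the point is simply that the result is schedule-invariant.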
We will examine specific scheduling primitives including:

- Loop tiling (blocking) to improve cache reuse
- Vectorization to exploit SIMD execution units
- Loop reordering and unrolling to expose instruction-level parallelism
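As a preview of the first primitive, here is a hedged sketch of loop tiling (names are illustrative, not this chapter's API): the i, k, and j loops are each split into an outer loop over tiles of size T and an inner loop within a tile, so the working set touched by the inner loops can fit in cache. The transformation changes only the traversal order, so the result must match the untiled version.

```python
def matmul_naive(A, B):
    """Reference implementation: straightforward triple loop."""
    n, kd, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(kd):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, T=2):
    """Tiled schedule: outer loops walk over T x T blocks, inner loops stay inside one block."""
    n, kd, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, T):                       # tile loops
        for kk in range(0, kd, T):
            for jj in range(0, m, T):
                for i in range(ii, min(ii + T, n)):  # intra-tile loops
                    for k in range(kk, min(kk + T, kd)):
                        a = A[i][k]
                        for j in range(jj, min(jj + T, m)):
                            C[i][j] += a * B[k][j]
    return C

# 4x4 test matrices; min(...) bounds also handle sizes not divisible by T
A = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
B = [[float(j * 4 + i) for j in range(4)] for i in range(4)]
assert matmul_tiled(A, B) == matmul_naive(A, B)
```

The tile size T is a tuning parameter: on real hardware it is chosen so that one tile of A, B, and C together fits in a target cache level, a choice the practical section revisits.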
By applying these techniques, you will be able to transform a generic, slow implementation into a high-performance kernel. This process forms the engineering basis for how compilers map abstract tensor operations to concrete machine code.
3.1 The Loop Optimization Space
3.2 Loop Tiling and Blocking
3.3 Vectorization and SIMD
3.4 Loop Reordering and Unrolling
3.5 Hands-on Practical: Optimizing Matrix Multiplication