Graph-level transformations arrange the flow of data between operators, but they do not dictate how individual operations execute on the hardware. Even with an optimized graph structure, the performance of a neural network depends heavily on the efficiency of the code inside the kernels. This chapter shifts focus from the high-level graph to the low-level loop nests that perform the actual computation.
A straightforward implementation of a tensor operation often utilizes only a fraction of a processor's theoretical peak performance. For instance, a naive matrix multiplication loop defined as $C_{ij} = \sum_{k} A_{ik} B_{kj}$ generally suffers from poor memory access patterns. On modern hardware, the bottleneck is frequently the latency of moving data from main memory into registers rather than the speed of the arithmetic logic units (ALUs).
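To make the problem concrete, here is a minimal C sketch of that naive kernel (the matmul_naive name and the row-major float layout are illustrative choices, not taken from any particular library). With the i-j-k loop order, the innermost loop walks a column of B with a stride of n elements, so nearly every iteration touches a new cache line while the ALUs wait on memory.

```c
#include <stddef.h>

/* Naive matrix multiplication: C[i][j] = sum over k of A[i][k] * B[k][j].
 * All matrices are n x n and stored in row-major order.
 * The innermost loop reads B column-wise (stride n), so it has poor
 * spatial locality and the computation is dominated by memory latency. */
void matmul_naive(size_t n, const float *A, const float *B, float *C) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k) {
                acc += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}
```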
In this chapter, we examine how compilers rewrite loop structures to increase arithmetic intensity. You will learn to manipulate the iteration space of a program to improve cache locality and instruction-level parallelism. The sections below cover specific transformation techniques, including loop tiling, vectorization for SIMD units, loop unrolling and reordering, and multi-core parallelization.
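As a first taste of these transformations, the sketch below applies loop reordering (interchange) to the naive kernel: swapping the j and k loops turns the innermost accesses to both B and C into unit-stride traversals, improving locality and making the loop straightforward to vectorize. The matmul_ikj name is again an illustrative placeholder.

```c
#include <stddef.h>
#include <string.h>

/* Same computation as matmul_naive, but with the j and k loops
 * interchanged (i-k-j order). The innermost loop now reads row k of B
 * and updates row i of C contiguously, so consecutive iterations hit
 * the same cache lines and map cleanly onto SIMD lanes. */
void matmul_ikj(size_t n, const float *A, const float *B, float *C) {
    memset(C, 0, n * n * sizeof(float));  /* C is accumulated in place */
    for (size_t i = 0; i < n; ++i) {
        for (size_t k = 0; k < n; ++k) {
            float a = A[i * n + k];        /* scalar reused across the j loop */
            for (size_t j = 0; j < n; ++j) {
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}
```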
We will also discuss strategies for hiding memory latency by overlapping data transfer with computation. By the end of this chapter, you will understand how to inspect a loop nest and apply a sequence of scheduling primitives to optimize a matrix multiplication kernel for a specific hardware target.
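As a preview of the kind of schedule built in the practice section, the sketch below layers loop tiling on top of the reordered micro-kernel from above, so each block of A, B, and C is reused many times while it is resident in cache. The matmul_tiled name and the TILE value of 64 are arbitrary placeholders; a real schedule chooses tile sizes to match the cache hierarchy of the target.

```c
#include <stddef.h>
#include <string.h>

#define TILE 64  /* illustrative tile size; the right value is target-specific */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Tiled (blocked) matrix multiplication: the iteration space is split
 * into TILE x TILE blocks so the working set of each block fits in
 * cache, and the inner micro-kernel keeps the unit-stride i-k-j order. */
void matmul_tiled(size_t n, const float *A, const float *B, float *C) {
    memset(C, 0, n * n * sizeof(float));
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* Micro-kernel over one block of the iteration space */
                for (size_t i = ii; i < min_sz(ii + TILE, n); ++i)
                    for (size_t k = kk; k < min_sz(kk + TILE, n); ++k) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + TILE, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```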
4.1 Loop Tiling and Cache Locality
4.2 Vectorization for SIMD
4.3 Loop Unrolling and Reordering
4.4 Parallelization Strategies
4.5 Memory Latency Hiding
4.6 Matrix Multiplication Practice