Optimizing a tensor program is effectively a search problem within a high-dimensional space of valid program transformations. When a deep learning framework hands a computational graph to a compiler, the mathematical definition of an operator, such as a convolution or matrix multiplication, is fixed. However, the specific sequence of instructions used to compute that result is flexible. This separation between the computation (what to calculate) and the schedule (how to calculate it) is the foundational principle of modern tensor compilers like TVM and Halide.

The loop optimization space represents the set of all possible schedules that preserve the functional correctness of the original program while altering its performance characteristics. Navigating this space requires balancing competing hardware constraints, primarily memory bandwidth, cache locality, and instruction-level parallelism.

## Separating Compute from Schedule

To understand the optimization space, we first formalize the distinction between the algorithm and its execution. Consider a standard matrix multiplication $C = A \times B$ where $A \in \mathbb{R}^{M \times K}$ and $B \in \mathbb{R}^{K \times N}$.

The logical definition implies a summation reduction:

$$ C_{i,j} = \sum_{k=0}^{K-1} A_{i,k} \times B_{k,j} $$

In a naive C++ implementation, this logic maps directly to a triply nested loop.

```cpp
for (int i = 0; i < M; ++i) {
  for (int j = 0; j < N; ++j) {
    for (int k = 0; k < K; ++k) {
      C[i * N + j] += A[i * K + k] * B[k * N + j];
    }
  }
}
```

While semantically correct, this ordering ($i, j, k$) is merely one point in the optimization space. It dictates a specific traversal order of memory. If matrices $A$ and $B$ are large, this order may cause frequent cache misses for $B$ because the inner loop strides through $B$ in a non-contiguous manner (assuming row-major storage).

A schedule defines the transformations applied to these loops. We can permute the order to $i, k, j$, split loops into smaller chunks, or unroll them. Each combination constitutes a unique coordinate in the search space.

## Dimensions of the Search Space

The optimization space is constructed from composable scheduling primitives. We can categorize these primitives into three primary axes that define the shape and performance of the generated kernel.

### 1. Iteration Order (Permutation)

The sequence of nested loops determines the memory access pattern. For a loop nest of depth $D$, there are $D!$ possible permutations of the loop indices, assuming no data dependencies prevent reordering. Changing the order from $i, j, k$ to $i, k, j$ changes the access to matrix $B$ from column-wise (strided) to row-wise (sequential), significantly improving spatial locality.

### 2. Loop Structure (Tiling and Splitting)

Tiling breaks a single loop into two nested loops: an outer loop that iterates over tiles and an inner loop that iterates within a tile. This transformation adds dimensions to the optimization space. A loop over $i$ from $0$ to $N$ can be split by a factor $T_{factor}$, creating an outer loop of range $\lceil N/T_{factor} \rceil$ and an inner loop of range $T_{factor}$.

The choice of split factors is discrete (bounded by the loop extent) and combinatorial. If we tile all three loops of a matrix multiplication, we introduce three new parameters to tune. These parameters must be chosen so that the working set of the inner loops fits into the L1 or L2 cache.
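To make the split concrete, the following is a minimal sketch of the $i$ loop split by a factor of 32. The factor is an assumed, illustrative value rather than a tuned one, and the snippet reuses the `A`, `B`, `C`, `M`, `N`, `K` declarations from the naive kernel above.

```cpp
// Split the i loop into an outer loop over tiles and an inner loop within a
// tile. The computed values are unchanged; only the loop structure (and the
// tunable parameter T) is new.
const int T = 32;  // split factor (assumed for illustration)
for (int i_outer = 0; i_outer < (M + T - 1) / T; ++i_outer) {  // ceil(M / T) tiles
  for (int i_inner = 0; i_inner < T; ++i_inner) {
    int i = i_outer * T + i_inner;
    if (i >= M) break;  // guard for the final partial tile when T does not divide M
    for (int j = 0; j < N; ++j) {
      for (int k = 0; k < K; ++k) {
        C[i * N + j] += A[i * K + k] * B[k * N + j];
      }
    }
  }
}
```

On its own this split changes little; its value comes from composing it with reordering, so that the loops iterating within a tile can be moved inward to improve locality.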
### 3. Hardware Mapping (Binding)

Once loops are structured, they must be mapped to hardware execution units. This includes:

- **Vectorization:** marking an inner loop to be executed using SIMD instructions (e.g., AVX-512, NEON).
- **Parallelization:** mapping outer loops to parallel threads (e.g., OpenMP threads on a CPU, thread blocks on a GPU).
- **Unrolling:** fully expanding a loop at compile time to increase instruction density and allow the CPU pipeline to look ahead.

The following diagram visualizes how a single loop definition expands into a tree of possible lowered implementations based on these decisions.

```dot
digraph G {
  rankdir=LR;
  node [fontname="Arial", shape=box, style=filled, color="#dee2e6", fontcolor="#495057"];
  edge [color="#adb5bd"];

  Start     [label="Logical Compute\n(i, j, k)", fillcolor="#e7f5ff", color="#74c0fc"];
  Split     [label="Split/Tile\n(i_out, i_in, ...)", fillcolor="#eebefa", color="#da77f2"];
  Reorder   [label="Reorder\n(i_out, k, j_out...)", fillcolor="#eebefa", color="#da77f2"];
  Fuse      [label="Operator Fusion", fillcolor="#d0bfff", color="#9775fa"];
  Vectorize [label="Vectorize Inner", fillcolor="#b2f2bb", color="#51cf66"];
  Unroll    [label="Unroll", fillcolor="#b2f2bb", color="#51cf66"];
  Bind      [label="Thread Binding", fillcolor="#b2f2bb", color="#51cf66"];
  Binary    [label="Machine Code", fillcolor="#ffc9c9", color="#ff6b6b"];

  Start -> Split;
  Split -> Reorder;
  Start -> Fuse;
  Fuse -> Split;
  Reorder -> Vectorize;
  Reorder -> Unroll;
  Reorder -> Bind;
  Vectorize -> Binary;
  Unroll -> Binary;
  Bind -> Binary;
}
```

The trajectory from abstract logic to machine code involves branching decisions at the structural (tiling), ordering, and hardware-mapping levels.

## Constraints and Dependencies

Not every point in the optimization space is valid. The compiler must respect data dependencies to preserve the program's correctness.

If iteration $k+1$ depends on data produced in iteration $k$ (a read-after-write dependency), we cannot parallelize the $k$ loop, nor can we reorder it freely. Polyhedral analysis often represents these dependencies as a set of linear inequalities; a valid schedule preserves the partial order of events dictated by them.

For deep learning, many operations such as convolution and matrix multiplication are "embarrassingly parallel" along certain axes (like the batch or output-channel dimensions) but are reductions along others (like input channels). Identifying which axes allow parallel mapping and which require synchronization is a primary task of the loop analysis phase.

## The Roofline Model and Performance Targets

We search the optimization space to maximize throughput, usually measured in GFLOPS (giga floating-point operations per second). The bounds of attainable performance are described by the Roofline Model, which relates arithmetic intensity to hardware limits.

Arithmetic intensity is the ratio of floating-point operations performed to bytes of data moved from DRAM:

$$ I = \frac{\text{FLOPs}}{\text{Bytes Transferred}} $$

If a kernel has low arithmetic intensity, it is memory-bound: its performance is limited by memory bandwidth. Transformations like tiling aim to increase $I$ by keeping data in the cache, preventing repeated fetches from DRAM. If a kernel has high arithmetic intensity, it is compute-bound: performance is limited by the processor's peak compute capability. Transformations like vectorization and unrolling aim to saturate the compute units.
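To give a rough sense of scale, consider a single-precision multiplication with $M = N = K = 1024$ (an assumed problem size, used only for illustration). The kernel performs

$$ \text{FLOPs} = 2MNK = 2 \times 1024^3 \approx 2.1 \times 10^9. $$

If, pessimistically, every element of $A$ and $B$ had to be re-fetched from DRAM on each use, the traffic would be about $2 \times MNK \times 4 \approx 8.6 \times 10^9$ bytes, giving $I \approx 0.25$ FLOPs/byte, deep in the memory-bound region. If each matrix were moved from DRAM only once, the traffic would drop to $(MK + KN + MN) \times 4 \approx 1.3 \times 10^7$ bytes, giving $I \approx 170$ FLOPs/byte. Tiling moves a kernel between these two extremes by increasing reuse of data already resident in cache.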
Compute Bounds", "xaxis": { "title": "Arithmetic Intensity (FLOPs / Byte)", "type": "log", "range": [-1, 2.5], "showgrid": true, "gridcolor": "#dee2e6" }, "yaxis": { "title": "Performance (GFLOPS)", "type": "log", "range": [0, 3.5], "showgrid": true, "gridcolor": "#dee2e6" }, "plot_bgcolor": "white", "showlegend": true, "legend": { "x": 0.7, "y": 0.1 } }, "data": [ { "x": [0.1, 1, 10, 100], "y": [10, 100, 1000, 1000], "mode": "lines", "name": "Roofline Limit", "line": { "color": "#495057", "width": 3 } }, { "x": [0.25], "y": [25], "mode": "markers+text", "name": "Naive Loop", "text": ["Naive"], "textposition": "top left", "marker": { "color": "#fa5252", "size": 12 } }, { "x": [8], "y": [600], "mode": "markers+text", "name": "Optimized Loop", "text": ["Tiled & Vectorized"], "textposition": "bottom right", "marker": { "color": "#40c057", "size": 12 } } ] }A kernel's performance is clamped by either bandwidth (the sloped line) or peak compute (the horizontal line). Loop optimizations move the execution point from the memory-bound region (left, red) toward the compute-bound region (right, green).The Combinatorial ExplosionThe volume of the optimization space makes manual tuning impractical for every operator on every hardware backend.Consider a simple 3-level loop nest.Tiling: We can tile each loop level. If we choose tile sizes from ${2, 4, 8, 16, 32}$, that creates $5^3 = 125$ combinations.Reordering: There are $3! = 6$ orderings.Unrolling: We might unroll the inner loop by factors of ${0, 4, 16, 64}$.Even in this trivial example, we have thousands of potential schedules. In production networks with dozens of operators and deeper loop nests (e.g., 7-loop nests for convolutions with batches and channels), the space contains billions of candidates.Furthermore, the performance function is non-convex and discontinuous. A tile size of 32 might fit perfectly in the L1 cache, while a size of 33 causes conflict misses that degrade performance by an order of magnitude. This creates "cliffs" in the optimization space.Navigating this space effectively requires automated strategies. While we will explore auto-tuning algorithms (like Genetic Algorithms or Simulated Annealing) in Chapter 6, the goal of this chapter is to understand the mechanics of the transformations themselves. You must learn how to manually construct a valid, high-performance schedule to understand what the auto-tuners are searching for.