Efficient kernel execution relies on aligning software instructions with the physical realities of hardware. While loop tiling optimizes for cache capacity, it does not inherently address how data is retrieved from a cache line or how efficiently the CPU pipeline is utilized. Loop reordering and unrolling are transformations that target spatial locality and instruction-level parallelism (ILP).
These transformations change the execution order and structure without altering the semantic result. For a deep learning compiler, the objective is to permute the iteration space so that memory accesses are contiguous and the instruction pipeline remains saturated.
To understand why loop order matters, we must look at how multidimensional tensors are flattened into linear memory. In C/C++ and most deep learning frameworks (such as PyTorch, and TVM by default), tensors are stored in row-major order. This means the last dimension changes the fastest in memory.
Consider a two-dimensional tensor of shape (N, M). The element at index (i, j) is stored at address base + i*M + j.
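As a quick sketch (the shape and values here are illustrative), the following shows that indexing a 2D array and indexing the same storage as a flat row-major buffer reach the identical element:

#include <stdio.h>

#define N 4
#define M 3

int main(void) {
    float A[N][M];
    float *flat = &A[0][0];                /* same storage viewed as a 1D buffer */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            A[i][j] = (float)(i * 10 + j); /* distinct value per element */

    int i = 2, j = 1;
    /* row-major rule: element (i, j) lives at offset i*M + j */
    printf("A[2][1] = %.0f, flat[2*M + 1] = %.0f\n", A[i][j], flat[i * M + j]);
    return 0;
}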
The following diagram illustrates the difference in memory traversal between stride-1 access and strided access.
Memory access patterns dictate cache efficiency. Stride-1 access utilizes the full cache line, whereas strided access effectively wastes memory bandwidth.
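As a minimal sketch (the matrix name and dimensions are illustrative), the two functions below sum the same elements of a row-major matrix, but the first walks memory with stride 1 while the second jumps a full row between consecutive accesses:

#define N 1024
#define M 1024

float A[N][M];

/* Stride-1 traversal: the inner index j moves through contiguous memory,
 * so every element of each fetched cache line is used. */
float sum_row_order(void) {
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            s += A[i][j];
    return s;
}

/* Strided traversal: the inner index i jumps M floats (one full row) per step,
 * so each access may pull in a new cache line of which only one element is used. */
float sum_col_order(void) {
    float s = 0.0f;
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            s += A[i][j];
    return s;
}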
The classic example of loop reordering is Matrix Multiplication (MatMul). The mathematical definition involves three indices: i, j, and k.
A naive implementation typically uses the order:
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        for (int k = 0; k < K; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
In this configuration:

- C[i][j] is invariant in the innermost k loop and can be held in a register.
- A[i][k] is accessed with stride 1 as k advances, so fetched cache lines are used fully.
- B[k][j] jumps a full row (M elements) for every increment of k, touching a new cache line on almost every access.

Because B is accessed column-wise while stored row-wise, this schedule suffers from severe cache thrashing. By changing the loop order to (i, k, j), we alter the access pattern:
for (int i = 0; i < N; i++) {
    for (int k = 0; k < K; k++) {
        for (int j = 0; j < M; j++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
In the (i, k, j) permutation:

- A[i][k] is invariant in the innermost j loop and can be kept in a register.
- B[k][j] is accessed with stride 1 as j advances, so every fetched cache line is fully used.
- C[i][j] is likewise accessed with stride 1 as j advances.

This simple reordering can result in significant speedups, often 5x to 10x on standard CPUs, merely by respecting the hardware's preference for linear access. Compilers model this using polyhedral analysis to determine dependency-free permutations that maximize spatial locality.
While reordering optimizes data movement, loop unrolling optimizes the instruction stream. Unrolling replicates the loop body multiple times within a single iteration, so the loop counter is updated and tested less often.
Consider a standard dot product loop:
for (int i = 0; i < N; i++) {
    sum += A[i] * B[i];
}
A CPU executing this code encounters overhead: incrementing i, comparing i to N, and a branch instruction. Additionally, scalar operations often fail to fill the processor's multiple execution units.
An unrolled version might look like this:
/* unroll factor 4; assumes N is a multiple of 4 (a scalar epilogue handles the remainder) */
for (int i = 0; i < N; i += 4) {
    sum += A[i]   * B[i];
    sum += A[i+1] * B[i+1];
    sum += A[i+2] * B[i+2];
    sum += A[i+3] * B[i+3];
}
Unrolling provides three specific advantages:

- Reduced loop overhead: the increment, comparison, and branch execute once per four multiply-accumulates instead of once per element.
- More instruction-level parallelism: the wider body exposes several operations at once to the out-of-order scheduler, helping fill the processor's multiple execution units.
- A larger scheduling window: the compiler has more instructions to interleave, making it easier to hide memory and arithmetic latency.

Note that with a single sum variable the four additions still form a serial dependency chain; the sketch after this list uses separate accumulators to make the parallelism explicit.
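Here is a sketch of that refinement (the function and accumulator names are illustrative): four independent accumulators break the dependency chain on sum, and a scalar epilogue handles the case where N is not a multiple of 4.

/* Unrolled dot product with four independent accumulators. */
float dot_unrolled4(const float *A, const float *B, int N) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 3 < N; i += 4) {
        s0 += A[i]     * B[i];       /* the four updates are independent, */
        s1 += A[i + 1] * B[i + 1];   /* so they can issue in parallel      */
        s2 += A[i + 2] * B[i + 2];
        s3 += A[i + 3] * B[i + 3];
    }
    float sum = s0 + s1 + s2 + s3;
    for (; i < N; i++)               /* scalar epilogue for the remaining 0-3 elements */
        sum += A[i] * B[i];
    return sum;
}

Splitting the accumulator reassociates the floating-point additions, which is why compilers generally will not perform this transformation on their own unless relaxed floating-point flags are enabled.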
Unrolling is not a monotonic optimization where "more is better." It introduces a trade-off between pipeline saturation and register pressure.
The following chart demonstrates the performance characteristics of different unroll factors. Notice the diminishing returns and eventual regression as register pressure increases.
Normalized throughput increases with the unroll factor as loop overhead diminishes and ILP improves. However, past a factor of 8, performance degrades due to register spilling and instruction cache misses.
In frameworks like TVM or MLIR, these transformations are applied during the lowering phase. The compiler does not guess randomly; it uses cost models (discussed in Chapter 6) to select parameters.
A typical transformation recipe for a dense kernel involves a composite strategy:

- Tile the loops so each working set fits in cache (as covered for loop tiling).
- Reorder the loops so the innermost dimension walks memory with stride 1.
- Unroll the innermost loop by a factor chosen by the cost model.

A C sketch of this combined schedule follows the list.
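As a rough illustration (the tile sizes TILE_J and TILE_K, the unroll factor of 4, and the function name are placeholders a real cost model would choose), the sketch below applies all three steps to the MatMul from earlier, using the flat row-major indexing rule base + i*M + j:

#define TILE_J 64
#define TILE_K 64

/* C[N][M] += A[N][K] * B[K][M], all row-major and flattened.
 * For brevity this sketch assumes M is a multiple of TILE_J,
 * K is a multiple of TILE_K, and TILE_J is a multiple of 4. */
void matmul_tiled_unrolled(int N, int M, int K,
                           const float *A, const float *B, float *C) {
    for (int jj = 0; jj < M; jj += TILE_J) {           /* tile over j */
        for (int kk = 0; kk < K; kk += TILE_K) {       /* tile over k */
            for (int i = 0; i < N; i++) {
                for (int k = kk; k < kk + TILE_K; k++) {
                    float a = A[i * K + k];             /* invariant in j, kept in a register */
                    for (int j = jj; j < jj + TILE_J; j += 4) {
                        /* innermost j loop: stride-1 access, unrolled by 4 */
                        C[i * M + j]     += a * B[k * M + j];
                        C[i * M + j + 1] += a * B[k * M + j + 1];
                        C[i * M + j + 2] += a * B[k * M + j + 2];
                        C[i * M + j + 3] += a * B[k * M + j + 3];
                    }
                }
            }
        }
    }
}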
By combining reordering to fix memory access patterns and unrolling to saturate the instruction pipeline, the compiler generates a kernel that approaches the theoretical peak performance of the hardware.