Modern hardware accelerators perform arithmetic operations orders of magnitude faster than they can fetch data from main memory. When optimizing machine learning kernels, particularly dense operations like matrix multiplication or convolution, the primary bottleneck is rarely the raw computation speed. Instead, performance is limited by memory bandwidth. To address this, we must restructure our loops to prioritize data reuse, a strategy centered on cache locality.

## The Memory Hierarchy Challenge

Processors utilize a hierarchy of memory to mitigate the latency of fetching data from DRAM (Dynamic Random Access Memory). Closest to the arithmetic units are the registers and L1 cache, followed by the L2 and L3 caches. Accessing data from the L1 cache takes only a few clock cycles, whereas fetching data from DRAM can take hundreds.

The goal of loop optimization is to keep the active working set of data within the fastest possible cache level for as long as possible. This relies on two principles:

- **Temporal Locality:** If a data location is referenced, it will likely be referenced again soon.
- **Spatial Locality:** If a data location is referenced, data locations with nearby addresses will likely be referenced soon.

Deep learning operators often exhibit high reuse potential, but naive loop implementations fail to exploit it.

## Analyzing Naive Matrix Multiplication

Consider a standard matrix multiplication $C = A \times B$, where all matrices are of size $N \times N$. The mathematical definition is:

$$C_{i,j} = \sum_{k=0}^{N-1} A_{i,k} \times B_{k,j}$$

In a typical row-major layout (common in C++ and PyTorch), consecutive elements of a row are stored contiguously in memory. A naive implementation using three nested loops looks like this:

```c
// Naive triple loop; C is assumed to be zero-initialized.
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
```

While this code is mathematically correct, the memory access pattern for matrix $B$ is inefficient. As the inner loop increments `k`, we access `B[0][j]`, `B[1][j]`, `B[2][j]`, and so on. In a row-major layout, these elements are stored $N$ positions apart. If $N$ is large, every access to $B$ jumps across memory, likely causing a cache miss. By the time the loop returns to a previously loaded cache line, that line may have been evicted to make room for other data. This creates a scenario known as cache thrashing.
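To get a feel for how costly the strided pattern is on its own, the short C sketch below traverses a large row-major matrix once with unit stride and once with a stride of $N$ floats, mimicking the `B[k][j]` access pattern. This is a hedged illustration rather than part of the kernel above: the matrix size, the flat indexing, and the use of `clock()` for timing are assumptions chosen for a self-contained demo.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096   /* illustrative size: each row is 16 KiB, the matrix ~64 MiB */

int main(void) {
    float *M = malloc((size_t)N * N * sizeof *M);
    if (!M) return 1;
    for (size_t idx = 0; idx < (size_t)N * N; idx++) M[idx] = 1.0f;

    volatile float sink = 0.0f;  /* keeps the compiler from deleting the loops */

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)          /* unit stride: walks each row in order */
        for (int j = 0; j < N; j++)
            sink += M[(size_t)i * N + j];
    double row_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    for (int j = 0; j < N; j++)          /* stride of N floats: same pattern as B[k][j] */
        for (int i = 0; i < N; i++)
            sink += M[(size_t)i * N + j];
    double col_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("row-wise traversal:    %.3f s\n", row_s);
    printf("column-wise traversal: %.3f s\n", col_s);
    free(M);
    return 0;
}
```

On most hardware the strided traversal is noticeably slower even though both loops perform identical arithmetic; that gap is what the transformation below is designed to close.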
## Loop Tiling Mechanics

Loop tiling, also known as blocking, addresses this by partitioning the iteration space into smaller chunks (tiles) that fit entirely inside the cache. Instead of iterating over the entire range of $i$, $j$, and $k$, we iterate over small blocks of the matrix.

This transformation requires splitting the original loops into "outer" loops (which iterate over the tiles) and "inner" loops (which iterate within a tile).

Here is the structural change for the matrix multiplication example. We introduce a tile size factor, $T_{size}$; for simplicity, assume $N$ is a multiple of $T_{size}$.

```c
// Outer loops iterate over tiles
for (int i_outer = 0; i_outer < N; i_outer += T_size) {
    for (int j_outer = 0; j_outer < N; j_outer += T_size) {
        for (int k_outer = 0; k_outer < N; k_outer += T_size) {
            // Inner loops iterate within the current tile
            for (int i_inner = i_outer; i_inner < i_outer + T_size; i_inner++) {
                for (int j_inner = j_outer; j_inner < j_outer + T_size; j_inner++) {
                    for (int k_inner = k_outer; k_inner < k_outer + T_size; k_inner++) {
                        C[i_inner][j_inner] += A[i_inner][k_inner] * B[k_inner][j_inner];
                    }
                }
            }
        }
    }
}
```

In this tiled version, the processor loads a small $T_{size} \times T_{size}$ block of matrix $A$ and matrix $B$ into the cache. The inner loops perform all necessary calculations involving these blocks before moving to the next set. Because the blocks are small enough to fit in the cache, we eliminate the repeated expensive trips to main memory.

The following diagram illustrates how the matrix is logically divided into smaller processing blocks.

```json
{
  "layout": {
    "width": 600, "height": 600, "title": "Matrix Tiling Visualization",
    "xaxis": {"title": "Column Index (j)", "showgrid": false, "zeroline": false},
    "yaxis": {"title": "Row Index (i)", "showgrid": false, "zeroline": false, "autorange": "reversed"},
    "shapes": [
      {"type": "rect", "x0": 0, "y0": 0, "x1": 16, "y1": 16, "line": {"color": "#adb5bd", "width": 2}},
      {"type": "rect", "x0": 4, "y0": 4, "x1": 8, "y1": 8, "fillcolor": "#4dabf7", "opacity": 0.5, "line": {"color": "#1c7ed6", "width": 3}}
    ]
  },
  "data": [{
    "x": [2, 6, 10, 14, 2, 6, 10, 14, 2, 6, 10, 14, 2, 6, 10, 14],
    "y": [2, 2, 2, 2, 6, 6, 6, 6, 10, 10, 10, 10, 14, 14, 14, 14],
    "mode": "text",
    "text": ["Tile (0,0)", "Tile (0,1)", "Tile (0,2)", "Tile (0,3)", "Tile (1,0)", "Active Tile", "Tile (1,2)", "Tile (1,3)", "Tile (2,0)", "Tile (2,1)", "Tile (2,2)", "Tile (2,3)", "Tile (3,0)", "Tile (3,1)", "Tile (3,2)", "Tile (3,3)"],
    "type": "scatter"
  }]
}
```

The matrix is divided into a $4 \times 4$ grid of tiles. The highlighted blue region represents the "Active Tile" currently loaded into the cache, allowing local computations to complete before the data is evicted.

## Choosing the Tile Size

Selecting the correct $T_{size}$ is critical and hardware-dependent.

- **Too small:** The loop overhead (incrementing counters, branching) dominates the execution time, and you fail to fully utilize the available cache lines.
- **Too large:** The data required for the inner loops exceeds the cache size, causing evictions and reverting performance to that of the naive implementation.

In automated ML compilation stacks like TVM or MLIR, finding the optimal tile size is often handled by the auto-tuning component. The tuner empirically tests candidate tile configurations (e.g., $16 \times 16$, $32 \times 32$, $64 \times 64$) to find the sweet spot for the specific L1/L2 cache sizes of the target hardware.

## Multi-Level Tiling

Modern CPUs and GPUs have multiple levels of cache. To maximize performance, compilers often apply tiling hierarchically: we might use a larger tile size that fits into the L2 cache, and then subdivide it into smaller tiles that fit into the L1 cache or register file.
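A minimal sketch of this two-level structure in plain C is shown below. The problem size and tile edges (`N`, `T_L2`, `T_L1`) are illustrative assumptions, $N$ is taken to be a multiple of both tile edges, and `C` is assumed to start out zeroed; compilers derive far more elaborate variants of the same pattern by repeatedly splitting and reordering loops.

```c
#define N    1024   /* illustrative problem size */
#define T_L2 128    /* hypothetical tile edge sized for the L2 cache */
#define T_L1 32     /* hypothetical sub-tile edge sized for the L1 cache */

/* Two-level tiled matrix multiplication: outer tiles target L2,
 * inner sub-tiles target L1. Assumes N % T_L2 == 0, T_L2 % T_L1 == 0,
 * and that C starts out zeroed. */
void matmul_two_level(const float A[N][N], const float B[N][N], float C[N][N])
{
    for (int i2 = 0; i2 < N; i2 += T_L2)
      for (int j2 = 0; j2 < N; j2 += T_L2)
        for (int k2 = 0; k2 < N; k2 += T_L2)
          /* Within one L2-sized tile, walk L1-sized sub-tiles. */
          for (int i1 = i2; i1 < i2 + T_L2; i1 += T_L1)
            for (int j1 = j2; j1 < j2 + T_L2; j1 += T_L1)
              for (int k1 = k2; k1 < k2 + T_L2; k1 += T_L1)
                /* Innermost loops touch only data small enough for L1 and registers. */
                for (int i = i1; i < i1 + T_L1; i++)
                  for (int j = j1; j < j1 + T_L1; j++)
                    for (int k = k1; k < k1 + T_L1; k++)
                      C[i][j] += A[i][k] * B[k][j];
}
```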
The transformation graph below depicts the flow from main memory to the computation units, illustrating how different tile sizes map to different hardware memory levels.

```dot
digraph G {
    rankdir=TB;
    node [shape=box, style=filled, fillcolor="#e9ecef", fontname="Helvetica"];
    edge [fontname="Helvetica", color="#868e96"];

    DRAM [label="DRAM (Main Memory)\nSlow, High Capacity", fillcolor="#dee2e6", height=1];
    L2   [label="L2 Cache\nMedium Speed\nLarge Tile Size", fillcolor="#a5d8ff"];
    L1   [label="L1 Cache\nFast\nSmall Tile Size", fillcolor="#74c0fc"];
    Reg  [label="Registers\nInstant Access\nScalar/Vector Values", fillcolor="#4dabf7"];
    ALU  [label="ALU / FPU\nComputation", shape=ellipse, fillcolor="#ffc9c9"];

    DRAM -> L2 [label=" Block Load"];
    L2 -> L1   [label=" Sub-Block Load"];
    L1 -> Reg  [label=" Value Load"];
    Reg -> ALU [label=" Execute"];
}
```

Data movement hierarchy showing how tiling strategies align with hardware memory levels. Larger tiles target the L2 cache, while sub-tiles target L1 and the registers.

## Impact on Arithmetic Intensity

The ultimate metric we are trying to improve is arithmetic intensity: the ratio of floating-point operations (FLOPs) performed to bytes of memory transferred.

In the naive matrix multiplication, each inner iteration performs only two floating-point operations (a multiply and an add) for a handful of element accesses, and the access to $B$ is likely to miss the cache. With tiling, we load blocks of $A$ and $B$ totaling $2 \times T_{size}^2$ elements and perform roughly $2 \times T_{size}^3$ floating-point operations on that data. As $T_{size}$ increases (up to the cache limit), the ratio of computation to memory traffic improves significantly; a brief worked example appears at the end of this section.

By ensuring that the arithmetic units are fed data constantly from the cache rather than waiting on main memory, loop tiling transforms a memory-bound operation into a compute-bound operation, allowing the hardware to approach its theoretical peak performance. In the next section, we will combine tiling with vectorization to further exploit the capabilities of modern processors.
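As a rough worked example of the intensity ratio discussed above (assuming 4-byte single-precision elements and an illustrative tile size of $T_{size} = 32$):

$$\text{Arithmetic Intensity} \;\approx\; \frac{2\,T_{size}^{3}\ \text{FLOPs}}{2\,T_{size}^{2} \times 4\ \text{bytes}} \;=\; \frac{T_{size}}{4}\ \text{FLOPs/byte}$$

so $T_{size} = 32$ yields roughly $8$ FLOPs per byte transferred. Doubling the tile size doubles this ratio, which is why tile sizes are grown until the tiles no longer fit in the target cache.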