Modern deep learning accelerators provide massive arithmetic throughput, yet peak utilization is rarely limited by the speed of the floating-point units. The bottleneck is almost always the rate at which data moves from memory to the processor: fetching operands from DRAM can cost orders of magnitude more than performing the multiplication itself. This discrepancy is commonly referred to as the memory wall.

To mitigate this latency, hardware architects employ a hierarchy of caches (L1, L2, L3). Loop tiling, also known as loop blocking, is the primary software transformation used to align software execution with this hardware hierarchy. It partitions the iteration space of a loop nest into smaller blocks that fit into a specific level of fast memory, ensuring that once a chunk of data is loaded, it is reused as many times as possible before being evicted.

## The Mechanics of Locality

Consider a standard dense matrix multiplication $C = A \times B$, where all matrices are of size $N \times N$. A naive implementation typically consists of three nested loops:

$$ \text{for } i \in [0, N): \text{ for } j \in [0, N): \text{ for } k \in [0, N): \quad C_{i,j} \mathrel{{+}{=}} A_{i,k} \times B_{k,j} $$

In a row-major storage format, the access patterns reveal a performance hazard:

- **Matrix A ($A_{i,k}$):** Accessed sequentially along a row. This exhibits good spatial locality.
- **Matrix B ($B_{k,j}$):** Accessed with a stride of $N$. For every iteration of the inner $k$ loop, the memory address jumps by $N \times \text{sizeof(float)}$. This breaks spatial locality and, when $N$ is large, often results in a cache miss on every load.
- **Matrix C ($C_{i,j}$):** Accessed repeatedly within the inner loop, exhibiting excellent temporal locality.

When $N$ is sufficiently large, the working set exceeds the cache capacity. By the time the program finishes one row of $B$ and needs to reuse elements of $A$, the cache lines containing $A$ have likely been evicted.

## Transformation to Tiled Loops

Tiling addresses this by introducing new loops that traverse the data in blocks. We split the original iteration space $(i, j, k)$ into outer loops $(i_{outer}, j_{outer}, k_{outer})$ that step over tiles, and inner loops $(i_{inner}, j_{inner}, k_{inner})$ that iterate within a tile.

If we choose a tile size of $T$, the transformed logic looks like this:

```c
// Stepping through tiles
for (int i_o = 0; i_o < N; i_o += T) {
  for (int j_o = 0; j_o < N; j_o += T) {
    for (int k_o = 0; k_o < N; k_o += T) {
      // Executing the tile (fits in cache)
      for (int i_i = i_o; i_i < min(i_o + T, N); i_i++) {
        for (int j_i = j_o; j_i < min(j_o + T, N); j_i++) {
          for (int k_i = k_o; k_i < min(k_o + T, N); k_i++) {
            C[i_i][j_i] += A[i_i][k_i] * B[k_i][j_i];
          }
        }
      }
    }
  }
}
```

This restructuring changes the order of operations but maintains mathematical equivalence. The processor now loads a $T \times T$ block of $A$ and $B$ into the cache and computes all the partial products that use that data before moving to the next block.
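To make the equivalence claim concrete, here is a minimal Python sketch of the same blocked traversal using NumPy. The sizes `N = 256` and `T = 32` are arbitrary illustrative choices (with `N` divisible by `T` for simplicity), and each block product stands in for the three innermost C loops above.

```python
import numpy as np

N, T = 256, 32  # illustrative problem size and tile size (N divisible by T)
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)

# Tiled traversal: same arithmetic as the naive triple loop, different order.
for i_o in range(0, N, T):
    for j_o in range(0, N, T):
        for k_o in range(0, N, T):
            # Each block product touches only a T x T working set per operand.
            C[i_o:i_o+T, j_o:j_o+T] += (
                A[i_o:i_o+T, k_o:k_o+T] @ B[k_o:k_o+T, j_o:j_o+T]
            )

# The result matches the untiled product (up to float32 rounding differences).
assert np.allclose(C, A @ B, rtol=1e-4, atol=1e-4)
```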
The following diagram illustrates how the iteration space is partitioned into blocks. The highlighted tile is the block currently being processed, which must fit entirely within the target cache level.

*Figure: Decomposition of a large matrix iteration space into smaller tiles. The active tile represents the working set currently resident in the L1 cache; each block is loaded from main memory once and then reused heavily.*

## Analytical Cost Model

We can quantify the benefit of tiling by calculating the data reuse factor. Assume a simplified cache model in which the cache can hold the three $T \times T$ blocks required by the inner loops.

Without tiling, the total number of memory transfers for $A$ (assuming strict LRU eviction and large $N$) approaches $N^3$, because the rows of $A$ are re-fetched for every column of $B$.

With tiling, computing a $T \times T$ block of $C$ requires iterating through $N/T$ blocks of $A$ and $B$:

- **Block $C_{tile}$:** $T^2$ elements, which stay resident in registers or L1 for the duration of the reduction.
- **Total loads of $A$:** Each block of $A$ must be re-loaded once for every $j$-block of $C$, so the global traffic for $A$ is $(N/T) \times N^2$ elements.
- **Total loads of $B$:** By symmetry, also $(N/T) \times N^2$ elements.

The arithmetic intensity therefore improves linearly with the tile size $T$:

$$ \text{Operational Intensity} \approx \frac{2N^3 \ \text{(total operations)}}{2\frac{N^3}{T} \ \text{(total elements transferred)}} = T $$

Dividing by $\text{sizeof(float)}$ converts this into FLOPs per byte, but the key point is the linear dependence on $T$: larger tiles mean better bandwidth efficiency, provided they fit in the cache. If $T$ is chosen such that $3T^2 \times \text{sizeof(float)} > \text{CacheSize}$, cache thrashing occurs and performance degrades sharply.

The chart below illustrates the theoretical relationship between tile size and effective memory bandwidth utilization. Note the performance cliff where the tile size exceeds cache capacity.

*Figure: Effective bandwidth (GB/s) vs. tile size (elements). Performance increases with tile size due to better reuse until the working set exceeds the L1 cache capacity (dashed red line), causing a sharp drop due to thrashing.*
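To make the capacity constraint concrete, the sketch below computes the largest square tile satisfying $3T^2 \times \text{sizeof(float)} \le \text{CacheSize}$ and tabulates the working set for a few candidate tile sizes. The 32 KiB L1 size and float32 element width are illustrative assumptions, not properties of any particular processor.

```python
import math

# Illustrative assumptions, not tied to any particular processor.
L1_BYTES = 32 * 1024   # assumed 32 KiB L1 data cache
ELEM_BYTES = 4         # sizeof(float) for float32

# Largest square tile satisfying 3 * T^2 * sizeof(float) <= CacheSize.
t_max = math.isqrt(L1_BYTES // (3 * ELEM_BYTES))
print(f"Largest tile that fits three blocks in L1: {t_max} x {t_max}")

for T in (8, 16, 32, 64, 128):
    working_set = 3 * T * T * ELEM_BYTES   # one block each of A, B, and C
    fits = working_set <= L1_BYTES
    # Reuse factor from the cost model: roughly T operations per element moved.
    print(f"T={T:4d}  working set={working_set / 1024:6.1f} KiB  "
          f"reuse ~{T}x  fits in L1: {fits}")
```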
## Multi-Level Tiling and Register Blocking

Modern CPUs and GPUs have multiple layers of memory (registers, L1, L2, VRAM/DRAM). Advanced compilers do not stop at a single level of tiling; they perform multi-level tiling to maximize reuse at every level of the hierarchy:

- **L2/L3 tiling:** A coarse-grained blocking strategy in which large tiles (e.g., $128 \times 128$) are kept in the L2 cache.
- **L1 tiling:** Inside each L2 tile, the kernel iterates over smaller sub-tiles (e.g., $32 \times 32$) that fit in L1.
- **Register tiling (micro-kernel):** At the innermost level, the compiler unrolls loops to keep a small grid (e.g., $4 \times 8$) of accumulator variables in CPU registers. This prevents the ALUs from stalling while waiting on L1 cache accesses.

## Compiler Implementation in TVM

In deep learning compilers like Apache TVM or Halide, tiling is expressed as a scheduling primitive separate from the algorithm definition. This separation allows the search for optimal tile sizes ($T$) to be automated.

For example, in TVM's `te` (Tensor Expression) language, a tiling schedule is applied to the axes of the computation:

```python
import tvm
from tvm import te

N = 1024

# Define the computation
A = te.placeholder((N, N), name='A')
B = te.placeholder((N, N), name='B')
k = te.reduce_axis((0, N), name='k')
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k))

# Create a schedule
s = te.create_schedule(C.op)

# Apply tiling: split the i and j axes by factors bn and bm
bn, bm = 32, 32
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bm)

# Split the reduction axis and reorder: outer spatial loops first,
# then the reduction, then the cache-resident inner loops
k_outer, k_inner = s[C].split(k, factor=4)
s[C].reorder(xo, yo, k_outer, xi, yi, k_inner)
```

The `tile` primitive splits the two axes and reorders them in one step. The subsequent `reorder` call is important: it ensures that the loops iterating over the cache-resident block (`xi`, `yi`) are the innermost spatial loops.

## Constraints and Padding

Tiling introduces complexity when the matrix dimension $N$ is not perfectly divisible by the tile size $T$. The result is "remainder" loops: small slivers of computation at the edges of the matrix.

Compilers handle this in two ways:

1. **Loop splitting:** Generate a separate, unoptimized loop body for the boundary region. This keeps the main kernel free of bounds checks but increases binary size.
2. **Padding:** Allocate slightly larger input tensors (rounded up to the nearest multiple of $T$) and fill the extra space with zeros (or identity values, depending on the operation). This lets the optimized kernel run uniformly over the entire domain and often yields better performance by avoiding branch divergence within the kernel. A short sketch of this approach appears at the end of this section.

By mastering loop tiling, you define the granularity at which the hardware consumes data. This is the primary step in converting a mathematical definition into a high-performance binary. Subsequent sections will examine how to vectorize these tiles to utilize SIMD instructions.
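As referenced in the padding discussion above, here is a minimal NumPy sketch of that strategy. The sizes, the `pad_to_multiple` helper, and the use of zero as the padding value (which is a valid identity for matrix multiplication) are illustrative assumptions rather than any particular compiler's implementation.

```python
import numpy as np

def pad_to_multiple(M, T):
    """Zero-pad a 2-D array so both dimensions become multiples of the tile size T."""
    rows, cols = M.shape
    pad_r = (-rows) % T
    pad_c = (-cols) % T
    return np.pad(M, ((0, pad_r), (0, pad_c)))   # zeros are the identity for matmul

N, T = 100, 32                                   # 100 is not a multiple of 32
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

A_p, B_p = pad_to_multiple(A, T), pad_to_multiple(B, T)
C_p = A_p @ B_p                                  # uniform kernel over the padded 128 x 128 domain
C = C_p[:N, :N]                                  # crop the result back to the original shape

# Padding with zeros does not change the result of the multiplication.
assert np.allclose(C, A @ B, rtol=1e-4)
```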