Modern deep learning accelerators provide massive arithmetic throughput. However, achieving peak utilization is rarely limited by the speed of the floating-point units. Instead, the bottleneck is almost always the rate at which data moves from memory to the processor. The cost of fetching operands from DRAM can be orders of magnitude higher than the cost of performing the multiplication itself. This discrepancy is commonly referred to as the memory wall.
To mitigate this latency, hardware architects employ a hierarchy of caches (L1, L2, L3). Loop tiling, also known as loop blocking, is the primary software transformation used to align software execution with this hardware hierarchy. It partitions the iteration space of a loop nest into smaller blocks that fit into a specific level of fast memory. This ensures that once a data chunk is loaded, it is reused as many times as possible before being evicted.
Consider a standard dense matrix multiplication $C = A \times B$, where all matrices are of size $N \times N$. A naive implementation typically consists of three nested loops:
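// Naive matrix multiplication: row-major float arrays, C assumed zero-initialized
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}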
In a row-major storage format, the access patterns reveal a performance hazard: A[i][k] walks along a row, so successive iterations of the k loop stay within the same cache lines, but B[k][j] walks down a column with a stride of $N$ elements, so every iteration of the k loop lands on a different cache line.
When $N$ is sufficiently large, the working set size exceeds the cache capacity. By the time the program finishes one row of $C$ and needs to reuse the elements of $B$, the cache lines containing $B$ have likely been evicted.
Tiling addresses this by introducing new loops to traverse the data in blocks. We split the original iteration space into outer loops that step over tiles, and inner loops that iterate within a tile.
If we choose a tile size of $T$, the transformed logic looks like this:
// Stepping through tiles
for (int i_o = 0; i_o < N; i_o += T) {
    for (int j_o = 0; j_o < N; j_o += T) {
        for (int k_o = 0; k_o < N; k_o += T) {
            // Executing the tile (fits in cache);
            // min() clamps a tile at the matrix edge when N is not a multiple of T
            for (int i_i = i_o; i_i < min(i_o + T, N); i_i++) {
                for (int j_i = j_o; j_i < min(j_o + T, N); j_i++) {
                    for (int k_i = k_o; k_i < min(k_o + T, N); k_i++) {
                        C[i_i][j_i] += A[i_i][k_i] * B[k_i][j_i];
                    }
                }
            }
        }
    }
}
This restructuring changes the order of operations but maintains mathematical equivalence. The processor now loads a block of $A$ and a block of $B$ into the cache and computes all necessary partial products using that data before moving to the next block.
The following diagram illustrates how the iteration space is partitioned into blocks. The highlighted area represents the active tile being processed, which must fit entirely within the target cache level.
Decomposition of a large matrix iteration space into smaller tiles. The active tile (blue) represents the working set currently resident in the L1 cache.
We can quantify the benefits of tiling by calculating the Data Reuse Factor. Let's assume a simplified cache model where the cache can hold the three $T \times T$ blocks (one each from $A$, $B$, and $C$) required by the inner loops.
Without tiling, the total number of memory transfers for $B$ (assuming strict LRU eviction and a large $N$) approaches $N^3$, as the rows of $B$ are re-fetched for every column of $C$.
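To put numbers on this (an illustrative case, not one given above): with single-precision values and $N = 1024$, each matrix occupies only $1024^2 \times 4\,\text{B} = 4\,\text{MB}$, yet the naive loop nest streams on the order of $N^3 \times 4\,\text{B} \approx 4\,\text{GB}$ through the memory system, roughly a thousand times the size of the data itself.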
With tiling, to compute a single $T \times T$ block of $C$, we need to iterate through only $N/T$ blocks of $A$ and $B$.
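Summing over all $(N/T)^2$ blocks of $C$, the tiled version therefore transfers on the order of

$$\underbrace{\left(\frac{N}{T}\right)^2}_{\text{blocks of } C} \times \underbrace{\frac{N}{T}}_{k \text{ steps}} \times \underbrace{2T^2}_{\text{elements of } A,\,B} \;=\; \frac{2N^3}{T}$$

elements, a reduction by a factor proportional to $T$ relative to the untiled loop. This count is a sketch under the simplified cache model above and ignores the $N^2$ traffic for $C$ itself.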
The arithmetic intensity (FLOPs per byte) improves linearly with the tile size $T$.
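As a rough sketch of why (counting only the compulsory loads of $A$ and $B$, and writing $s$ for the element size in bytes, a symbol introduced here for illustration): the three inner loops over a tile perform $2T^3$ floating-point operations while loading about $2T^2$ elements, so

$$\text{Arithmetic intensity} \approx \frac{2T^3}{2T^2 \cdot s} = \frac{T}{s}\ \text{FLOPs per byte}.$$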
Larger tiles equate to better bandwidth efficiency, provided they fit in the cache. If $T$ is chosen such that the three $T \times T$ blocks no longer fit in the cache, cache thrashing occurs, and performance degrades sharply.
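For intuition, a back-of-the-envelope sizing with assumed numbers (a 32 KB L1 data cache and 4-byte floats, neither of which is specified above): keeping three $T \times T$ blocks resident requires

$$3T^2 \times 4\,\text{B} \le 32\,\text{KB} \;\Rightarrow\; T \le \sqrt{\tfrac{32768}{12}} \approx 52,$$

so a choice such as $T = 32$ fits with headroom for other data sharing the cache.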
The chart below demonstrates the theoretical relationship between tile size and effective memory bandwidth utilization. Note the "Performance Cliff" where the tile size exceeds cache capacity.
Impact of tile size on memory bandwidth. Performance increases with tile size due to better reuse until the working set exceeds the L1 cache capacity (dashed red line), causing a drop due to thrashing.
Modern CPUs and GPUs have multiple layers of memory (Registers, L1, L2, VRAM/DRAM). Advanced compilers do not stop at a single level of tiling. They perform multi-level tiling to maximize reuse at every level of the hierarchy.
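As a sketch of the idea (the tile sizes T2 and T1 are illustrative choices targeting the L2 and L1 caches, and T2 is assumed to be a multiple of T1), a two-level version of the earlier loop nest looks like this:

// Outer tiles sized for the L2 cache
for (int i2 = 0; i2 < N; i2 += T2)
    for (int j2 = 0; j2 < N; j2 += T2)
        for (int k2 = 0; k2 < N; k2 += T2)
            // Inner tiles sized for the L1 cache, nested inside the L2 tile
            for (int i1 = i2; i1 < min(i2 + T2, N); i1 += T1)
                for (int j1 = j2; j1 < min(j2 + T2, N); j1 += T1)
                    for (int k1 = k2; k1 < min(k2 + T2, N); k1 += T1)
                        // Innermost loops traverse an L1-resident block
                        for (int i = i1; i < min(i1 + T1, N); i++)
                            for (int j = j1; j < min(j1 + T1, N); j++)
                                for (int k = k1; k < min(k1 + T1, N); k++)
                                    C[i][j] += A[i][k] * B[k][j];

A further level, tiling the innermost block into register-sized chunks, follows the same pattern.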
In deep learning compilers like Apache TVM or Halide, tiling is expressed as a scheduling primitive separate from the algorithm definition. This separation allows the search for optimal tile sizes (such as the bn and bm factors used below) to be automated.
For example, in TVM's te (Tensor Expression) language, a tiling schedule is applied to the axes of the computation:
import tvm
from tvm import te

N = 1024  # matrix dimension (example value)

# Define the computation
A = te.placeholder((N, N), name='A')
B = te.placeholder((N, N), name='B')
k = te.reduce_axis((0, N), name='k')
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k))
# Create a schedule
s = te.create_schedule(C.op)
# Apply tiling
# split the i and j axes by factors bn and bm
bn, bm = 32, 32
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bm)
# Reorder to put outer loops first, then reduction, then inner loops
k_outer, k_inner = s[C].split(k, factor=4)
s[C].reorder(xo, yo, k_outer, xi, yi, k_inner)
The tile primitive effectively splits the axes and reorders them. The subsequent reorder command is important; it ensures that the loops iterating over the cache-resident blocks (xi, yi) are the innermost spatial loops.
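Conceptually, the schedule above should lower to a loop nest of roughly the following shape (a C-style sketch of the intent, assuming N is divisible by the split factors, not the literal code TVM generates):

// Loop order after tile(bn=32, bm=32), split(k, factor=4), and the reorder above
for (int i_outer = 0; i_outer < N / 32; i_outer++)
    for (int j_outer = 0; j_outer < N / 32; j_outer++)
        for (int k_outer = 0; k_outer < N / 4; k_outer++)
            for (int i_inner = 0; i_inner < 32; i_inner++)
                for (int j_inner = 0; j_inner < 32; j_inner++)
                    for (int k_inner = 0; k_inner < 4; k_inner++)
                        C[i_outer * 32 + i_inner][j_outer * 32 + j_inner] +=
                            A[i_outer * 32 + i_inner][k_outer * 4 + k_inner] *
                            B[k_outer * 4 + k_inner][j_outer * 32 + j_inner];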
Tiling introduces complexity when the matrix dimension $N$ is not perfectly divisible by the tile size $T$. This results in "remainder" loops, small slivers of computation at the edges of the matrix.
Compilers handle this in two ways, as sketched below:
1. Boundary guards: clamp the inner loop bounds, as the min() calls do in the C example above. This keeps a single loop nest but pays for an extra comparison at every tile edge.
2. Loop peeling and padding: emit a fast main loop that covers only full tiles plus a separate "epilogue" loop for the partial tiles, or pad the tensors up to a multiple of $T$ so that every tile is full.
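A minimal sketch of the peeling approach for a single loop dimension (process_row is a hypothetical placeholder for the per-row work):

// Main loop covers only full tiles; N_full rounds N down to a multiple of T
int N_full = (N / T) * T;
for (int i_o = 0; i_o < N_full; i_o += T)
    for (int i_i = i_o; i_i < i_o + T; i_i++)   // no bounds check needed here
        process_row(i_i);                       // hypothetical per-row work
// Epilogue handles the remainder rows, if any
for (int i = N_full; i < N; i++)
    process_row(i);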
By mastering loop tiling, you define the granularity at which the hardware consumes data. This is the primary step in converting a mathematical definition into a high-performance binary. Subsequent sections will examine how to vectorize these tiles to utilize SIMD instructions.