Matrix multiplication serves as the computational backbone for nearly every deep learning model. From dense layers in multilayer perceptrons to the query-key-value projections in transformers, the efficiency of General Matrix Multiply (GEMM) operations directly dictates the training speed and inference latency of neural networks. Understanding how to optimize this single kernel provides the foundation for optimizing virtually any tensor operator.

In this section, we will apply the scheduling primitives discussed earlier (reordering, tiling, and vectorization) to transform a naive matrix multiplication implementation into a high-performance kernel. We will observe how each transformation impacts memory access patterns and arithmetic intensity.

## The Baseline Implementation

Let us consider the multiplication of two matrices, $A$ (size $M \times K$) and $B$ (size $K \times N$), to produce matrix $C$ (size $M \times N$). The mathematical definition is:

$$ C_{i,j} = \sum_{k=0}^{K-1} A_{i,k} \times B_{k,j} $$

In a standard row-major storage format, elements of a row are stored contiguously in memory. A direct translation of the mathematical definition into code results in a triple-nested loop. We refer to this as the "naive" implementation.

```python
# Naive implementation
for i in range(M):
    for j in range(N):
        for k in range(K):
            C[i, j] += A[i, k] * B[k, j]
```

While correct, this implementation is inefficient on modern hardware. The innermost loop iterates over $k$. For matrix $A$, we access $A[i, k]$, which moves sequentially through memory as $k$ increments. However, for matrix $B$, we access $B[k, j]$. Since $B$ is stored in row-major order, incrementing $k$ means jumping over $N$ elements in memory to reach the next row.

This strided access pattern causes frequent cache misses. The CPU fetches a cache line containing $B[k, j]$, uses only one value, and then discards it to fetch the next row. This limits performance to the speed of main memory rather than the speed of the processor.

## Optimization 1: Loop Reordering

The first and often simplest optimization is changing the order of the loops. We want the innermost loop to access memory contiguously for the matrices being read and written. By swapping the j and k loops, we change the iteration pattern.

```python
# Reordered implementation
for i in range(M):
    for k in range(K):
        for j in range(N):
            C[i, j] += A[i, k] * B[k, j]
```

In this configuration:

- $A[i, k]$ is constant with respect to the inner loop j. The compiler can load it into a register once and reuse it $N$ times.
- $B[k, j]$ is accessed sequentially as j increments. This maximizes spatial locality, allowing the CPU to utilize every element in a fetched cache line.
- $C[i, j]$ is also accessed sequentially.

This simple change often yields significant speedups by reducing memory traffic, even though the total number of arithmetic operations remains exactly the same.

## Optimization 2: Loop Tiling

While reordering improves spatial locality, it does not inherently solve the problem of temporal locality for large matrices. If the matrices $A$, $B$, and $C$ are too large to fit entirely in the L1 or L2 cache, the processor will evict data that it needs later, leading to "cache thrashing."

Loop tiling (also known as blocking) addresses this by partitioning the execution into smaller sub-matrices (tiles) that fit within the cache. We split the iteration space into outer loops (iterating over tiles) and inner loops (iterating within a tile).

*Figure: Tiling partitions the global iteration space into smaller blocks to maximize data reuse within the cache hierarchy; each tile (block) of the matrix is sized to fit in the L1 cache.*

Consider tiling the i and j loops by factors bn and bm. The structure transforms from a triple loop into a six-level loop nest (or fewer, depending on which dimensions are tiled). By processing a small block of $C$ at a time, we ensure that the corresponding blocks of $A$ and $B$ remain hot in the cache during the computation.

The choice of tile size is hardware-dependent. It requires balancing the size of the cache with the register pressure. If tiles are too small, we lose the benefit of amortization. If tiles are too large, data spills out of the cache.
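Below is a minimal Python sketch of the tiled loop structure, blocking only the i and j loops (five loop levels). The matrix sizes and the tile factor of 32 are arbitrary choices for illustration, not tuned values; a production kernel would be generated by a compiler or written with intrinsics.

```python
import numpy as np

# Illustrative sizes and tile factors (arbitrary; real values are tuned per CPU).
M, N, K = 128, 128, 128
TILE_I, TILE_J = 32, 32  # the bn/bm blocking factors described above

A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

# Outer loops walk over tiles of C; inner loops compute one tile while the
# corresponding pieces of A and B stay hot in cache.
for i0 in range(0, M, TILE_I):                      # tile loop over rows of C
    for j0 in range(0, N, TILE_J):                  # tile loop over columns of C
        for k in range(K):                          # reduction axis
            for i in range(i0, min(i0 + TILE_I, M)):
                a_ik = A[i, k]                      # reused across the inner j loop
                for j in range(j0, min(j0 + TILE_J, N)):
                    C[i, j] += a_ik * B[k, j]       # contiguous access to B and C

assert np.allclose(C, A @ B, atol=1e-2)             # matches the reference result
```

The `min(...)` bounds handle matrix dimensions that are not multiples of the tile size; when the dimensions divide evenly, compilers typically emit the simpler fixed-size inner loops.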
## Optimization 3: Vectorization

Modern processors support SIMD (Single Instruction, Multiple Data) extensions such as AVX-512 or NEON. These instructions allow the CPU to perform the same operation on multiple data points simultaneously. For example, instead of multiplying two floating-point numbers, a single instruction might multiply two vectors of 8 or 16 floats.

Compilers can automatically vectorize loops, but only if the memory access patterns are predictable and the dependencies are clear. The reordered loop structure (i, k, j) facilitates this because the innermost loop over j performs element-wise additions and multiplications on contiguous memory.

To explicitly target SIMD units, we often unroll the innermost loop or mark it with a vectorization pragma. In an ML compiler stack like TVM or MLIR, this is done during the scheduling phase by splitting the inner loop by the vector lane size and vectorizing the inner split.
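The sketch below emulates this inner-loop split in NumPy: the j loop is divided into chunks of a hypothetical lane width of 8, and each slice update stands in for a single SIMD multiply-add over contiguous elements. This is only an analogy for illustration; a real kernel relies on the compiler, intrinsics, or the scheduling primitive itself to emit vector instructions.

```python
import numpy as np

# Emulating vectorization of the innermost j loop: each slice update below
# stands in for one SIMD multiply-add over VEC contiguous lanes. The sizes
# and the lane width of 8 are arbitrary choices for illustration.
M, N, K = 64, 64, 64
VEC = 8  # e.g., eight float32 lanes in a 256-bit vector register

A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

for i in range(M):
    for k in range(K):
        a_ik = A[i, k]                    # scalar broadcast across all lanes
        for j0 in range(0, N, VEC):       # j loop split by the vector width
            # One "vector" multiply-add over VEC contiguous elements of B and C.
            # (Assumes N is a multiple of VEC; a real kernel handles the
            # remainder with a scalar epilogue.)
            C[i, j0:j0 + VEC] += a_ik * B[k, j0:j0 + VEC]

assert np.allclose(C, A @ B, atol=1e-2)
```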
## Performance Impact Analysis

Combining these techniques results in a kernel that runs more than an order of magnitude faster than the naive baseline. The following figure summarizes the cumulative performance gain typically observed when optimizing a dense matrix multiplication on a standard CPU.

*Figure: Relative performance of optimization steps. Cumulative speedup factors observed when applying sequential optimizations to a matrix multiplication kernel: naive loop 1x, loop reordering 5.2x, tiling (blocking) 12.4x, vectorization + unrolling 24.8x.*

## Putting It Together

In a practical compiler workflow, you do not write these loops by hand in C++. Instead, you define the computation and then apply a schedule:

1. **Definition**: You define $C = A \times B$ using a high-level DSL.
2. **Schedule**: You create a schedule object.
3. **Transform**:
   - `split`: You split axes i and j into outer and inner loops to define tile sizes.
   - `reorder`: You reorder the axes to bring the tile loops to the top and the dense computation loops to the bottom.
   - `vectorize`: You mark the innermost iteration axis to be replaced by SIMD instructions. (A code sketch of these steps appears at the end of this section.)

The final generated assembly code will heavily utilize vector registers and minimize jumps between memory addresses. This process of inspecting the loop nest, identifying bottlenecks in memory access, and systematically applying scheduling primitives is the essence of kernel optimization. While auto-tuning can find the specific tile sizes (e.g., is 32x32 better than 64x64?), understanding the structural transformations allows you to define the search space effectively.
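As a concrete illustration of this define-then-schedule workflow, here is a minimal sketch using TVM's legacy tensor expression (te) scheduling API, one of the stacks named above. The matrix sizes and the split factor of 32 are arbitrary, untuned choices, and newer TVM releases express the same transformations through TensorIR schedules instead.

```python
# Minimal sketch of the define-then-schedule workflow with TVM's te API.
# Sizes and the factor of 32 are illustrative, not tuned values.
import tvm
from tvm import te

M = N = K = 1024

# Definition: declare C = A x B at a high level; no loop order is implied yet.
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

# Schedule: create a schedule object, then transform it.
s = te.create_schedule(C.op)
i, j = s[C].op.axis
kr = s[C].op.reduce_axis[0]

io, ii = s[C].split(i, factor=32)   # split: tile the i axis
jo, ji = s[C].split(j, factor=32)   # split: tile the j axis
s[C].reorder(io, jo, kr, ii, ji)    # reorder: tile loops outermost, dense loops innermost
s[C].vectorize(ji)                  # vectorize: map the innermost axis to SIMD lanes

# Lower to inspect the transformed loop nest, then build for a CPU target.
print(tvm.lower(s, [A, B, C], simple_mode=True))
func = tvm.build(s, [A, B, C], target="llvm")
```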