General Matrix Multiplication (GEMM) defines the workload that dominates modern neural network execution. While high-level frameworks abstract this operation into a single function call like `torch.matmul` or `np.dot`, the underlying compiler must generate a sequence of machine instructions that efficiently utilize the CPU or GPU pipeline.

Our goal is to transform a naive, mathematically correct implementation into a schedule that respects hardware memory hierarchies. We will use a high-level Python syntax similar to the TVM Schedule API to define these transformations. This approach allows us to decouple the algorithm definition from its execution strategy.

## The Baseline Implementation

Consider two matrices, $A$ of shape $(M, K)$ and $B$ of shape $(K, N)$, producing a result $C$ of shape $(M, N)$. The mathematical definition of the computation is:

$$C_{i,j} = \sum_{k=0}^{K-1} A_{i,k} \times B_{k,j}$$

In a standard imperative language like C++, a direct translation results in a triple-nested loop.

```cpp
// Naive implementation. C is assumed to be zero-initialized.
for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
        for (int k = 0; k < K; ++k) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
```

While functional, this code exhibits poor performance on large matrices. The central issue is the access pattern of matrix $B$. Assuming row-major storage, accessing $B[k][j]$ in the innermost loop means jumping over $N$ elements in memory for every increment of $k$. This strided access pattern leads to frequent cache misses: the CPU fetches an entire cache line but uses only a single value from it before requesting a memory address far away.

## Optimization 1: Loop Tiling (Blocking)

To mitigate cache thrashing, we apply loop tiling. This technique divides the iteration space into smaller blocks that fit entirely within the L1 or L2 cache. By computing a small sub-matrix of $C$ at a time, we ensure that the required elements of $A$ and $B$ remain in the cache for repeated use.

In our schedule, we split the spatial axes $i$ and $j$ into outer and inner loops, and do the same for the reduction axis $k$.

```python
# Split axis i into blocks of size bn
i_outer, i_inner = s[C].split(axis=i, factor=bn)
# Split axis j into blocks of size bn
j_outer, j_inner = s[C].split(axis=j, factor=bn)
# Split reduction axis k into blocks of size bk
k_outer, k_inner = s[C].split(axis=k, factor=bk)
```

Together with the loop reordering introduced in the next step, this transformation changes the execution order: instead of sweeping the full iteration space row by row, the processor finishes one small block at a time. The visual representation of this decomposition illustrates how the global matrix is processed as a grid of sub-matrices.

```dot
digraph G {
    rankdir=TB;
    node [shape=box, style="filled", color="#dee2e6", fontname="Helvetica"];
    edge [fontname="Helvetica"];

    subgraph cluster_global {
        label = "Global Matrix (M x N)";
        style = dashed;
        color = "#adb5bd";
        Global [label="Full Iteration Space\ni: 0..M, j: 0..N"];
    }

    subgraph cluster_tiled {
        label = "Tiled Execution";
        style = filled;
        color = "#e9ecef";
        OuterLoops [label="Outer Loops\nTraverse Block Coordinates", fillcolor="#a5d8ff"];
        InnerLoops [label="Inner Loops\nCompute Micro-Kernel (bn x bn)", fillcolor="#96f2d7"];
        OuterLoops -> InnerLoops [label="Instantiate"];
    }

    Global -> OuterLoops [label="Split Transformation"];
}
```

*Visualization of decomposing the global iteration space into local tiles to improve cache locality.*
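To make the split concrete, the loop nest it produces can be written out by hand. The following C++ sketch (the function name and flat row-major buffers are illustrative, not part of the schedule above) shows the default order generated by the three splits, assuming the block sizes divide the matrix dimensions exactly:

```cpp
// Hand-lowered sketch of the three splits in their default order
// (i_outer, i_inner, j_outer, j_inner, k_outer, k_inner).
// Assumes bn divides M and N, bk divides K, row-major storage,
// and a zero-initialized C.
void gemm_split(int M, int N, int K, int bn, int bk,
                const float* A, const float* B, float* C) {
    for (int i_outer = 0; i_outer < M / bn; ++i_outer)
        for (int i_inner = 0; i_inner < bn; ++i_inner)
            for (int j_outer = 0; j_outer < N / bn; ++j_outer)
                for (int j_inner = 0; j_inner < bn; ++j_inner)
                    for (int k_outer = 0; k_outer < K / bk; ++k_outer)
                        for (int k_inner = 0; k_inner < bk; ++k_inner) {
                            // Each original index is reconstructed as outer * factor + inner.
                            int i = i_outer * bn + i_inner;
                            int j = j_outer * bn + j_inner;
                            int k = k_outer * bk + k_inner;
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
                        }
}
```

Note that splitting on its own visits the iterations in the same order as the naive nest; the locality benefit appears once the loops are rearranged so that an entire tile is finished before moving on, which is the next transformation.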
The choice of blocking factors (`bn`, `bk`) is critical. If the block size is too small, loop overhead dominates; if it is too large, the working set exceeds the cache size and performance reverts to baseline levels.

## Optimization 2: Loop Reordering

After tiling, the order of the nested loops dictates data locality. The default order generated by the split might be `i_outer, i_inner, j_outer, j_inner, k_outer, k_inner`. However, we want the innermost loops to perform the dense computation on the tile.

We therefore reorder the loops to push the block iterators to the top and keep the intra-tile iterators adjacent. A common strategy is to order the loops as `i_outer, j_outer, k_outer, i_inner, j_inner, k_inner`. This ordering ensures that for a fixed block of $C$, we iterate through the necessary blocks of $A$ and $B$, maximizing data reuse.

```python
s[C].reorder(i_outer, j_outer, k_outer, i_inner, j_inner, k_inner)
```

## Optimization 3: Vectorization and Layout

Modern processors execute instructions using SIMD (Single Instruction, Multiple Data) units, such as AVX-512 on Intel or NEON on ARM. These units can perform arithmetic on multiple data points (e.g., 8 floats) in a single clock cycle.

To leverage SIMD, the vectorized loop must map to contiguous memory addresses. With row-major storage, the `j_inner` axis walks consecutive elements of $C$ and $B$, so we explicitly mark this axis for vectorization.

```python
s[C].vectorize(j_inner)
```

The compiler will now attempt to replace scalar instructions with vector intrinsics (e.g., `vaddps`, `vmulps`). If the memory access is not contiguous, or if data dependencies prevent parallel execution, the compiler will fail to vectorize or, worse, generate gather/scatter instructions that significantly degrade performance.

## Optimization 4: Loop Unrolling

Branch prediction and loop index maintenance incur overhead. Loop unrolling replicates the loop body multiple times within a single iteration, reducing the number of comparison and jump instructions.

We apply unrolling to the reduction axis `k_inner` inside the micro-kernel.

```python
s[C].unroll(k_inner)
```

This directive tells the compiler to explicitly write out the summation steps. While this increases the binary size, it allows the CPU to pipeline instructions more effectively and exposes opportunities for instruction-level parallelism.
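Putting the reordering, vectorization, and unrolling directives together, the loop nest this schedule describes can again be sketched by hand in C++. The function name and flat row-major buffers are illustrative, exact divisibility of the dimensions by the block sizes is assumed, and the vectorization and unrolling themselves are left to the compiler, so they appear only as comments marking where each directive acts:

```cpp
// Sketch of the reordered schedule
// (i_outer, j_outer, k_outer, i_inner, j_inner, k_inner).
// Assumes bn divides M and N, bk divides K, row-major storage,
// and a zero-initialized C.
void gemm_tiled_reordered(int M, int N, int K, int bn, int bk,
                          const float* A, const float* B, float* C) {
    for (int i_outer = 0; i_outer < M / bn; ++i_outer)
        for (int j_outer = 0; j_outer < N / bn; ++j_outer)          // fix one bn x bn tile of C
            for (int k_outer = 0; k_outer < K / bk; ++k_outer)      // stream the matching tiles of A and B
                for (int i_inner = 0; i_inner < bn; ++i_inner)
                    for (int j_inner = 0; j_inner < bn; ++j_inner)      // s[C].vectorize(j_inner): contiguous in C and B
                        for (int k_inner = 0; k_inner < bk; ++k_inner) { // s[C].unroll(k_inner): body replicated bk times
                            int i = i_outer * bn + i_inner;
                            int j = j_outer * bn + j_inner;
                            int k = k_outer * bk + k_inner;
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
                        }
}
```

On an AVX2 target, for instance, the `j_inner` axis would typically be lowered to 8-wide single-precision vector loads, multiply-adds, and stores, while the unrolled `k_inner` body removes most of the loop bookkeeping inside the micro-kernel.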
## Performance Analysis

Applying these transformations incrementally yields substantial performance gains. We measure throughput in GFLOPS (giga floating-point operations per second); since a GEMM performs one multiplication and one addition per $(i, j, k)$ triple, this is $2 \times M \times N \times K$ operations divided by the measured runtime. The baseline implementation typically achieves only a fraction of the hardware's theoretical peak because it is limited by memory bandwidth.

By optimizing the schedule, we shift the bottleneck from memory bandwidth to compute capability. The following chart shows the performance scaling on a representative CPU as each optimization layer is applied.

```json
{
  "layout": {
    "title": "Performance Scaling of GEMM Optimizations",
    "xaxis": { "title": "Optimization Stage" },
    "yaxis": { "title": "Throughput (GFLOPS)" },
    "barmode": "group",
    "plot_bgcolor": "#f8f9fa",
    "paper_bgcolor": "#ffffff",
    "font": { "family": "Helvetica" }
  },
  "data": [
    {
      "type": "bar",
      "x": ["Naive Loop", "Tiled/Blocked", "Reordered", "Vectorized + Unrolled"],
      "y": [2.5, 18.2, 45.6, 112.4],
      "marker": { "color": ["#adb5bd", "#74c0fc", "#63e6be", "#339af0"] },
      "text": ["2.5", "18.2", "45.6", "112.4"],
      "textposition": "auto"
    }
  ]
}
```

*Throughput comparison across different optimization stages. The jump from the naive loop to the tiled version reflects better cache usage, while vectorization and unrolling exploit the compute units.*

The data indicates that while algorithmic correctness is sufficient for functionality, engineering the loop schedule is essential for performance. The combination of tiling, reordering, and vectorization aligns the software logic with the physical constraints of the hardware, resulting in speedups of over 40x compared to the naive implementation.

In the subsequent sections, we will explore how automated search policies (auto-tuning) can determine the optimal split factors (`bn`, `bk`) for arbitrary hardware targets without manual intervention.
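The absolute numbers above depend on the target machine. As a minimal sketch of one way to take such measurements, the harness below (illustrative sizes and function names; it times a single run of the naive kernel, which in practice you would warm up and repeat) reports the GFLOPS figure a search policy could use to score each candidate `(bn, bk)` configuration:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Naive baseline kernel over flat row-major buffers; any of the tiled
// variants sketched earlier can be substituted here for comparison.
static void gemm_naive(int M, int N, int K,
                       const float* A, const float* B, float* C) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}

int main() {
    const int M = 1024, N = 1024, K = 1024;  // illustrative problem size
    std::vector<float> A(M * K, 1.0f), B(K * N, 1.0f), C(M * N, 0.0f);

    auto start = std::chrono::steady_clock::now();
    gemm_naive(M, N, K, A.data(), B.data(), C.data());
    auto end = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(end - start).count();
    // One multiply and one add per (i, j, k) triple: 2 * M * N * K flops.
    double gflops = 2.0 * M * N * K / seconds / 1e9;
    std::printf("%.3f s, %.2f GFLOPS\n", seconds, gflops);
    return 0;
}
```

Swapping in the tiled kernel and varying `bn` and `bk` should reproduce the general trend shown in the chart, though the exact figures will vary from machine to machine.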