Modern deep learning accelerators are notoriously bandwidth-bound. While compute capability (FLOPs) has scaled exponentially, memory bandwidth has lagged behind. Consequently, the primary bottleneck in model execution is often not how fast the chip can calculate, but how fast it can move tensors between global memory (HBM/DRAM) and on-chip registers. Operator fusion is the single most effective graph-level optimization for addressing this disparity.

By identifying patterns in the computational graph where multiple operators can be executed in a single kernel pass, fusion reduces the total volume of memory traffic. It eliminates the need to write intermediate results to global memory only to read them back immediately for the subsequent operation. This increases the arithmetic intensity of the kernel (the ratio of floating-point operations to bytes accessed), allowing the hardware to operate closer to its theoretical peak compute performance.

Vertical Fusion

Vertical fusion, also known as producer-consumer fusion, combines sequential operations where the output of one node is the input to the next. It is the most common form of fusion in standard compiler pipelines such as TVM and XLA.

Consider a standard convolutional block of the kind found in architectures like ResNet: a convolution followed by a bias addition and a ReLU activation. Without fusion, execution requires three distinct kernel launches:

1. Conv2D: Read Input + Weights $\rightarrow$ Compute $\rightarrow$ Write Temp1 to HBM.
2. BiasAdd: Read Temp1 + Bias $\rightarrow$ Compute $\rightarrow$ Write Temp2 to HBM.
3. ReLU: Read Temp2 $\rightarrow$ Compute $\rightarrow$ Write Output to HBM.

This sequence incurs significant redundant traffic. The intermediate tensors Temp1 and Temp2 are transient; they are not required by any other part of the network. In a fused strategy, the compiler generates a single kernel: the convolution output is held in a register or accumulated in on-chip storage (such as shared memory), the bias is added, and the ReLU non-linearity is applied before the final result is ever written to global memory.

[Figure: diagram comparing the unfused sequence (Input $\rightarrow$ Conv2D $\rightarrow$ HBM round-trip $\rightarrow$ BiasAdd $\rightarrow$ HBM round-trip $\rightarrow$ ReLU) with the fused sequence (Input $\rightarrow$ single Conv2D + Bias + ReLU kernel $\rightarrow$ Output).]
Comparison of memory traffic between discrete execution and vertical fusion. The fused kernel eliminates intermediate round-trips to global memory.

To apply this transformation, the compiler must ensure that the producer node (e.g., the convolution) has no other consumers. If Temp1 were consumed by a separate branch of the graph (e.g., a skip connection), fusion would imply either recomputing Temp1 for the second branch or abandoning fusion to preserve the intermediate state. Most modern compilers use a cost model to decide whether the cost of recomputation is cheaper than the memory-bandwidth cost of storing the intermediate tensor.
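As a concrete illustration, the sketch below shows the kind of loop nest the fused Conv2D + BiasAdd + ReLU block reduces to. It is a minimal C++ rendering under simplifying assumptions (valid convolution, unit stride, no padding, flat row-major buffers; the function name and layouts are illustrative, not any particular compiler's output):

```cpp
#include <algorithm>
#include <vector>

// Minimal sketch of a fused Conv2D + BiasAdd + ReLU loop nest.
// Layouts: input [C_in, H, W], weight [C_out, C_in, K, K], bias [C_out],
// output [C_out, H-K+1, W-K+1]. All shapes are illustrative assumptions.
void fused_conv_bias_relu(const std::vector<float>& input,
                          const std::vector<float>& weight,
                          const std::vector<float>& bias,
                          std::vector<float>& output,
                          int C_in, int C_out, int H, int W, int K) {
  const int H_out = H - K + 1;
  const int W_out = W - K + 1;
  for (int co = 0; co < C_out; ++co) {
    for (int y = 0; y < H_out; ++y) {
      for (int x = 0; x < W_out; ++x) {
        // The accumulator lives in a register: no Temp1/Temp2 tensor
        // is ever materialized in global memory.
        float acc = 0.0f;
        for (int ci = 0; ci < C_in; ++ci)
          for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx)
              acc += input[(ci * H + y + ky) * W + (x + kx)] *
                     weight[((co * C_in + ci) * K + ky) * K + kx];
        // BiasAdd and ReLU run as an epilogue before the single store.
        output[(co * H_out + y) * W_out + x] = std::max(acc + bias[co], 0.0f);
      }
    }
  }
}
```

An unfused lowering would split this into three separate loop nests, writing Temp1 and Temp2 out between them; in the fused form the input and weights are read once and each output element is written exactly once.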
Horizontal Fusion

Horizontal fusion combines operators that are independent of one another but share the same input data. This pattern appears frequently in the multi-head attention mechanism of Transformers and in Inception-style modules in CNNs.

In a Transformer, a single input embedding is projected into Query ($Q$), Key ($K$), and Value ($V$) tensors by three separate matrix multiplications (dense layers). Executed separately, the GPU must read the input embedding from memory three times.

Horizontal fusion merges these three operations into a single, larger matrix multiplication. By concatenating the weight matrices of the $Q$, $K$, and $V$ projections along the output dimension (three $[d, d]$ matrices become one $[d, 3d]$ matrix whose output is sliced back into the three projections), the compiler creates a single kernel. This strategy provides two benefits:

1. Reduced Memory Reads: the input tensor is read once and reused for all three projections.
2. Improved Occupancy: one larger kernel typically saturates the GPU compute units more effectively than several smaller, fragmented kernels.

Fusion Rules and Topological Constraints

Not all graph nodes can be fused indiscriminately. The compiler must categorize operators based on their memory access patterns to determine compatibility. Operators are typically classified into four categories for fusion purposes:

1. Injective (element-wise): one output element corresponds to one input element (e.g., Add, ReLU, Cast). These are the easiest to fuse.
2. Broadcast: operations that duplicate data along a dimension (e.g., adding a bias vector to a matrix). These can generally be fused with injective ops.
3. Reduction: operations that map a tensor to a smaller one (e.g., Sum, MaxPool). Fusing reductions with injective operators is possible (BatchNorm, for instance, involves reductions) but requires more involved loop scheduling to handle thread synchronization boundaries.
4. Opaque: operations that cannot be easily analyzed or that require global synchronization (e.g., Sort, Non-Maximum Suppression). These act as fusion barriers.

The fusion algorithm typically traverses the graph (often in post-order) and greedily merges nodes according to these categories; a simplified sketch of such a pass follows the figure below. A standard rule is that any number of injective operations can be fused into a reduction, but a reduction cannot easily be fused into another reduction without the tiling optimizations discussed in the loop scheduling chapter.

[Figure: bar chart, "Impact of Fusion on Arithmetic Intensity" — FLOPs per byte on a log scale for an unfused pointwise op (~0.25), an unfused GEMM (~10), and a fused block (~18).]
Arithmetic intensity comparison. Fused blocks significantly increase the compute-to-memory ratio, moving the workload from a memory-bound regime toward a compute-bound regime.
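The following C++ sketch shows how such a greedy pass might be structured over a toy graph. The Kind enum, the can_fuse rule table, and the chain-only graph representation are simplifying assumptions for illustration; production compilers such as TVM apply considerably more analysis (dominator information, cost models) than this:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy operator categories matching the four classes described above.
// Declaration order doubles as "heaviness": Injective < Broadcast <
// Reduction < Opaque.
enum class Kind { Injective, Broadcast, Reduction, Opaque };

struct Node {
  const char* name;
  Kind kind;
  int producer;  // index of the single producer, or -1 (chains only, for brevity)
  int group;     // fusion group (i.e., generated kernel) assigned by the pass
};

// May a group whose heaviest member is `p` absorb a consumer of kind `c`?
bool can_fuse(Kind p, Kind c) {
  if (p == Kind::Opaque || c == Kind::Opaque) return false;       // fusion barrier
  if (p == Kind::Reduction && c == Kind::Reduction) return false; // needs tiling
  return true;  // injective/broadcast chains and injective<->reduction pairs
}

// Greedy producer-consumer grouping over a topologically ordered chain.
// Each group remembers its heaviest kind so two reductions never share a kernel.
void fuse_chain(std::vector<Node>& graph) {
  std::vector<Kind> group_kind;
  for (Node& n : graph) {
    if (n.producer >= 0 &&
        can_fuse(group_kind[graph[n.producer].group], n.kind)) {
      n.group = graph[n.producer].group;
      group_kind[n.group] = std::max(group_kind[n.group], n.kind);
    } else {
      n.group = static_cast<int>(group_kind.size());
      group_kind.push_back(n.kind);
    }
  }
}

int main() {
  // conv2d -> bias_add -> relu -> sum -> sort
  std::vector<Node> graph = {
      {"conv2d",   Kind::Reduction, -1, -1},
      {"bias_add", Kind::Broadcast,  0, -1},
      {"relu",     Kind::Injective,  1, -1},
      {"sum",      Kind::Reduction,  2, -1},
      {"sort",     Kind::Opaque,     3, -1},
  };
  fuse_chain(graph);
  for (const Node& n : graph)
    std::printf("%-8s -> kernel %d\n", n.name, n.group);
  // conv2d, bias_add, relu land in kernel 0; the second reduction (sum)
  // starts kernel 1; the opaque sort is isolated in kernel 2.
}
```

Note how the bias and ReLU epilogue fuse into the convolution's group, while the second reduction and the opaque operator each force a new kernel, mirroring the rules above.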
Challenges in Code Generation

While fusion simplifies the graph structure, it complicates code generation. A fused kernel requires more registers to hold the live state of multiple operations. If register pressure exceeds the per-thread hardware limit, the compiler spills data to local memory (backed by the L1/L2 caches) or reduces the number of active threads (occupancy), either of which can degrade performance.

Furthermore, fusing operators with mismatched loop bounds requires careful index alignment. For instance, fusing a $3 \times 3$ convolution with a $2 \times 2$ max pooling layer is difficult because their iteration spaces do not map one-to-one. In such cases, compilers may employ "compute at" scheduling primitives, effectively nesting the producer's loop nest inside the consumer's loop nest and calculating patches of the input on demand. This trade-off between recomputation and memory storage is a recurring theme in compiler engineering.
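As a closing sketch, the C++ loop nest below illustrates the "compute at" idea for exactly this pair: a $3 \times 3$ convolution computed inside the loop nest of the $2 \times 2$ (stride-2) max pool that consumes it. Shapes, layout, and the function name are illustrative assumptions (single channel, valid convolution, no padding), not any particular scheduler's output:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// "Compute at" sketch: the producer (3x3 convolution) is nested inside the
// consumer (2x2 stride-2 max pool), so the full convolution output is never
// materialized; each pooling window produces the four values it needs on demand.
void fused_conv3x3_maxpool2x2(const std::vector<float>& input,  // [H, W]
                              const std::vector<float>& weight, // [3, 3]
                              std::vector<float>& output,       // [(H-2)/2, (W-2)/2]
                              int H, int W) {
  const int conv_h = H - 2, conv_w = W - 2;      // valid 3x3 convolution extent
  const int out_h = conv_h / 2, out_w = conv_w / 2;
  for (int py = 0; py < out_h; ++py) {
    for (int px = 0; px < out_w; ++px) {
      float best = std::numeric_limits<float>::lowest();
      // Producer loop nest "computed at" the consumer: the 2x2 window of
      // convolution results is produced here, in registers.
      for (int dy = 0; dy < 2; ++dy) {
        for (int dx = 0; dx < 2; ++dx) {
          const int cy = 2 * py + dy, cx = 2 * px + dx;
          float acc = 0.0f;
          for (int ky = 0; ky < 3; ++ky)
            for (int kx = 0; kx < 3; ++kx)
              acc += input[(cy + ky) * W + (cx + kx)] * weight[ky * 3 + kx];
          best = std::max(best, acc);
        }
      }
      output[py * out_w + px] = best;  // single store per pooled element
    }
  }
}
```

With non-overlapping pooling windows nothing is recomputed and the only cost is giving up a materialized intermediate; with an overlapping consumer (stride-1 pooling, or a subsequent convolution) the nested producer would recompute shared elements, which is precisely the recomputation-versus-storage trade-off described above.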