Memory bandwidth often acts as the primary bottleneck in deep learning inference and training. While modern GPUs and TPUs possess massive arithmetic throughput, their performance is frequently limited by the speed at which data moves between the high-bandwidth memory (HBM) and the compute units. Operator fusion stands as the most effective graph-level optimization to mitigate this bottleneck.
To understand the necessity of fusion, consider a simple sequence of operations often found in neural networks: a matrix multiplication followed by a bias addition and a ReLU activation function.
In a standard deep learning framework like PyTorch or TensorFlow, execution occurs eagerly or via a graph that treats each operation as a discrete kernel launch. Without optimization, the hardware executes this sequence in three distinct steps:
1. Launch a matrix multiplication kernel, read the input and weight tensors from global memory, and write the product back to global memory.
2. Launch a bias-addition kernel, read the product back in, add the bias, and write the result out again.
3. Launch a ReLU kernel, read that result, apply the activation, and write the final output to global memory.
This approach is inefficient because the intermediate tensors (the raw matrix product and the biased result) are transient. They are only required for the subsequent step, yet they incur the heavy penalty of off-chip memory access. The arithmetic intensity, the ratio of floating-point operations (FLOPs) to memory bytes accessed, is low for the addition and ReLU steps, leaving the expensive compute cores idle while waiting for memory.
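A rough back-of-the-envelope count makes the imbalance concrete. The sketch below assumes fp32 data and ignores caching effects, so the numbers are illustrative rather than measured:

#include <stdio.h>

/* Illustrative arithmetic intensity of a stand-alone element-wise add:
   one FLOP per element versus three fp32 memory accesses per element.
   Real traffic depends on dtype, caching, and tensor layout. */
int main(void) {
    double flops_per_elem = 1.0;               /* one add per element          */
    double bytes_per_elem = 4.0 + 4.0 + 4.0;   /* read A, read B, write result */
    printf("arithmetic intensity ~ %.3f FLOP/byte\n",
           flops_per_elem / bytes_per_elem);   /* prints ~0.083 FLOP/byte      */
    return 0;
}

Modern accelerators need on the order of tens to hundreds of FLOPs per byte to keep their arithmetic units saturated, so a kernel running at roughly 0.08 FLOP/byte spends almost all of its time waiting on memory.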
Operator fusion addresses this by combining these adjacent nodes into a single kernel. Instead of writing intermediate results to global memory, the fused kernel keeps data in the faster on-chip memory (registers or L1 cache). The processor loads the inputs once, performs the matrix multiplication, adds the bias, applies the ReLU, and writes only the final result.
Comparison of data movement in unfused versus fused operations. The fused approach eliminates round-trips to global memory.
Compilers categorize fusion opportunities based on the iteration patterns of the operations involved. Not all operations can be fused easily; the decision depends on how the data maps from input to output.
Element-wise fusion is the simplest and most common form. It applies when operations map one input element to one output element (injective mappings). Examples include tensor addition, subtraction, multiplication, ReLU, Sigmoid, and Tanh.
Since the loop structures for these operations are identical (iterating over the same tensor shape), the compiler merges the loop bodies.
Unfused Loop Nest (Pseudo-code):
// Kernel 1: element-wise add; temp[] is written out to global memory
for (int i = 0; i < N; i++) {
    temp[i] = A[i] + B[i];
}
// Kernel 2: ReLU; temp[] is read back in from global memory
for (int i = 0; i < N; i++) {
    C[i] = fmaxf(0.0f, temp[i]);
}
Fused Loop Nest:
// Fused kernel: one pass, no intermediate tensor in global memory
for (int i = 0; i < N; i++) {
    float t = A[i] + B[i];      // intermediate kept in a register
    C[i] = fmaxf(0.0f, t);
}
Reductions involve operations that decrease the rank of a tensor, such as sum, max, or mean. Fusing an element-wise operation into a subsequent reduction is generally straightforward. For example, calculating the sum of squares allows the squaring operation to happen immediately before the accumulation within the same loop.
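Here is a minimal sketch of that sum-of-squares case in the same C-style loop form; the function name and signature are illustrative rather than taken from any particular library:

#include <stddef.h>

/* Sketch: the element-wise square is fused into the sum reduction,
   so no squared temporary array is ever written to memory. */
float sum_of_squares(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float v = x[i];
        acc += v * v;   /* square computed in a register, then accumulated */
    }
    return acc;
}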
However, fusing a reduction into a following element-wise operation is more complex. If you compute a sum followed by a division (as in Softmax), the sum must be fully calculated before the division can occur for any element. This introduces a barrier synchronization requirement, often necessitating multiple passes or specialized hardware intrinsics (like warp shuffles on GPUs) to handle efficiently within a single kernel.
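A minimal sketch of that barrier, using the softmax denominator as the example (the usual max-subtraction for numerical stability is omitted, and all names are illustrative):

#include <math.h>
#include <stddef.h>

/* Sketch: a reduction feeding a later element-wise op forces a barrier.
   The sum must be complete before any division happens, so a single
   fused kernel ends up running two passes over the data. */
void softmax_sketch(const float *x, float *y, size_t n) {
    float sum = 0.0f;
    /* Pass 1: element-wise exp fused INTO the reduction. */
    for (size_t i = 0; i < n; i++) {
        float e = expf(x[i]);
        y[i] = e;          /* stash exp(x) in the output buffer */
        sum += e;
    }
    /* Implicit barrier: 'sum' is only complete at this point. */
    /* Pass 2: element-wise op that depends on the finished reduction. */
    for (size_t i = 0; i < n; i++) {
        y[i] /= sum;
    }
}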
Fusing complex operators like Convolution or Matrix Multiplication (GEMM) with following element-wise operations (bias add, activation) is highly impactful. Since GEMM and Convolution are compute-bound, adding a few scalar operations like an addition or a clamp (ReLU) at the end of the computation pipeline costs virtually nothing in terms of execution time. The arithmetic units process the activation while the memory controller is busy handling the write-back of the matrix result.
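The sketch below shows this epilogue-style fusion on a deliberately naive GEMM loop nest; a production kernel would be tiled and vectorized, and the names here are illustrative:

#include <stddef.h>

/* Sketch: bias-add and ReLU fused into the write-back ("epilogue") of a
   naive GEMM. The epilogue adds a couple of scalar ops per output element
   on top of K multiply-adds, so its cost is negligible. */
void gemm_bias_relu(const float *A, const float *B, const float *bias,
                    float *C, size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++) {
                acc += A[i * K + k] * B[k * N + j];   /* compute-bound core   */
            }
            acc += bias[j];                            /* fused bias add       */
            C[i * N + j] = acc > 0.0f ? acc : 0.0f;    /* fused ReLU, one write */
        }
    }
}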
The compiler cannot simply merge every adjacent node. It must respect data dependencies and ensure the transformation preserves the program's logic. The optimizer builds a Directed Acyclic Graph (DAG) and analyzes the edges to determine fusion validity.
A primary constraint is the cycle check. If Node A feeds into Node B, and we want to fuse them, we must ensure that fusing them does not create a cycle or invalidate another path. For instance, if Node A produces an output used by both Node B and Node C, fusing A into B typically requires A's computation to be duplicated for B, or for A to still be written to memory for C to use.
If the compiler fuses A into B but C still needs the output of A, the compiler must verify if re-computing A inside B is cheaper than reading it from memory. This strategy, known as rematerialization, trades extra computation for reduced memory bandwidth.
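A minimal sketch of that trade-off, with hypothetical ops and names: node A (an exponential) feeds two consumers, and each fused consumer simply recomputes it instead of reading a materialized tensor:

#include <math.h>
#include <stddef.h>

/* Sketch: node A (exp) feeds both B (scale) and C (shift). Instead of
   writing A's output to global memory, each fused consumer recomputes it,
   trading a few extra FLOPs for one less tensor of memory traffic. */
void fused_b(const float *x, float *b_out, float scale, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float a = expf(x[i]);      /* A recomputed inside B's kernel */
        b_out[i] = a * scale;      /* B */
    }
}

void fused_c(const float *x, float *c_out, float shift, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float a = expf(x[i]);      /* A recomputed again inside C's kernel */
        c_out[i] = a + shift;      /* C */
    }
}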
ML compilers typically use a rule-based or cost-based approach to identify candidates. A common rule-based pattern is matching a fixed sequence such as Conv2D -> Add -> ReLU and replacing it with a single fused operator.
The chart below illustrates the potential performance gain and the memory traffic reduction when applying fusion to a standard block of operations.
Performance improvements usually stem from reduced global memory traffic rather than reduced operation counts.
In modern compiler stacks like TVM or MLIR, fusion is implemented as a specific "Pass." The pass traverses the Intermediate Representation (IR). When it identifies a fusible pattern, it replaces the sequence of nodes with a single FusedOp node. This new node contains a composite function body.
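As a rough illustration of the matching step, the sketch below checks for the Conv2D -> Add -> ReLU chain discussed earlier. The Node type, the op codes, and the rewrite helper are hypothetical stand-ins, not real compiler APIs; TVM and MLIR each expose their own node and rewrite interfaces:

/* Hypothetical IR node and a rule-based matcher for Conv2D -> Add -> ReLU. */
typedef enum { OP_CONV2D, OP_ADD, OP_RELU, OP_FUSED_CONV_BIAS_RELU } OpKind;

typedef struct Node {
    OpKind op;
    struct Node *inputs[4];
    int num_users;   /* how many downstream nodes consume this output */
} Node;

/* Toy rewrite: mutate the ReLU node in place into the fused op and take over
   the convolution's inputs; a real pass builds a new node and fixes up edges. */
static void replace_with_fused(Node *conv, Node *add, Node *relu, OpKind fused_op) {
    relu->op = fused_op;
    relu->inputs[0] = conv->inputs[0];   /* activation input    */
    relu->inputs[1] = conv->inputs[1];   /* convolution weights */
    relu->inputs[2] = add->inputs[1];    /* bias term           */
}

int try_fuse_conv_bias_relu(Node *relu) {
    if (relu->op != OP_RELU) return 0;
    Node *add = relu->inputs[0];
    /* Intermediates must have a single consumer, otherwise their outputs
       would still need to be materialized for the other users. */
    if (add == NULL || add->op != OP_ADD || add->num_users != 1) return 0;
    Node *conv = add->inputs[0];
    if (conv == NULL || conv->op != OP_CONV2D || conv->num_users != 1) return 0;
    replace_with_fused(conv, add, relu, OP_FUSED_CONV_BIAS_RELU);
    return 1;
}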
Later in the compilation pipeline, during code generation, the compiler lowers the FusedOp's composite body into a tight loop nest. The backend handles register allocation for the intermediate variables, ensuring they are never written to global memory.
By understanding fusion, you can write models that are more compiler-friendly. For instance, using custom compound loss functions often allows the compiler to generate a single optimized kernel, whereas writing the same logic as scattered Python operators might require the compiler to work harder to deduce the fusion opportunity.