Operator fusion is a foundational graph optimization, primarily aimed at reducing the overhead associated with kernel launches and intermediate memory traffic. Simple fusion strategies often combine vertically adjacent, element-wise operations (like Add followed by ReLU) or sometimes horizontally parallel operations writing to the same output buffer. However, achieving maximum performance, especially on modern accelerators with high compute density but often limited memory bandwidth, requires more aggressive fusion techniques. These go beyond simple patterns and rely on sophisticated analysis and cost modeling.
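Before looking at the aggressive variants, a minimal sketch of the simple element-wise case helps fix the baseline. The NumPy code below (function names are illustrative, not from any particular framework) contrasts an unfused Add -> ReLU pair, which materializes an intermediate tensor, with a fused single-pass version.

```python
import numpy as np

def add_relu_unfused(x, b):
    # Two separate "kernels": the intermediate `summed` is a full tensor
    # that would live in main memory between the two launches.
    summed = x + b                   # kernel 1: Add
    return np.maximum(summed, 0.0)   # kernel 2: ReLU

def add_relu_fused(x, b):
    # One "kernel": each element is added and clamped in a single pass,
    # so the intermediate sum never exists as a full tensor.
    out = np.empty_like(x)
    for i in np.ndindex(x.shape):
        s = x[i] + b[i]              # intermediate stays "in a register"
        out[i] = s if s > 0.0 else 0.0
    return out

x = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(4, 8).astype(np.float32)
assert np.allclose(add_relu_unfused(x, b), add_relu_fused(x, b))
```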
Aggressive fusion seeks to combine operations even when the pattern isn't a straightforward linear chain or purely element-wise. The core idea is to maximize the work done per kernel launch and minimize data movement between main memory and the compute units.
A significant performance gain comes from fusing compute-intensive operations (like convolution or matrix multiplication) with subsequent element-wise operations. Consider a common pattern: Conv -> BiasAdd -> ReLU.
Fusion of a compute-bound operation (Conv2D) with subsequent element-wise operations (BiasAdd, ReLU) into a single kernel. The intermediate tensors feature_map and biased_map are eliminated from global memory.
In the unfused case, the result of Conv2D (feature_map) is written to memory, then read by BiasAdd. Its result (biased_map) is written to memory, and then read by ReLU. Each step involves significant memory I/O and a separate kernel invocation. Fusing these into a single kernel (Fused: Conv2D_BiasAdd_ReLU) allows the intermediate results to potentially stay within registers or caches (like GPU shared memory or CPU L1/L2 cache), drastically reducing memory bandwidth consumption and latency. The element-wise operations (BiasAdd, ReLU) often have negligible compute cost compared to the memory access time they save when fused. This type of fusion, combining compute-bound producers with element-wise consumers, is highly effective.
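The sketch below illustrates the idea in plain NumPy with a naive direct convolution (shapes and function names are illustrative, not a real kernel): the unfused version materializes feature_map and biased_map, while the fused version applies the bias add and ReLU to each output value inside the convolution loop.

```python
import numpy as np

def conv2d_bias_relu_unfused(x, w, bias):
    # x: (H, W, Cin), w: (KH, KW, Cin, Cout), bias: (Cout,)
    H, W, Cin = x.shape
    KH, KW, _, Cout = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    feature_map = np.zeros((OH, OW, Cout), dtype=x.dtype)  # intermediate #1
    for i in range(OH):
        for j in range(OW):
            patch = x[i:i+KH, j:j+KW, :]
            feature_map[i, j, :] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    biased_map = feature_map + bias        # second kernel, intermediate #2
    return np.maximum(biased_map, 0.0)     # third kernel

def conv2d_bias_relu_fused(x, w, bias):
    # Same math, single loop nest: bias add and ReLU are applied while the
    # accumulator is still live, so no intermediate tensor is materialized.
    H, W, Cin = x.shape
    KH, KW, _, Cout = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    out = np.empty((OH, OW, Cout), dtype=x.dtype)
    for i in range(OH):
        for j in range(OW):
            patch = x[i:i+KH, j:j+KW, :]
            acc = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2])) + bias
            out[i, j, :] = np.maximum(acc, 0.0)  # epilogue fused into the loop
    return out

x = np.random.randn(8, 8, 3).astype(np.float32)
w = np.random.randn(3, 3, 3, 4).astype(np.float32)
b = np.random.randn(4).astype(np.float32)
assert np.allclose(conv2d_bias_relu_unfused(x, w, b),
                   conv2d_bias_relu_fused(x, w, b), atol=1e-5)
```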
Aggressive fusion isn't limited to linear chains. It can handle scenarios where an operation's output is consumed by multiple subsequent operations.
Fusion involving a node (A) with multiple consumers (B, C). The intermediate result is computed once and used immediately by the fused logic for both B and C within the same kernel, avoiding writing it to memory.
Here, Op A produces an intermediate result used by both Op B and Op C. If B and C are suitable candidates (e.g., element-wise), they can be fused with A. The fused kernel computes the result of A and then immediately computes both B and C using that result, potentially keeping the intermediate data in registers or on-chip cache. This avoids writing the intermediate tensor back to memory and then reading it twice (once for B, once for C). This pattern effectively combines vertical fusion (A with B, A with C) and implicitly handles the common producer scenario.
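A minimal sketch of the multi-consumer pattern, again with illustrative NumPy functions: Op A is an exponential whose result feeds two element-wise consumers, and the fused version computes it once per element and uses it for both outputs.

```python
import numpy as np

def multi_consumer_unfused(x):
    a = np.exp(x)    # Op A: intermediate written to memory
    b = a * 2.0      # Op B: reads the intermediate back
    c = a + 1.0      # Op C: reads the intermediate again
    return b, c

def multi_consumer_fused(x):
    # One pass: A's result is computed once per element and consumed
    # immediately by both B and C, never stored as a full tensor.
    b = np.empty_like(x)
    c = np.empty_like(x)
    for i in np.ndindex(x.shape):
        a = np.exp(x[i])   # intermediate stays "in a register"
        b[i] = a * 2.0     # fused consumer B
        c[i] = a + 1.0     # fused consumer C
    return b, c

x = np.random.randn(16).astype(np.float32)
assert all(np.allclose(u, f) for u, f in zip(multi_consumer_unfused(x),
                                             multi_consumer_fused(x)))
```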
Deciding whether to fuse operations aggressively requires a cost model. Blindly fusing everything can be detrimental. For instance, fusing too many compute-intensive operations might create a kernel that requires more registers than available (causing register spilling to local or global memory, which is slow), or lead to instruction cache misses if the combined kernel code becomes too large and complex.
A practical cost model estimates the performance impact of a potential fusion based on several factors: the compute cost of each candidate operation, the memory traffic for the intermediate tensors that fusion would eliminate, the number of kernel launches saved, and the resource pressure (registers, shared memory, code size) of the resulting fused kernel.
The compiler uses this model, often parameterized for specific hardware targets, to evaluate potential fusion candidates (pairs or groups of operations). A fusion is performed only if the estimated cost (e.g., execution time) of the fused configuration is predicted to be lower than the cost of the unfused configuration.
$$\text{PredictedCost}(\text{FusedOp}) < \sum_{i \in \text{Ops}} \text{PredictedCost}(\text{Op}_i) + \sum_{j \in \text{Intermediates}} \text{PredictedCost}(\text{MemIO}_j) + N_{\text{saved}} \times \text{PredictedCost}(\text{Launch})$$
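The sketch below implements this decision rule in Python. The launch overhead and effective bandwidth constants are illustrative, not measured values; a real compiler would calibrate them per hardware target.

```python
from dataclasses import dataclass

# Illustrative hardware constants (assumptions, not measurements): time per
# kernel launch and effective time per byte of global-memory traffic.
LAUNCH_OVERHEAD_US = 5.0
US_PER_BYTE = 1e-3 / 300.0   # ~300 GB/s effective bandwidth

@dataclass
class Op:
    name: str
    compute_us: float          # estimated compute time of the op alone
    intermediate_bytes: int    # size of the output it feeds to the next op

def unfused_cost(ops):
    # Each op pays a launch plus its compute; every intermediate output is
    # written to and read back from global memory.
    cost = 0.0
    for i, op in enumerate(ops):
        cost += LAUNCH_OVERHEAD_US + op.compute_us
        if i < len(ops) - 1:
            cost += 2 * op.intermediate_bytes * US_PER_BYTE
    return cost

def fused_cost(ops, overhead_factor=1.05):
    # One launch, summed compute; a small factor stands in for register
    # pressure and code-size effects in the larger fused kernel.
    return LAUNCH_OVERHEAD_US + overhead_factor * sum(op.compute_us for op in ops)

ops = [Op("Conv2D", 120.0, 4 * 1024 * 1024),
       Op("BiasAdd", 2.0, 4 * 1024 * 1024),
       Op("ReLU", 2.0, 0)]
print("fuse" if fused_cost(ops) < unfused_cost(ops) else "keep separate")
```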
Different hardware targets necessitate different cost models and heuristics. A memory-bandwidth-bound GPU might favor fusion more aggressively to reduce off-chip memory access, while a CPU with large, fast caches might prioritize fusion differently, perhaps focusing more on instruction pipeline utilization.
In certain situations, it might be beneficial to fuse an operation with one of its consumers even if another consumer also needs the original output. Instead of forcing the intermediate result to be written to memory for the second consumer, the compiler might choose to recompute the intermediate value within a separate fused kernel for that second consumer. This is viable if the computation cost of the intermediate operation is low compared to the cost of a round trip to main memory. This strategy is particularly relevant for operations that are computationally inexpensive but produce large tensors, such as broadcasting or certain element-wise transformations.
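A simple version of this recomputation heuristic reduces to comparing the estimated recompute time against the memory round trip it avoids. The bandwidth constant below is the same illustrative figure used in the cost-model sketch above.

```python
# Illustrative constant (assumption): ~300 GB/s effective global-memory bandwidth.
US_PER_BYTE = 1e-3 / 300.0

def should_recompute(intermediate_bytes, recompute_us):
    # Recompute the intermediate inside the second consumer's fused kernel
    # if redoing the math is cheaper than writing it out and reading it back.
    memory_round_trip_us = 2 * intermediate_bytes * US_PER_BYTE
    return recompute_us < memory_round_trip_us

# A cheap broadcast producing a large tensor: recomputation usually wins.
print(should_recompute(intermediate_bytes=16 * 1024 * 1024, recompute_us=3.0))   # True
# An expensive reduction producing a small tensor: better to materialize it.
print(should_recompute(intermediate_bytes=4 * 1024, recompute_us=50.0))          # False
```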
Aggressive fusion often interacts closely with data layout transformations (discussed in memory-aware-layout-transformations). Sometimes, fusion is only possible or beneficial if the layouts of the involved tensors match (e.g., both expect NCHW) or are transformed as part of the fusion process. For instance, a fused Conv-ReLU kernel might internally operate in NHWC for efficiency on a specific target, even if the surrounding graph uses NCHW, incorporating layout changes within the kernel's prologue and epilogue. Conversely, fusing operations can simplify the graph and remove intermediate tensors, potentially enabling a beneficial layout transformation for subsequent operations that was previously blocked.
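A sketch of this prologue/epilogue pattern, assuming a target that prefers NHWC internally; the 1x1 convolution body is a simplified stand-in for a real fused kernel.

```python
import numpy as np

def fused_conv_relu_nchw(x_nchw, w, bias):
    # Prologue: transpose the graph-level NCHW input into the kernel's
    # assumed preferred internal layout (NHWC).
    x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))

    # Kernel body (stand-in): a 1x1 convolution expressed as a matmul over
    # the channel axis, with bias add and ReLU fused into the same pass.
    n, h, w_, cin = x_nhwc.shape
    out_nhwc = np.maximum(x_nhwc.reshape(-1, cin) @ w + bias, 0.0)
    out_nhwc = out_nhwc.reshape(n, h, w_, -1)

    # Epilogue: transpose back so the surrounding NCHW graph is unaffected.
    return np.transpose(out_nhwc, (0, 3, 1, 2))

x = np.random.randn(2, 3, 8, 8).astype(np.float32)   # NCHW input
w = np.random.randn(3, 4).astype(np.float32)          # 1x1 conv weights: (Cin, Cout)
b = np.random.randn(4).astype(np.float32)
print(fused_conv_relu_nchw(x, w, b).shape)             # (2, 4, 8, 8)
```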
Implementing effective and aggressive fusion strategies presents several engineering challenges, including generating efficient code for arbitrary fused operator combinations, keeping the search over fusion candidates tractable, and maintaining accurate cost models across diverse hardware targets.
Despite these complexities, aggressive operator fusion is an indispensable technique in modern ML compilers and runtimes (like TensorFlow XLA, PyTorch JIT (via NNC/Fuser), Apache TVM, ONNX Runtime, TensorRT). The performance improvements, primarily driven by the substantial reduction in memory bandwidth requirements and kernel launch overhead, are often critical for achieving competitive inference and training speeds on contemporary hardware. It effectively restructures the computation graph into larger, more hardware-friendly execution units.