Redundancy is a frequent artifact in machine learning models. While a data scientist typically avoids writing the same line of Python code twice, the expansion of high-level operators into low-level primitives often introduces duplicate calculations. Common Subexpression Elimination (CSE) is a fundamental optimization pass that detects these redundancies and restructures the graph to compute a result once and reuse it.

In the context of a generic compiler, CSE identifies identical expressions like $a + b$ appearing in multiple locations. If neither $a$ nor $b$ changes between these occurrences, the compiler calculates the sum once, stores it in a temporary variable, and replaces subsequent instances with that variable.

For machine learning compilers working with computation graphs (DAGs), the process operates on nodes and edges rather than lines of source code. The goal is to identify subgraphs that produce the exact same tensor result. Removing the duplicate subgraph directly reduces the number of arithmetic operations (FLOPs) required for a forward pass.

## Structural Equivalence

A compiler cannot assume two operations are identical simply because they use the same operator type. To safely perform CSE, the compiler must establish structural equivalence. Two nodes are structurally equivalent if they satisfy three conditions:

1. **Identical opcode:** Both nodes perform the same operation, such as `MatMul` or `ReLU`.
2. **Identical inputs:** The incoming edges for both nodes originate from the exact same source nodes in the graph.
3. **Identical attributes:** Any static parameters associated with the node, such as strides in a convolution or the epsilon value in normalization, must match exactly.

Consider a scenario where a model computes two branches. Both branches begin by normalizing the input data. In the raw graph representation, this appears as two distinct `Batch_Norm` nodes. Since they share the same input tensor and parameters, the compiler collapses them into a single node.

```dot
digraph G {
  rankdir=TB;
  bgcolor="transparent";
  node [style=filled, fontname="Arial", shape=box, fontsize=12, color="#adb5bd"];
  edge [color="#868e96"];
  subgraph cluster_0 {
    label="Before CSE"; fontname="Arial"; color="#dee2e6"; style="rounded";
    in1 [label="Input Tensor", fillcolor="#e9ecef"];
    op1 [label="Conv2D (3x3)", fillcolor="#a5d8ff"];
    op2 [label="Conv2D (3x3)", fillcolor="#ffc9c9"];
    out1 [label="ReLU", fillcolor="#e9ecef"];
    out2 [label="Sigmoid", fillcolor="#e9ecef"];
    in1 -> op1;
    in1 -> op2;
    op1 -> out1;
    op2 -> out2;
  }
  subgraph cluster_1 {
    label="After CSE"; fontname="Arial"; color="#dee2e6"; style="rounded";
    in_opt [label="Input Tensor", fillcolor="#e9ecef"];
    op_opt [label="Conv2D (3x3)", fillcolor="#a5d8ff"];
    out1_opt [label="ReLU", fillcolor="#e9ecef"];
    out2_opt [label="Sigmoid", fillcolor="#e9ecef"];
    in_opt -> op_opt;
    op_opt -> out1_opt;
    op_opt -> out2_opt;
  }
}
```

Identification of redundant nodes in a computation graph. The red node in the first subgraph represents a duplicate operation that is eliminated in the second subgraph by reusing the output of the blue node.
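The three conditions translate directly into a comparison routine. The following is a minimal sketch, assuming a hypothetical `Node` record that stores its opcode, references to its producer nodes, and a tuple of static attributes; none of these names come from a real compiler API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    """Minimal stand-in for a graph node (hypothetical, for illustration)."""
    opcode: str          # e.g. "Conv2D" or "BatchNorm"
    inputs: tuple        # references to the producer Node objects
    attributes: tuple    # static params, e.g. (("epsilon", 1e-5),)

def structurally_equivalent(a: Node, b: Node) -> bool:
    """CSE may merge two nodes only if all three conditions hold."""
    return (
        a.opcode == b.opcode                                  # 1. same opcode
        and len(a.inputs) == len(b.inputs)                    # 2. incoming edges from
        and all(x is y for x, y in zip(a.inputs, b.inputs))   #    the exact same producers
        and a.attributes == b.attributes                      # 3. same static attributes
    )

# The two normalization branches from the example above:
x = Node("Input", (), ())
bn1 = Node("BatchNorm", (x,), (("epsilon", 1e-5),))
bn2 = Node("BatchNorm", (x,), (("epsilon", 1e-5),))
assert structurally_equivalent(bn1, bn2)   # safe to collapse into one node
```

Note that condition 2 compares producers by identity (`is`), matching the requirement that the edges originate from the very same source nodes, not merely from nodes that look alike.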
## The CSE Algorithm

Implementing CSE requires an efficient method to detect duplicates: comparing every node against every other node would cost $O(N^2)$. Instead, compilers typically use a hashing approach to achieve near-linear time complexity.

The pass traverses the graph in topological order. For each node visited, the compiler generates a hash signature based on the node's operation type, input indices, and attributes. This signature serves as a unique identifier for the computation being performed.

$$ H_{node} = \text{hash}(\text{opcode}, [H_{input_1}, H_{input_2}, \ldots], \text{attributes}) $$

The compiler maintains a hash map (or a dictionary) where the keys are these signatures and the values are references to the nodes that generated them.

1. **Visit Node A:** Compute hash $H_A$. Check the map. If $H_A$ is not found, add $(H_A \rightarrow Node_A)$ to the map.
2. **Visit Node B:** Compute hash $H_B$. If Node B performs the same operation on the same inputs as Node A, then $H_B$ will equal $H_A$.
3. **Detection:** The compiler finds $H_B$ in the map and retrieves the reference to Node A.
4. **Replacement:** The compiler reroutes all edges that were originally pointing to Node B so that they now point to Node A.
5. **Removal:** Node B is left with no outputs. A subsequent Dead Code Elimination (DCE) pass will remove it from the graph.
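Put together, the whole pass fits in a few lines. The sketch below is one way to realize it, using a hypothetical `Op` record and a graph stored as a dict keyed by node id in topological order; the dictionary key plays the role of the signature $H_{node}$, and the returned map from each id to its canonical survivor is what a later rewrite/DCE step would consume. All names here are illustrative, not taken from any particular compiler:

```python
from dataclasses import dataclass

@dataclass
class Op:
    """One operation in the DAG (hypothetical IR, for illustration only)."""
    opcode: str
    inputs: list                 # producer ids (ints) or immediate constants
    attributes: tuple = ()       # static params, e.g. (("strides", (1, 1)),)

def cse(ops: dict) -> dict:
    """Map every op id to its canonical survivor.

    `ops` must be keyed in topological order, so every producer is
    visited before its consumers. Duplicates map to the first op that
    produced the same signature.
    """
    canonical = {}   # old id -> surviving id
    seen = {}        # signature -> surviving id
    for node_id, op in ops.items():
        # Canonicalize inputs first so the signature reflects surviving
        # producers; this lets chains of duplicates collapse transitively.
        inputs = tuple(canonical.get(i, i) for i in op.inputs)
        signature = (op.opcode, inputs, op.attributes)
        if signature in seen:
            canonical[node_id] = seen[signature]   # duplicate detected
        else:
            seen[signature] = node_id              # first occurrence
            canonical[node_id] = node_id
    return canonical
```

Because inputs are rewritten through the `canonical` map before the signature is formed, merging two producers automatically makes their duplicate consumers hash identically on the next iteration, so whole redundant subgraphs collapse in a single pass.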
## Common Scenarios in Deep Learning

While users rarely write explicitly redundant code, duplicated subexpressions arise naturally from higher-level abstractions.

**Multi-Head Attention.** In Transformer architectures, the Query, Key, and Value projections often involve slicing a large tensor or applying linear transformations. If the specific implementation separates these operations but uses identical weight matrices for initialization or reshaping logic, CSE can consolidate the setup steps.

**Gradient Computation.** The most prevalent source of redundancy is the backward pass. Automatic differentiation engines generate gradients by applying the chain rule. This process frequently produces subgraphs that mirror parts of the forward pass or duplicate calculations across different branches of the gradient graph. Compilers like XLA (Accelerated Linear Algebra) or TVM rely heavily on CSE to clean up the verbose graphs generated by autograd systems.

**Hyperparameter Search and Architecture Search.** When using Neural Architecture Search (NAS), the generated candidate graphs often contain overlapping branches. CSE helps identify the shared components, allowing the system to verify whether parts of the model have already been computed.

## Memory and Lifetime Implications

Although CSE reduces the operation count, it is not always strictly beneficial for performance. The trade-off lies in memory usage.

When a subexpression is eliminated, the result of the remaining node must be kept alive in memory until the last consumer reads it. If the two original occurrences were far apart in the execution schedule, the optimized graph requires storing the tensor for a longer duration, increasing the peak memory footprint.

Consider a tensor $T$ used at the beginning of the network and again at the very end.

- **Without CSE:** $T$ is computed, used, and discarded. Then $T$ is recomputed at the end, used, and discarded. Peak memory is low.
- **With CSE:** $T$ is computed once. It must sit in GPU memory for the entire duration of the inference pass so it can be used at the end. This increases memory pressure.

Advanced compilers use a cost model to decide whether to apply CSE. If recomputing a value is cheap (like a scalar addition) but storing it is expensive (high register pressure or blocking memory allocation), the compiler might choose to perform rematerialization, deliberately skipping CSE to save memory.

## Verification Example

To understand how a compiler views this, look at the textual representation of a graph (IR). Suppose we have a function that takes a tensor `%x`.

Original IR:

```
%1 = multiply(%x, 2.0)
%2 = add(%1, 5.0)
%3 = multiply(%x, 2.0)   // Redundant calculation
%4 = subtract(%3, 1.0)
%5 = add(%2, %4)
```

The compiler tracks the definition of `%1`. When it encounters `%3`, it sees that the opcode `multiply`, the input `%x`, and the constant `2.0` match the definition of `%1`.

Optimized IR:

```
%1 = multiply(%x, 2.0)
%2 = add(%1, 5.0)
// %3 is replaced by %1
%4 = subtract(%1, 1.0)
%5 = add(%2, %4)
```

The operation count drops from 5 to 4. In large-scale models with billions of parameters, aggregating these small savings across repeated blocks results in measurable latency improvements. By systematically identifying these patterns, the compiler ensures that the hardware spends cycles only on unique, necessary mathematics.
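As a closing sanity check, the toy pass sketched in the algorithm section reproduces exactly this rewrite. The snippet below reuses the hypothetical `Op` and `cse` definitions from that sketch, with a string standing in for the tensor argument `%x` and floats for the constants:

```python
# Reusing the hypothetical Op / cse sketch from the algorithm section.
ops = {
    1: Op("multiply", ["x", 2.0]),
    2: Op("add", [1, 5.0]),
    3: Op("multiply", ["x", 2.0]),   # redundant: same signature as %1
    4: Op("subtract", [3, 1.0]),
    5: Op("add", [2, 4]),
}
canonical = cse(ops)
print(canonical)                           # {1: 1, 2: 2, 3: 1, 4: 4, 5: 5}
print(len({canonical[i] for i in ops}))    # 4 live ops, matching the optimized IR
```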