Optimization pipelines in deep learning compilers often function as a cycle of expansion and contraction. While passes like loop tiling or vectorization expand the complexity of the code to fit hardware constraints, graph-level cleanup passes are responsible for the contraction phase. Dead Code Elimination (DCE) and Common Subexpression Elimination (CSE) serve as the garbage collectors of the compilation process. They remove redundancy introduced by previous transformations or high-level framework artifacts, ensuring that the hardware executes only the strictly necessary logic to produce the model's output.

## Common Subexpression Elimination (CSE)

In the context of tensor graphs, CSE identifies identical operations that produce the same value and replaces them with a single computation. This redundancy frequently appears in complex model architectures generated by Neural Architecture Search (NAS) or when high-level frameworks expand compound operators like Softmax or GELU into their primitive constituents.

If a graph contains two nodes, $N_1$ and $N_2$, they are candidates for elimination if:

- They share the same operator type (opcode).
- Their inputs are identical (structural equality).
- Their attributes (strides, padding, dilation) are equivalent.
- The operations are pure (no side effects) and deterministic.

Consider a scenario where a model branch calculates a variance term used for both normalization and a subsequent scaling factor. If the framework lowers this as two distinct sub-graphs, the compiler sees:

$$
\begin{aligned}
t_1 &= \text{mean}(X) \\
t_2 &= (X - t_1)^2 \\
v_1 &= \text{mean}(t_2) \\
&\dots \\
v_2 &= \text{mean}((X - \text{mean}(X))^2)
\end{aligned}
$$

Without CSE, the hardware computes the variance twice. The compiler's job is to detect that the subgraph producing $v_2$ is topologically identical to the one producing $v_1$, rewrite dependencies on $v_2$ to point to $v_1$, and remove the redundant nodes.

### Implementation via Hashing

Efficient CSE implementation relies on value numbering or structural hashing. The compiler traverses the graph in topological order, maintaining a hash map where the key is a composite hash of the operator and its input operands' unique identifiers.

When visiting a node:

1. **Canonicalize:** Ensure attributes are in a standard format (e.g., reordering commutative inputs like $A + B$ vs. $B + A$).
2. **Hash:** Compute $H = \text{hash}(\text{opcode}, \text{inputs}, \text{attrs})$.
3. **Lookup:** Check whether $H$ exists in the table of visited nodes.
4. **Hit:** Replace all downstream usages of the current node with the node found in the table, and mark the current node for removal.
5. **Miss:** Insert the current node into the table and proceed.

A particular challenge in ML compilers involves stochastic operations. Operators like `dropout` or `random_uniform` are not pure; executing them twice yields different results. CSE passes must explicitly blocklist these non-deterministic operators to preserve the correctness of the model's stochastic behavior.
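The sketch below follows this recipe on a deliberately tiny graph IR. The `Node` class, the `COMMUTATIVE` and `NON_DETERMINISTIC` sets, and the identity-based keys are illustrative assumptions rather than any particular compiler's API; production passes typically intern nodes as they are built and hash subtrees recursively.

```python
from dataclasses import dataclass, field

COMMUTATIVE = {"add", "mul"}                       # illustrative op names
NON_DETERMINISTIC = {"dropout", "random_uniform"}  # never merged by CSE

@dataclass(eq=False)  # identity-based equality/hash: each node is a distinct graph object
class Node:
    op: str
    inputs: list = field(default_factory=list)  # list[Node]
    attrs: tuple = ()                            # e.g. (("strides", (1, 1)), ("padding", "SAME"))

def cse(nodes):
    """`nodes` must be in topological order. Returns {redundant node: surviving node}."""
    table = {}        # structural key -> canonical node
    replacement = {}  # redundant node -> node that replaces it

    for node in nodes:
        # Redirect inputs to the surviving copy of any already-merged producer.
        node.inputs = [replacement.get(i, i) for i in node.inputs]

        if node.op in NON_DETERMINISTIC:
            continue  # impure/stochastic ops are blocklisted from merging

        # Canonicalize commutative operands so that A+B and B+A produce the same key.
        operands = sorted(node.inputs, key=id) if node.op in COMMUTATIVE else node.inputs

        key = (node.op, tuple(id(i) for i in operands), node.attrs)
        if key in table:
            replacement[node] = table[key]   # hit: redirect uses, mark node for removal
        else:
            table[key] = node                # miss: this node is now the canonical copy

    return replacement
```

Keying on the identities of already-deduplicated inputs, rather than re-hashing whole subtrees, is what makes the single topological traversal sufficient: by the time a node is visited, any redundancy among its producers has already been collapsed, so structurally identical subgraphs reduce to identical keys.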
```dot
digraph CSE_Example {
  rankdir=TB;
  node [shape=rect, style=filled, fontname="Helvetica", fontsize=10, color="#dee2e6", penwidth=0];
  subgraph cluster_before {
    label="Before CSE"; style=dashed; color="#adb5bd";
    in1 [label="Input X", fillcolor="#e9ecef"];
    op1 [label="Conv2D", fillcolor="#a5d8ff"];
    op2 [label="ReLU", fillcolor="#ffc9c9"];
    op3 [label="Conv2D\n(Identical)", fillcolor="#a5d8ff"];
    op4 [label="ReLU\n(Identical)", fillcolor="#ffc9c9"];
    add [label="Add", fillcolor="#b2f2bb"];
    in1 -> op1; op1 -> op2;
    in1 -> op3; op3 -> op4;
    op2 -> add; op4 -> add;
  }
  subgraph cluster_after {
    label="After CSE"; style=dashed; color="#adb5bd";
    in2 [label="Input X", fillcolor="#e9ecef"];
    opA [label="Conv2D", fillcolor="#a5d8ff"];
    opB [label="ReLU", fillcolor="#ffc9c9"];
    add2 [label="Add", fillcolor="#b2f2bb"];
    in2 -> opA; opA -> opB;
    opB -> add2 [label="x2"];
  }
}
```

Topological structure comparison before and after applying Common Subexpression Elimination. The redundant Conv2D and ReLU branch is merged, effectively halving the compute requirement for that segment.

## Dead Code Elimination (DCE)

Dead Code Elimination removes operations that do not contribute to the final output of the graph. In deep learning, dead nodes often arise as a secondary effect of other optimizations. For example, if an algebraic simplification pass replaces $x * 0$ with a zero tensor, the subgraph that computed $x$ becomes dead code, provided $x$ is not used elsewhere.

DCE is also significant when deploying models trained with auxiliary heads (e.g., Inception or R-CNN architectures), where intermediate outputs used for training supervision are irrelevant during inference.

### The Mark-and-Sweep Approach

Standard DCE in computational graphs operates similarly to garbage collection in memory management. It typically employs a mark-and-sweep algorithm:

1. **Identification of roots:** The compiler identifies the graph's output nodes (the "sinks"). These are inherently live.
2. **Marking:** Starting from the roots, the algorithm traverses the graph backward (upstream), marking every transitive input as "live."
3. **Sweeping:** Any node in the graph not marked as live is disconnected from the computation and removed.

While simple, this process must account for transitive deadness: removing a dead node decreases the reference count of its inputs, and if an input's reference count drops to zero, it also becomes dead. Therefore, DCE is usually run in a loop or with a worklist algorithm until no further nodes can be removed; a minimal worklist sketch follows the example below.

Consider the following graph state after a constant folding pass:

$$
\begin{aligned}
A &= \text{Load}(\text{"weights"}) \\
B &= \text{Const}(0) \\
C &= \text{MatMul}(X, A) \\
D &= \text{Mul}(C, B) \quad \rightarrow \quad D = \text{ZerosLike}(C)
\end{aligned}
$$

Once the compiler replaces the multiplication by zero with a zero tensor, the dependency on $C$ is severed. Consequently, the MatMul operation becomes dead code. Removing MatMul in turn makes $A$ (the weights) dead code. The final graph contains only the logic to generate the zero tensor, saving memory bandwidth by avoiding the weight load entirely.
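The following mark-and-sweep sketch reuses the toy `Node` IR defined in the CSE example above. The op names, and the assumption that the folded zero tensor no longer reads $C$ (its shape being statically known), are illustrative rather than any specific framework's behavior.

```python
def dce(all_nodes, outputs):
    """Mark-and-sweep DCE: `outputs` are the graph sinks, i.e., the roots of liveness."""
    live = set()
    worklist = list(outputs)            # roots: the graph outputs are inherently live

    # Mark: walk upstream from the roots, marking every transitive producer live.
    while worklist:
        node = worklist.pop()
        if node not in live:
            live.add(node)
            worklist.extend(node.inputs)

    # Sweep: anything never marked is disconnected from the outputs and dropped.
    return [n for n in all_nodes if n in live]

# The constant-folding example from the text, with illustrative op names.
X = Node("input")
A = Node("load_weights")
C = Node("matmul", inputs=[X, A])
D = Node("zeros_like")                  # folded Mul(C, 0); shape is known, so C is no longer read
kept = dce([X, A, C, D], outputs=[D])   # kept == [D]
# The MatMul and the weight load are swept away. (A real pass would still retain
# declared graph inputs such as X, even though nothing reads them here.)
```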
## The Optimization Fixpoint

DCE and CSE are rarely run in isolation. They are typically interleaved with other passes like constant folding and canonicalization in a loop, because each transformation often exposes new opportunities for cleanup:

- **Constant folding** computes values known at compile time.
- **DCE** removes the original operations that generated those constants.
- **Canonicalization** rearranges graph structures (e.g., $(A+B)+C \rightarrow A+(B+C)$).
- **CSE** finds duplicates exposed by the canonical form.
- **DCE** cleans up the redundant branches disconnected by CSE.

This cycle repeats until the graph reaches a fixpoint: a state where applying further passes results in no changes to the graph topology.

```json
{
  "data": [
    {
      "x": ["Initial", "Pass 1 (Const Fold)", "Pass 2 (DCE)", "Pass 3 (CSE)", "Pass 4 (DCE)"],
      "y": [100, 100, 85, 70, 65],
      "type": "scatter",
      "mode": "lines+markers",
      "name": "Node Count",
      "line": {"color": "#4dabf7", "width": 3},
      "marker": {"size": 10, "color": "#228be6"}
    },
    {
      "x": ["Initial", "Pass 1 (Const Fold)", "Pass 2 (DCE)", "Pass 3 (CSE)", "Pass 4 (DCE)"],
      "y": [0, 15, 15, 30, 35],
      "type": "bar",
      "name": "Nodes Removed (Cumulative)",
      "marker": {"color": "#ff8787"}
    }
  ],
  "layout": {
    "title": "Graph Contraction Across Optimization Passes",
    "font": {"family": "Helvetica"},
    "yaxis": {"title": "Number of Nodes"},
    "xaxis": {"title": "Optimization Sequence"},
    "showlegend": true,
    "plot_bgcolor": "#f8f9fa",
    "paper_bgcolor": "white",
    "margin": {"l": 50, "r": 20, "t": 50, "b": 50}
  }
}
```

The reduction of graph complexity often occurs in steps. Notice how a DCE pass following a CSE pass (Pass 4) removes the residual nodes isolated by the logical merging of branches.

## Handling Side Effects and Control Flow

Advanced IRs like MLIR or Relay introduce complexity regarding control flow and memory side effects. In these environments, simple graph connectivity is insufficient to determine liveness.

For example, an operation might write to a specific memory buffer that is read later. Even if the operation's direct output tensor is unused, the side effect (the memory write) preserves its liveness. Compilers handling stateful graphs (common in recurrent networks or training graphs) must perform Memory SSA (Static Single Assignment) analysis or alias analysis to ensure that DCE does not strip away necessary memory operations.

Similarly, nodes inside conditional branches (if/else) require careful handling. A node within a branch is dead only if the branch itself is unreachable or if the node's result is unused on every execution path following the merge point of the control flow. This requires the construction of a dominator tree to prove that a specific computation is strictly redundant across all paths.
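A common conservative way to keep such effectful operations alive is to treat them as additional liveness roots. The sketch below extends the toy `dce` helper from the previous example; the `SIDE_EFFECTING` set and its op names are illustrative assumptions, and a production compiler would instead consult per-op effect declarations together with alias or Memory SSA analysis rather than a hard-coded list.

```python
SIDE_EFFECTING = {"assign", "buffer_store", "all_reduce", "print"}  # illustrative op names

def dce_with_effects(all_nodes, outputs):
    # Seed liveness with the graph outputs *and* every side-effecting op, so a
    # buffer write or state update survives even when its tensor result is unused.
    roots = list(outputs) + [n for n in all_nodes if n.op in SIDE_EFFECTING]
    return dce(all_nodes, roots)
```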