Machine learning frameworks like PyTorch and TensorFlow operate in two distinct modes: eager execution and graph execution. Eager mode executes operations immediately as the Python interpreter reaches them. This allows for easy debugging and intuitive coding, but it presents a significant hurdle for optimization. The compiler cannot see the "full picture" of your model; it only sees one operation at a time.

To perform global optimizations, such as fusing layers or reordering memory access, the system must capture the entire computation sequence into a static structure known as a computation graph. This process of converting imperative Python code into a declarative graph representation is the first important step in the ML compilation pipeline.

## The Eager Execution Bottleneck

In standard Python execution, the interpreter interacts with the framework's dispatcher for every single operation. If you have a loop that runs a matrix multiplication a thousand times, the Python interpreter must issue the call a thousand times.

Consider the following mathematical operation:

$$z = \text{ReLU}(x \cdot y + b)$$

In an eager environment, the execution flow looks like this:

1. Python reads `x * y`.
2. Python calls the C++ kernel for multiplication.
3. The kernel returns a temporary tensor.
4. Python reads `... + b`.
5. Python calls the C++ kernel for addition.
6. The kernel returns another temporary tensor.
7. Python calls the ReLU function.

This introduces "dispatch overhead." The time spent switching between Python and the underlying C++ runtime can sometimes exceed the time spent on the actual computation, especially for small operators. To eliminate this overhead and enable operator fusion, we must capture these operations into an intermediate representation (IR) that decouples the model logic from the Python interpreter.

## Mechanisms of Graph Capture

There are two primary methods to capture a graph from a dynamic framework: tracing and scripting.

### Tracing

Tracing is the process of recording operations as they are executed. To trace a model, you pass a dummy input (often called an example input) through the network. The framework does not merely compute the result; it records every mathematical operation that occurs on the input tensors.

The mechanism works by using proxy objects. When the framework sees a proxy tensor entering an arithmetic operation, it creates a node in the graph representing that operation instead of executing it immediately (or in addition to executing it).

*[Figure] The flow of data during the tracing process, where Python execution is recorded into a static graph: a dummy input tensor feeds the model definition, and the tracer/JIT compiler observes the execution flow, tracks data dependencies back to the input, and records each operation into a static computation graph (IR).*

Tracing is effective because it does not need to parse Python source code. It simply observes what happens. If your model calls a third-party library that eventually performs a PyTorch tensor operation, the tracer will catch it, provided the data flow remains connected to the input tensors.
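To make the proxy mechanism concrete, here is a minimal sketch using `torch.fx`, whose `symbolic_trace` feeds `Proxy` objects through a function and records every operation they encounter as a graph node. The function `f` is a hypothetical example, not something defined in the text above.

```python
import torch
from torch import fx

def f(x, y):
    # Under symbolic_trace, x and y are Proxy objects: each operation
    # on them creates a graph node rather than computing a value.
    return torch.relu(x * y + 1.0)

gm = fx.symbolic_trace(f)

# The recorded graph lists mul, add, and relu as explicit nodes
# connected by data dependencies.
print(gm.graph)

# The resulting GraphModule is still callable with real tensors.
print(gm(torch.ones(3), torch.ones(3)))  # tensor([2., 2., 2.])
```

Printing `gm.graph` shows the same kind of operation-by-operation record that any tracer builds, decoupled from the original Python function.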
### Scripting

Scripting involves analyzing the Python source code directly (often by inspecting the Abstract Syntax Tree, or AST). It converts Python control structures like `if`, `for`, and `while` directly into their graph equivalents. While scripting preserves control flow logic, it is generally more brittle than tracing because it requires the Python code to strictly adhere to a subset of the language that the compiler understands.

## The Trace Mechanism in Practice

Let us look at how tracing handles a simple linear transformation followed by an activation. This is a standard pattern found in dense layers.

```python
import torch
import torch.nn as nn

class SimpleLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.linear(x))

# Instantiate model and dummy input
model = SimpleLayer()
dummy_input = torch.randn(1, 10)

# Trace the model
traced_graph = torch.jit.trace(model, dummy_input)

# The traced_graph is now independent of the Python class
print(traced_graph.graph)
```

When `torch.jit.trace` is called, the compiler runs the `forward` method. It observes a matrix multiplication (from `linear`), an addition (the bias), and a ReLU operation. It outputs a graph that looks roughly like this in Intermediate Representation (IR):

```
%1 = matmul(%x, %W)
%2 = add(%1, %b)
%3 = relu(%2)
return %3
```

This IR contains no Python overhead. It is a pure description of data dependencies. This graph can now be serialized, loaded into a C++ runtime, or passed to a lower-level compiler like TVM or XLA for hardware mapping.

## Handling Control Flow

A significant limitation of tracing is its inability to capture dynamic control flow. Because tracing records the path taken during the specific run with the dummy input, it effectively "bakes in" logic decisions.

Consider a function with a conditional statement:

```python
def risky_function(x):
    if x.sum() > 0:
        return x * 2
    else:
        return x - 1
```

If you trace this function using an input where `x.sum() > 0`, the tracer records only the multiplication path. The resulting graph will look like:

$$y = x \cdot 2$$

The `if` statement is completely removed. If you later run this compiled graph with an input where `x.sum() <= 0`, it will still execute the multiplication path, leading to incorrect results.

For models with static structures (like standard ResNets or Transformers, where the structure doesn't change based on input data), tracing is highly effective. For models requiring dynamic logic (like recursive networks or loops with variable bounds), scripting or specialized control-flow operators are necessary; the sketch below shows both behaviors side by side.
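The difference is easy to observe directly. The following sketch, assuming the `risky_function` defined above, traces the function with a positive example input and then evaluates both the traced and scripted versions on a negative input; the specific values are chosen purely for illustration.

```python
import torch

def risky_function(x):
    if x.sum() > 0:
        return x * 2
    else:
        return x - 1

pos = torch.full((3,), 3.0)
neg = torch.full((3,), -3.0)

# Tracing runs the function once and records only the branch taken.
# PyTorch emits a TracerWarning here because the `if` condition
# depends on tensor data.
traced = torch.jit.trace(risky_function, pos)
print(traced(neg))    # tensor([-6., -6., -6.]) -- wrong: still multiplies by 2

# Scripting compiles the source, so both branches survive in the graph.
scripted = torch.jit.script(risky_function)
print(scripted(neg))  # tensor([-4., -4., -4.]) -- correct: takes the else branch
```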
## Benefits of Graph Capture

Once the graph is captured, the compiler gains a global view of the program. This enables several optimizations that are impossible in eager mode:

- **Dead Code Elimination:** If a part of the graph does not contribute to the output, it can be pruned.
- **Common Subexpression Elimination:** Redundant calculations appearing multiple times can be computed once and reused.
- **Algebraic Simplification:** Mathematical operations can be simplified (e.g., combining transposes).
- **Operator Fusion:** This is the most critical optimization. The compiler can merge the multiplication, addition, and ReLU from our previous example into a single kernel launch, drastically reducing memory bandwidth usage.

The following chart illustrates the reduction in kernel launches achieved by capturing and fusing operations.

*[Figure: Impact of Graph Capture on Kernel Launches] Comparison of GPU kernel launches between standard eager execution (15 launches) and a captured, fused graph (4 launches) for a sequence of tensor operations.*

## From Capture to Optimization

Graph capture is the bridge between high-level framework code and low-level hardware implementation. By successfully tracing a model, you convert a Python-dependent program into a portable, optimizable intermediate representation. This representation serves as the input for the subsequent stages of the compilation stack, where we apply graph-level transformations and loop-level optimizations to maximize hardware efficiency.
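As a rough, hands-on way to observe the dispatch overhead that capture removes, here is a micro-benchmark sketch comparing an eager module against its traced counterpart. The model, input size, and iteration count are illustrative assumptions; absolute timings depend on hardware, and at this tiny scale any gap mostly reflects Python dispatch rather than fusion.

```python
import time
import torch
import torch.nn as nn

# A small illustrative model; any static-structure module works here.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU())
x = torch.randn(1, 10)
traced = torch.jit.trace(model, x)

def bench(fn, iters=10_000):
    # Warm up so JIT specialization and caches do not skew the timing.
    for _ in range(100):
        fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return time.perf_counter() - start

print(f"eager:  {bench(model):.3f} s")
print(f"traced: {bench(traced):.3f} s")
```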