Machine learning frameworks like PyTorch and TensorFlow operate in two distinct modes: eager execution and graph execution. Eager mode executes operations immediately as the Python interpreter reaches them. This allows for easy debugging and intuitive coding, but it presents a significant hurdle for optimization. The compiler cannot see the "full picture" of your model; it only sees one operation at a time.
To perform global optimizations, such as fusing layers or reordering memory access, the system must capture the entire computation sequence into a static structure known as a computation graph. This process of converting imperative Python code into a declarative graph representation is the first important step in the ML compilation pipeline.
In standard Python execution, the interpreter interacts with the framework's dispatcher for every single operation. If you have a loop that runs a matrix multiplication a thousand times, the Python interpreter must issue the call a thousand times.
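To make this overhead concrete, here is a rough, machine-dependent sketch: timing a Python loop of small matrix multiplications, where the per-call trip through the interpreter and dispatcher dominates the arithmetic. The shapes and iteration count are arbitrary.

import time
import torch

a = torch.randn(8, 8)
b = torch.randn(8, 8)

# For tensors this small, each iteration's cost is mostly the trip
# through the interpreter and dispatcher, not the multiply itself.
start = time.perf_counter()
for _ in range(1000):
    c = a @ b  # Python issues this dispatch 1,000 times
elapsed = time.perf_counter() - start
print(f"1000 small matmuls: {elapsed * 1e3:.2f} ms")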
Consider the following mathematical operation:
$z = \text{ReLU}(x \cdot y + b)$
In an eager environment, the interpreter dispatches each step separately: first the product x * y, then the addition of b, then the ReLU, crossing the Python/runtime boundary on every call. This introduces "dispatch overhead." The time spent switching between Python and the underlying C++ runtime can sometimes exceed the time spent on the actual computation, especially for small operators. To eliminate this overhead and enable operator fusion, we must capture these operations into an intermediate representation (IR) that decouples the model logic from the Python interpreter.
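Concretely, an eager version of this computation (with arbitrary shapes) issues three separate dispatches, and every intermediate is a fully materialized tensor:

import torch

x = torch.randn(4, 4)
y = torch.randn(4, 4)
b = torch.randn(4)

t1 = x * y            # dispatch 1: a kernel runs, a concrete tensor returns
t2 = t1 + b           # dispatch 2: broadcast add
z = torch.relu(t2)    # dispatch 3: activation
print(t1[0])          # every intermediate is a real value, easy to inspect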
There are two primary methods to capture a graph from a dynamic framework: tracing and scripting.
Tracing is the process of recording operations as they are executed. To trace a model, you pass a dummy input (often called an example input) through the network. The framework does not merely compute the result; it records every mathematical operation that occurs on the input tensors.
The mechanism works by using proxy objects. When the framework sees a proxy tensor entering an arithmetic operation, it creates a node in the graph representing that operation instead of executing it immediately (or in addition to executing it).
Diagram: the flow of data during tracing, where Python execution is recorded into a static graph.
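The snippet below is a deliberately simplified, hypothetical sketch of this mechanism, not the real implementation (PyTorch's tracer and torch.fx are far more involved): a proxy object overloads arithmetic operators and appends a node to a graph instead of computing a value.

class ProxyTensor:
    def __init__(self, name, graph):
        self.name = name
        self.graph = graph  # shared list of recorded graph nodes

    def _record(self, op, other):
        other_name = getattr(other, "name", other)
        self.graph.append(f"%{len(self.graph)} = {op}({self.name}, {other_name})")
        return ProxyTensor(f"%{len(self.graph) - 1}", self.graph)

    def __mul__(self, other):
        return self._record("mul", other)

    def __add__(self, other):
        return self._record("add", other)

graph = []
x = ProxyTensor("%x", graph)
y = ProxyTensor("%y", graph)
z = x * y + 1.0  # ordinary Python code, but it records rather than computes
print("\n".join(graph))
# %0 = mul(%x, %y)
# %1 = add(%0, 1.0)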
Tracing is effective because it does not need to parse Python source code. It simply observes what happens. If your model calls a third-party library that eventually performs a PyTorch tensor operation, the tracer will catch it, provided the data flow remains connected to the input tensors.
Scripting involves analyzing the Python source code directly (often inspecting the Abstract Syntax Tree or AST). It converts Python control structures like if, for, and while directly into their graph equivalents. While scripting preserves control flow logic, it is generally more brittle than tracing because it requires the Python code to strictly adhere to a subset of the language that the compiler understands.
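For instance, torch.jit.script compiles a function's source, so data-dependent branching survives in the resulting graph (the function name below is illustrative):

import torch

@torch.jit.script
def clipped_scale(x: torch.Tensor) -> torch.Tensor:
    # Both branches are kept in the compiled graph as a prim::If node,
    # rather than being specialized to one path.
    if x.sum() > 0:
        return x * 2
    return x - 1

print(clipped_scale.graph)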
Let us look at how tracing handles a simple linear transformation followed by an activation. This is a standard pattern found in dense layers.
import torch
import torch.nn as nn

class SimpleLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.linear(x))

# Instantiate model and dummy input
model = SimpleLayer()
dummy_input = torch.randn(1, 10)

# Trace the model
traced_graph = torch.jit.trace(model, dummy_input)

# The traced_graph is now independent of the Python class
print(traced_graph.graph)
When torch.jit.trace is called, the compiler runs the forward method. It observes a matrix multiplication (from linear), a bias addition, and a ReLU operation, and outputs a graph that looks roughly like this in IR form:
%1 = matmul(%x, %W)
%2 = add(%1, %b)
%3 = relu(%2)
return %3

This IR contains no Python overhead. It is a pure description of data dependencies. This graph can now be serialized, loaded into a C++ runtime, or passed to a lower-level compiler like TVM or XLA for hardware mapping.
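For example, the traced module can be written to disk and reloaded without the original Python class (the filename here is arbitrary):

# The SimpleLayer class definition is not needed to load this file;
# a C++ runtime can load it with torch::jit::load as well.
traced_graph.save("simple_layer.pt")

restored = torch.jit.load("simple_layer.pt")
print(restored(dummy_input).shape)  # torch.Size([1, 10])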
A significant limitation of tracing is its inability to capture dynamic control flow efficiently. Because tracing records the path taken during the specific run with the dummy input, it effectively "bakes in" logic decisions.
Consider a function with a conditional statement:
def risky_function(x):
    if x.sum() > 0:
        return x * 2
    else:
        return x - 1
If you trace this function using an input where x.sum() > 0, the tracer records only the multiplication path. The resulting graph will look like:
$y = x \cdot 2$
The if statement is completely removed. If you later run this compiled graph with an input where x.sum() <= 0, it will still execute the multiplication path, producing incorrect results.
For models with static structures (like standard ResNets or Transformers where the structure doesn't change based on input data), tracing is highly effective. For models requiring dynamic logic (like recursive networks or loops with variable bounds), scripting or specialized control-flow operators are necessary.
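One way to see both behaviors is to compile risky_function each way and feed it an input that should take the other branch:

import torch

pos = torch.ones(3)            # sum > 0: takes the x * 2 branch
neg = torch.full((3,), -2.0)   # sum <= 0: should take the x - 1 branch

# Tracing records only the branch taken for `pos`; PyTorch also emits a
# TracerWarning here about converting a tensor to a Python bool.
traced = torch.jit.trace(risky_function, pos)
print(traced(neg))    # tensor([-4., -4., -4.]) -- wrong: x * 2 was baked in

# Scripting compiles the source, so the conditional survives.
scripted = torch.jit.script(risky_function)
print(scripted(neg))  # tensor([-3., -3., -3.]) -- correct: x - 1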
Once the graph is captured, the compiler gains a global view of the program. This enables several optimizations that are impossible in eager mode, for example:

- Operator fusion: combining adjacent operations (such as the matmul, add, and relu above) into a single kernel, eliminating intermediate memory traffic.
- Constant folding: precomputing subgraphs whose inputs are known at compile time, such as frozen weights.
- Dead code elimination: removing operations whose results are never used.
- Static memory planning: reusing buffers, since tensor lifetimes are known in advance.
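As one PyTorch-specific illustration, freezing a traced module runs several of these passes (constant folding, dead code elimination) over the captured IR. A minimal sketch, reusing model and dummy_input from above:

# Freezing requires eval mode; it inlines parameters as constants and
# lets the JIT apply graph-level passes to the captured IR.
traced = torch.jit.trace(model.eval(), dummy_input)
frozen = torch.jit.freeze(traced)
print(frozen.graph)  # weights now appear as constants in the graph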
Chart: the reduction in GPU kernel launches achieved by capturing and fusing a sequence of tensor operations, compared with standard eager execution.
Graph capture is the bridge between high-level framework code and low-level hardware implementation. By successfully tracing a model, you convert a Python-dependent program into a portable, optimizable intermediate representation. This representation serves as the input for the subsequent stages of the compilation stack, where we apply graph-level transformations and loop-level optimizations to maximize hardware efficiency.