Not every operation in deep learning models depends on input data. Many computations involve fixed parameters, hyperparameters, or configuration values that remain unchanged across every inference pass. When a compiler encounters these static subgraphs, it has an opportunity to reduce the runtime workload by performing the calculations once during compilation rather than repeating them for every input sample. This optimization technique is known as constant folding.

Constant folding analyzes the computation graph to identify operations whose inputs are all statically known values. The compiler evaluates these operations and replaces the original subgraph with a single node containing the pre-computed result. This process works in tandem with constant propagation, where the newly calculated constants are passed down to subsequent operations, potentially triggering a cascade of further folding opportunities.

The Mechanics of Static Evaluation

The primary goal of constant folding is to shrink the computation graph. In a typical neural network, constants appear in several forms:

- Model Weights and Biases: These are frozen after training.
- Hyperparameters: Values like normalization epsilon, clipping thresholds, or leaky ReLU slopes.
- Shape Arithmetic: Calculations used to determine tensor dimensions for reshape or slice operations.

Consider a scenario where a model includes a scaling factor derived from two constants. In Python, this might look like x * (a / b), where a and b are scalars defined in the model configuration. Without optimization, the execution engine performs the division a / b every time the model runs. Constant folding detects that a and b are immutable, computes the quotient c = a / b at compile time, and rewrites the graph to execute x * c.

The following diagram, given here as Graphviz source, illustrates how a compiler identifies and collapses a static subgraph.

```dot
digraph G {
    rankdir=LR;
    node [shape=box, style=filled, fontname="Helvetica", fontsize=12];
    splines=ortho;

    subgraph cluster_0 {
        label="Before Optimization";
        style=dashed; color="#adb5bd"; fontcolor="#868e96";
        node [color="#ced4da", fillcolor="#f8f9fa"];
        Input [label="Input Tensor", fillcolor="#e7f5ff", color="#74c0fc"];
        W1 [label="Const: 5.0"];
        W2 [label="Const: 2.0"];
        node [color="#bac8ff", fillcolor="#dbe4ff"];
        Add [label="Add"];
        Mul [label="Mul"];
        Input -> Mul;
        W1 -> Add;
        W2 -> Add;
        Add -> Mul;
    }

    subgraph cluster_1 {
        label="After Optimization";
        style=dashed; color="#adb5bd"; fontcolor="#868e96";
        node [color="#ced4da", fillcolor="#f8f9fa"];
        Input2 [label="Input Tensor", fillcolor="#e7f5ff", color="#74c0fc"];
        Folded [label="Folded Const: 7.0", fillcolor="#d8f5a2", color="#94d82d"];
        node [color="#bac8ff", fillcolor="#dbe4ff"];
        Mul2 [label="Mul"];
        Input2 -> Mul2;
        Folded -> Mul2;
    }
}
```

Transformation of a graph where a static addition operation is pre-calculated and replaced by a single constant node.

Identifying Foldable Nodes

To implement constant folding, the compiler must traverse the Intermediate Representation (IR) and determine which nodes are candidates for static evaluation. This is typically achieved using a dataflow analysis approach.

The compiler iterates through the graph, often in topological order. For each node, it checks the properties of its incoming edges (operands). A node is considered "foldable" if all of its operands are constant nodes. If a node is foldable, the compiler invokes an interpreter or a lightweight execution engine to compute the result.
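To make the traversal concrete, here is a minimal sketch of such a pass over a toy graph representation. The Node class, the operator table, and the in-place rewrite are assumptions made for this example, not the IR or API of any particular compiler.

```python
# Minimal constant-folding sketch over a toy graph IR (illustrative only).
class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op = op                 # e.g. "const", "input", "add", "mul"
        self.inputs = list(inputs)   # operand nodes
        self.value = value           # set only for "const" nodes

# Compile-time evaluators for the operators this toy pass understands.
EVALUATORS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
}

def fold_constants(nodes):
    """Fold every node whose operands are all constants.

    Assumes `nodes` is in topological order, so a node rewritten here is
    already a constant by the time its consumers are examined.
    """
    for node in nodes:
        if node.op in EVALUATORS and node.inputs and \
           all(inp.op == "const" for inp in node.inputs):
            result = EVALUATORS[node.op](*(inp.value for inp in node.inputs))
            # Replace the operation in place with a pre-computed constant.
            node.op, node.value, node.inputs = "const", result, []
    return nodes

# Example mirroring the diagram above: (5.0 + 2.0) * input.
c5, c2 = Node("const", value=5.0), Node("const", value=2.0)
added = Node("add", [c5, c2])
x = Node("input")
out = Node("mul", [x, added])

fold_constants([c5, c2, added, x, out])
print(added.op, added.value)   # -> const 7.0 (the Add node collapsed)
print(out.op)                  # -> mul (still depends on the runtime input)
```

Because folded nodes are rewritten in place before their consumers are visited, a single topologically ordered sweep already captures much of the cascading behavior described below; production compilers typically interleave folding with other rewrites, which is one reason an iterative or worklist-driven approach is common.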
The complexity of this interpreter varies. For simple arithmetic, the compiler can use the host CPU's native instructions; for complex tensor operations, it might need to link against a reference implementation of the operators to ensure the compile-time result matches the runtime behavior of the target hardware.

Propagation and Cascading

The power of this optimization comes from its recursive nature. When a node is folded, it effectively becomes a new constant in the graph. This new constant might be the missing piece that allows a downstream node to be folded as well.

Imagine a graph representing the calculation:

$$y = x + \sqrt{100} + 5$$

1. Initial State: The graph contains an input for $x$, the constant $100$, and the constant $5$.
2. First Pass: The compiler sees that the Sqrt node takes $100$ (a constant) as input. It computes $\sqrt{100} = 10$ and replaces the Sqrt node with a constant node holding $10$.
3. Propagation: The downstream Add node originally took the output of Sqrt and the constant $5$. Now it sees two constants: $10$ and $5$.
4. Second Pass: The Add node is now foldable. The compiler computes $10 + 5 = 15$.
5. Final State: The graph is reduced to $y = x + 15$.

This cascading effect requires the optimization pass to run iteratively or use a worklist algorithm until the graph reaches a stable state where no further reductions are possible.

Optimization in Batch Normalization

A practical and high-impact application of constant folding occurs in Batch Normalization (BatchNorm) layers during inference. During training, BatchNorm subtracts the batch mean and divides by the batch standard deviation; during inference, these statistics are frozen moving averages.

The standard inference formula for BatchNorm is:

$$y = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Here, $\mu$ (mean), $\sigma^2$ (variance), $\gamma$ (scale), and $\beta$ (shift) are all fixed parameters at inference time, and $\epsilon$ is a configuration constant.

A naive execution would perform subtraction, addition, square root, division, multiplication, and another addition at runtime. However, because $\mu$, $\sigma^2$, $\epsilon$, $\gamma$, and $\beta$ are all constants at inference time, the compiler can fold them into two distinct factors: a single multiplicative scale factor $w_{new}$ and a single bias term $b_{new}$.

$$w_{new} = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}$$

$$b_{new} = \beta - \frac{\mu \gamma}{\sqrt{\sigma^2 + \epsilon}}$$

The runtime operation simplifies to the linear transformation $y = w_{new} x + b_{new}$. If the BatchNorm layer follows a convolution layer, these new constants can often be folded directly into the weights of the convolution itself, removing the BatchNorm operator entirely from the runtime graph.
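Because the fold reduces to a handful of element-wise operations on frozen parameters, it is easy to sketch. The snippet below pre-computes $w_{new}$ and $b_{new}$ for a per-channel BatchNorm using NumPy; the fold_batchnorm name and parameter layout are assumptions chosen for illustration, not the interface of a specific compiler.

```python
import numpy as np

def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    """Collapse an inference-time BatchNorm into a per-channel scale and bias.

    All arguments are 1-D arrays of length C (one value per channel).
    Returns (w_new, b_new) such that y = w_new * x + b_new.
    """
    w_new = gamma / np.sqrt(var + eps)   # w_new = gamma / sqrt(var + eps)
    b_new = beta - mean * w_new          # b_new = beta - mu * gamma / sqrt(var + eps)
    return w_new, b_new

# Example: fold a 3-channel BatchNorm at compile time.
gamma = np.array([1.0, 0.5, 2.0])
beta  = np.array([0.0, 0.1, -0.2])
mean  = np.array([0.2, 0.0, 1.5])
var   = np.array([1.0, 4.0, 0.25])
eps   = 1e-5

w_new, b_new = fold_batchnorm(gamma, beta, mean, var, eps)

# The folded linear form matches the original normalization formula.
x = np.random.randn(8, 3)   # a batch of activations, channels last
assert np.allclose(w_new * x + b_new,
                   gamma * (x - mean) / np.sqrt(var + eps) + beta)
```

Folding one step further into a preceding convolution amounts to scaling each output channel's filter by w_new and adjusting its bias by b_new, which is why BatchNorm typically disappears entirely from optimized inference graphs.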
Shape Inference and Static Slicing

Constant folding is not limited to floating-point tensor data. It is heavily used for the integer arithmetic related to tensor shapes. Modern dynamic frameworks often generate complex subgraphs just to calculate the shape of a tensor after a Reshape or Transpose operation.

For example, flattening a tensor from [Batch, Channels, Height, Width] to [Batch, Features] involves arithmetic on the dimension sizes. If the input image size is fixed (e.g., $224 \times 224$), the compiler can pre-calculate the exact size of the flattened vector. This eliminates integer arithmetic instructions from the inference kernel and allows the memory allocator to know exactly how much buffer space is required before the model runs.

Trade-offs and Limitations

While generally beneficial, constant folding has edge cases that compiler engineers must manage.

- Code Size Bloat: If a constant calculation results in a tensor significantly larger than the original inputs (for example, applying np.tile or torch.repeat to a small constant), folding it might store a massive array in the compiled binary, increasing binary size and memory footprint. Compilers often use heuristics to prevent folding if the resulting constant exceeds a certain size threshold; a minimal version of such a check is sketched at the end of this section.
- Precision Differences: The machine performing the compilation might have different floating-point precision or rounding behavior than the target accelerator. Pre-calculating a value on an x86 CPU might yield a bitwise-different result than computing it on an embedded NPU or GPU. For numerically sensitive models, this divergence can effectively change the model's accuracy.

By aggressively identifying and resolving static graph components, constant folding serves as a cleanup phase, simplifying the structure to expose further optimization opportunities for memory layout and loop scheduling.
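To make the size heuristic mentioned under Code Size Bloat concrete, the check below estimates the byte size of a would-be folded constant and refuses to fold past a budget. The should_fold helper, the 1 MiB default, and the shape/element-size arguments are assumptions chosen for illustration, not the policy of any real compiler.

```python
import math

# Illustrative guard for a folding pass: skip folds whose results would
# bloat the compiled artifact with large embedded constants.
FOLD_SIZE_LIMIT_BYTES = 1 << 20   # example budget: 1 MiB

def should_fold(result_shape, element_size_bytes, limit=FOLD_SIZE_LIMIT_BYTES):
    """Return True if a constant of this shape and dtype is cheap enough to embed."""
    result_bytes = math.prod(result_shape) * element_size_bytes
    return result_bytes <= limit

# A 3x3 float32 constant is trivial to embed in the binary...
print(should_fold((3, 3), 4))          # True
# ...but tiling it out to 4096x4096 would embed roughly 64 MB.
print(should_fold((4096, 4096), 4))    # False
```

In a full pass, this check would run just before the in-place rewrite, using the shape and dtype inferred for the folded output.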