Executing machine learning models, particularly large ones or those deployed on devices with limited memory, demands careful management of memory resources. While dynamic memory allocation using malloc/free or similar mechanisms is common in general-purpose programming, the often static structure of ML computation graphs provides an opportunity for more aggressive, compile-time optimization: static memory planning.
Unlike runtime allocation, which reacts to memory requests as they occur, static memory planning analyzes the entire computation graph ahead of time to determine the precise lifetime of every tensor (intermediate result). Based on this analysis, the compiler can devise a memory allocation strategy that reuses memory buffers whenever possible, significantly reducing the peak memory footprint required during execution. This section details the principles and techniques behind static memory planning in ML compilers.
The foundation of static memory planning is understanding when each tensor's memory buffer is actually needed. A tensor buffer becomes live when the operation producing it completes and becomes dead after the last operation that consumes it has finished reading from it. The interval between these two points is the tensor's live range.
Consider a simple sequence:
T1 = Conv(Input)
T2 = ReLU(T1)
T3 = MaxPool(T1)
T4 = Add(T2, T3)
Here, T1 is produced by Conv and consumed by both ReLU and MaxPool. Its live range starts after Conv finishes and ends only after both ReLU and MaxPool have completed. T2 is live from after ReLU until after Add finishes. T3 is live from after MaxPool until after Add finishes. If we allocate a separate, unique buffer for Input, T1, T2, T3, and T4, the total memory required is the sum of their sizes. However, static analysis can reveal opportunities for reuse.
Compilers perform liveness analysis to formally determine these live ranges. This analysis typically traverses the computation graph (often in reverse execution order) propagating liveness information. The result identifies, for any point in the execution schedule, which tensors must reside in memory.
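To make this concrete, here is a minimal sketch that derives live ranges for the sequence above from its execution order. The schedule representation and helper name are illustrative assumptions, not taken from any particular compiler:

```python
# Minimal sketch: deriving tensor live ranges from a fixed execution schedule.
# Each entry is (output tensor, operation name, input tensors).
schedule = [
    ("T1", "Conv",    ["Input"]),
    ("T2", "ReLU",    ["T1"]),
    ("T3", "MaxPool", ["T1"]),
    ("T4", "Add",     ["T2", "T3"]),
]

def compute_live_ranges(schedule, graph_inputs=("Input",)):
    """Return {tensor: (start, end)} half-open step ranges: a tensor is live
    from the step that produces it until just after its last use."""
    ranges = {}
    for name in graph_inputs:
        ranges[name] = [0, 1]               # graph inputs are live from step 0
    for step, (output, _op, inputs) in enumerate(schedule):
        ranges[output] = [step, step + 1]   # becomes live when produced
        for name in inputs:
            ranges[name][1] = step + 1      # extended by every consumer
    return {name: tuple(r) for name, r in ranges.items()}

print(compute_live_ranges(schedule))
# {'Input': (0, 1), 'T1': (0, 3), 'T2': (1, 4), 'T3': (2, 4), 'T4': (3, 4)}
```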
Two tensors interfere if their live ranges overlap. If tensor A is still live when tensor B needs to be computed and stored, A and B interfere and cannot share the same memory buffer. This interference relationship can be visualized using an interference graph:
An example interference graph. An edge between two tensors (e.g., T1 and T2) indicates their live ranges overlap, preventing them from sharing the same memory buffer. The exact overlaps depend on the execution schedule.
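Given live ranges expressed as half-open step intervals (such as those produced by the sketch above), interference reduces to a pairwise overlap test. The following sketch, reusing the same toy sequence and an illustrative function name, builds the edge set:

```python
from itertools import combinations

def build_interference_graph(live_ranges):
    """Return {tensor: set of interfering tensors} from half-open (start, end) ranges."""
    graph = {name: set() for name in live_ranges}
    for (a, (a_start, a_end)), (b, (b_start, b_end)) in combinations(live_ranges.items(), 2):
        if a_start < b_end and b_start < a_end:   # the two ranges overlap
            graph[a].add(b)
            graph[b].add(a)
    return graph

live_ranges = {"Input": (0, 1), "T1": (0, 3), "T2": (1, 4), "T3": (2, 4), "T4": (3, 4)}
print(build_interference_graph(live_ranges))
# T1 interferes with Input, T2, and T3; T4 interferes with T2 and T3
# (which are still being read when Add runs); and so on.
```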
The goal of static memory planning is to assign tensors to physical memory buffers such that no two interfering tensors are assigned to the same buffer, while minimizing the total size of all allocated buffers. This problem is analogous to graph coloring on the interference graph:
While standard graph coloring aims to minimize the number of colors, memory allocation aims to minimize the peak memory usage. If all tensors were the same size, minimizing the number of buffers would achieve this. However, tensors vary greatly in size. Therefore, algorithms must minimize the maximum memory concurrently allocated across all execution points. This makes the problem more complex than standard graph coloring, though the interference graph remains a central concept.
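The objective can be stated concretely: at any point in the schedule, the memory in use is the sum of the sizes of all tensors live at that point, and the planner tries to minimize the maximum of that sum over the whole schedule. A small sketch of this lower bound, with made-up tensor names and sizes:

```python
def peak_concurrent_bytes(live_ranges, sizes):
    """Maximum, over all points in time, of the total size of tensors whose
    half-open live ranges [start, end) cover that point."""
    events = sorted({t for start, end in live_ranges.values() for t in (start, end)})
    peak = 0
    for t in events:
        live_now = sum(sizes[name]
                       for name, (start, end) in live_ranges.items()
                       if start <= t < end)
        peak = max(peak, live_now)
    return peak

ranges = {"A": (0, 4), "B": (2, 6), "C": (5, 8)}   # hypothetical live ranges
sizes  = {"A": 100, "B": 50, "C": 80}              # hypothetical sizes in KB
print(peak_concurrent_bytes(ranges, sizes))        # 150: A and B overlap on [2, 4)
```

No allocation strategy can do better than this bound, since all tensors live at the same time must occupy distinct memory.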
Because finding the absolute optimal memory allocation is NP-hard, compilers employ heuristic algorithms. Common strategies include:

Greedy Allocation (e.g., First-Fit): process tensors in a chosen order (for example, by the start of their live ranges) and place each one into the first existing buffer whose occupants do not interfere with it, creating a new buffer only when no existing buffer can accept it.

Offset-Based Allocation: treat memory as a single contiguous arena and assign each tensor a byte offset within it, so that interfering tensors never occupy overlapping byte ranges, while minimizing the total arena size (max(offset + size) for all tensors). This is often solved using heuristics similar to greedy approaches but operating on offsets; a sketch of this idea, applied to the example below, appears at the end of this section.

Consider tensors T1 (100KB), T2 (50KB), T3 (80KB), and T4 (120KB) with the following live ranges (simplified time units):
T1: [0, 10)
T2: [2, 10)
T3: [3, 8)
T4: [10, 15)

Interference:
T1 interferes with T2 and T3.
T2 interferes with T1 and T3.
T3 interferes with T1 and T2.
T4 does not interfere with T1, T2, or T3.

A naive allocation requires 100 + 50 + 80 + 120 = 350KB.
Using static planning (e.g., offset-based), T1, T2, and T3 interfere with one another and must occupy disjoint regions, for example at offsets 0, 100, and 150 within a single arena. T4 interferes with nothing, so once the other tensors are dead it can be placed at offset 0, reusing the space previously occupied by T1 and part of T2's region. The optimized peak memory is 230KB (100 + 50 + 80).
Comparison of peak memory usage between a naive strategy (allocating unique buffers) and static planning with memory reuse for the example scenario.
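To show where the 230KB figure comes from, here is a minimal sketch of an offset-based planner for this example. The function name, the first-fit-style bumping, and the processing order are illustrative choices, not the scheme of any particular compiler:

```python
def plan_offsets(sizes, live_ranges):
    """Assign each tensor a byte offset in one shared arena so that tensors whose
    half-open live ranges overlap never occupy overlapping byte ranges."""
    def interferes(a, b):
        (a_start, a_end), (b_start, b_end) = live_ranges[a], live_ranges[b]
        return a_start < b_end and b_start < a_end

    offsets = {}
    # Process tensors in order of live-range start; largest-first is another common choice.
    for name in sorted(sizes, key=lambda n: live_ranges[n][0]):
        candidate = 0
        while True:
            clash = next((other for other in offsets
                          if interferes(name, other)
                          and candidate < offsets[other] + sizes[other]
                          and offsets[other] < candidate + sizes[name]), None)
            if clash is None:
                break
            candidate = offsets[clash] + sizes[clash]   # bump past the conflict and retry
        offsets[name] = candidate
    return offsets

sizes = {"T1": 100, "T2": 50, "T3": 80, "T4": 120}                         # KB
live_ranges = {"T1": (0, 10), "T2": (2, 10), "T3": (3, 8), "T4": (10, 15)}

offsets = plan_offsets(sizes, live_ranges)
print(offsets)                                            # {'T1': 0, 'T2': 100, 'T3': 150, 'T4': 0}
print(max(offsets[t] + sizes[t] for t in sizes), "KB")    # 230 KB, versus 350 KB for unique buffers
```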
Static memory planning is a vital graph-level optimization in ML compilers. By analyzing tensor lifetimes and carefully orchestrating buffer reuse through techniques grounded in liveness analysis and interference graphs, compilers can drastically reduce the peak memory consumption of ML models. This enables the deployment of larger, more complex models on hardware with limited memory resources and can sometimes improve performance by increasing the likelihood that reused buffers remain in faster cache levels.