Executing machine learning models, particularly large ones or those deployed on devices with limited memory, demands careful management of memory resources. While dynamic memory allocation using malloc/free or similar mechanisms is common in general-purpose programming, the often static structure of ML computation graphs provides an opportunity for more aggressive, compile-time optimization: static memory planning.

Unlike runtime allocation, which reacts to memory requests as they occur, static memory planning analyzes the entire computation graph ahead of time to determine the precise lifetime of every tensor (intermediate result). Based on this analysis, the compiler can devise a memory allocation strategy that reuses memory buffers whenever possible, significantly reducing the peak memory footprint required during execution. This section details the principles and techniques behind static memory planning in ML compilers.

### Understanding Tensor Lifetimes

The foundation of static memory planning is understanding when each tensor's memory buffer is actually needed. A tensor buffer becomes live when the operation producing it completes and becomes dead after the last operation that consumes it has finished reading from it. The interval between these two points is the tensor's live range.

Consider a simple sequence:

```
T1 = Conv(Input)
T2 = ReLU(T1)
T3 = MaxPool(T1)
T4 = Add(T2, T3)
```

Here, T1 is produced by Conv and consumed by both ReLU and MaxPool. Its live range starts after Conv finishes and ends only after both ReLU and MaxPool have completed. T2 is live from after ReLU until after Add finishes; T3 is live from after MaxPool until after Add finishes. If we allocate a separate, unique buffer for Input, T1, T2, T3, and T4, the total memory required is the sum of their sizes. However, static analysis can reveal opportunities for reuse.

### Liveness Analysis and Interference

Compilers perform liveness analysis to formally determine these live ranges. This analysis typically traverses the computation graph (often in reverse execution order), propagating liveness information. The result identifies, for any point in the execution schedule, which tensors must reside in memory.

Two tensors interfere if their live ranges overlap. If tensor A is still live when tensor B needs to be computed and stored, A and B interfere and cannot share the same memory buffer. This interference relationship can be visualized using an interference graph:

- **Nodes:** represent the tensors (or the memory buffers required for them).
- **Edges:** connect two nodes if the corresponding tensors interfere (their live ranges overlap).

```dot
graph Interference {
    rankdir=LR;
    node [shape=circle, style=filled, fillcolor="#a5d8ff"];
    edge [color="#495057"];
    T1; T2; T3; T4; Input;
    T1 -- T2 [label="overlap"];
    T1 -- T3 [label="overlap"];
    T2 -- T3 [label="overlap"];
    T2 -- T4 [label="overlap"];
    T3 -- T4 [label="overlap"];
    Input -- T1 [label="overlap"];
    Input -- T2 [label="overlap"];
    Input -- T3 [label="overlap"];
}
```

*An example interference graph. An edge between two tensors (e.g., T1 and T2) indicates their live ranges overlap, preventing them from sharing the same memory buffer. The exact overlaps depend on the execution schedule.*
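To make liveness analysis concrete, the sketch below derives live ranges and interference edges for the four-operation example, assuming a fixed sequential schedule. It is a minimal illustration rather than a compiler pass: the schedule encoding, the `compute_live_ranges` helper, and the convention that a tensor is live from the step that produces it through the last step that reads it are simplifications chosen for this example, so the exact edge set it prints (in particular, how long `Input` stays live) may differ slightly from the figure above.

```python
from itertools import combinations

# A sequential schedule for the example graph: each entry is
# (op name, input tensors, output tensor).
schedule = [
    ("Conv",    ["Input"],    "T1"),
    ("ReLU",    ["T1"],       "T2"),
    ("MaxPool", ["T1"],       "T3"),
    ("Add",     ["T2", "T3"], "T4"),
]

def compute_live_ranges(schedule):
    """Map each tensor to (first step it exists, last step it is read or written)."""
    ranges = {}
    for step, (_, inputs, output) in enumerate(schedule):
        # The output buffer must exist at the step that produces it.
        ranges.setdefault(output, [step, step])
        for name in inputs:
            # Graph inputs (never produced by an op) are treated as live from step 0.
            ranges.setdefault(name, [0, step])
            # Extend the range to cover this read.
            ranges[name][1] = step
    return {t: tuple(r) for t, r in ranges.items()}

def interferes(a, b, ranges):
    """Two tensors interfere if their live ranges overlap."""
    (a_start, a_end), (b_start, b_end) = ranges[a], ranges[b]
    return a_start <= b_end and b_start <= a_end

ranges = compute_live_ranges(schedule)
print(ranges)
for a, b in combinations(sorted(ranges), 2):
    if interferes(a, b, ranges):
        print(f"{a} -- {b}")
```

Running this prints the live range of each tensor followed by the interference edges (e.g., `T1 -- T2`, `T2 -- T4`), which is exactly the information a planner needs before assigning buffers.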
### Memory Allocation via Sharing

The goal of static memory planning is to assign tensors to physical memory buffers such that no two interfering tensors share the same buffer, while minimizing the total size of all allocated buffers. This problem is analogous to graph coloring on the interference graph:

- Assign a "color" (representing a distinct memory buffer) to each node (tensor).
- **Constraint:** nodes connected by an edge (interfering tensors) must receive different colors.
- **Objective:** minimize the "cost" associated with the colors used.

While standard graph coloring aims to minimize the number of colors, memory allocation aims to minimize peak memory usage. If all tensors were the same size, minimizing the number of buffers would achieve this. However, tensors vary greatly in size, so algorithms must instead minimize the maximum memory concurrently allocated across all execution points. This makes the problem more complex than standard graph coloring, though the interference graph remains a central concept.

Because finding the optimal memory allocation is NP-hard, compilers employ heuristic algorithms. Common strategies include:

**Greedy allocation (e.g., First-Fit):**

1. Process tensors in a specific order (e.g., sorted by live-range start time, end time, or size).
2. For each tensor, iterate through the already allocated buffers. Assign the tensor to the first buffer that is large enough and currently available (i.e., does not hold any interfering tensor during this tensor's live range).
3. If no suitable buffer is found, allocate a new buffer.

Variations such as "Best-Fit" (choosing the smallest sufficient buffer) or "Worst-Fit" exist, trading off computational complexity against potential fragmentation reduction.

**Offset-based allocation:**

1. Instead of distinct logical buffers, maintain a single (or a few) large memory pool(s).
2. Calculate the required pool size based on an estimated peak usage.
3. Assign each tensor a specific byte offset within the pool.

The core challenge remains finding non-overlapping offset assignments for interfering tensors while minimizing the total pool size (max(offset + size) over all tensors). This is often solved with heuristics similar to the greedy approaches, but operating on offsets; a sketch of such an allocator follows the worked example below.

### Calculating Peak Memory and Offsets

Consider tensors T1 (100 KB), T2 (50 KB), T3 (80 KB), and T4 (100 KB) with the following live ranges (simplified time units):

- T1: [0, 10)
- T2: [2, 12)
- T3: [3, 8)
- T4: [10, 15)

Interference:

- T1 interferes with T2 and T3.
- T2 interferes with T1, T3, and T4.
- T3 interferes with T1 and T2.
- T4 interferes only with T2 (their ranges overlap on [10, 12)); it does not interfere with T1 or T3.

A naive allocation requires 100 + 50 + 80 + 100 = 330 KB. Using static planning (e.g., offset-based):

- Time [0, 2): Live: {T1}. Memory: 100 KB. Allocate T1 at offset 0. Pool size: 100 KB.
- Time [2, 3): Live: {T1, T2}. Memory: 100 + 50 = 150 KB. T1 is at offset 0; allocate T2 at offset 100. Pool size: 150 KB.
- Time [3, 8): Live: {T1, T2, T3}. Memory: 100 + 50 + 80 = 230 KB. T1 at 0, T2 at 100; allocate T3 at offset 150. Pool size: 230 KB (peak).
- Time [8, 10): Live: {T1, T2}. Memory: 150 KB. T3's space [150, 230) is now free.
- Time [10, 12): Live: {T2, T4}. Memory: 50 + 100 = 150 KB. T1's space [0, 100) is free; T4 needs 100 KB and can reuse T1's space (offset 0) for a perfect fit. Pool size remains 230 KB.
- Time [12, 15): Live: {T4}. Memory: 100 KB. T2's space [100, 150) is free; T4 remains at offset 0. Pool size remains 230 KB.

The optimized peak memory is 230 KB, a significant reduction from the naive 330 KB.
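The walkthrough above can be reproduced with a small first-fit allocator that assigns offsets directly. The sketch below is a simplified illustration of the offset-based strategy, not a production planner: it assumes sizes in KB, half-open live ranges, and no alignment constraints, and the helper names (`greedy_offsets`, `overlaps`) are purely illustrative.

```python
def overlaps(a, b):
    """Half-open interval overlap test, used for both live ranges and address ranges."""
    return a[0] < b[1] and b[0] < a[1]

def greedy_offsets(tensors):
    """First-fit offset assignment. `tensors` is a list of (name, size, (start, end))."""
    placed = []   # (name, size, live_range, offset)
    result = {}
    for name, size, live in sorted(tensors, key=lambda t: t[2][0]):
        # Only tensors whose live ranges overlap this one constrain its placement.
        conflicts = sorted(
            (off, off + sz) for (_, sz, lr, off) in placed if overlaps(lr, live)
        )
        offset = 0
        for lo, hi in conflicts:
            if offset + size <= lo:
                break                     # the gap below this conflict is big enough
            offset = max(offset, hi)      # otherwise skip past the conflicting region
        placed.append((name, size, live, offset))
        result[name] = offset
    pool = max(off + sz for (_, sz, _, off) in placed)
    return result, pool

tensors = [
    ("T1", 100, (0, 10)),
    ("T2",  50, (2, 12)),
    ("T3",  80, (3, 8)),
    ("T4", 100, (10, 15)),
]
offsets, pool = greedy_offsets(tensors)
print(offsets)                  # {'T1': 0, 'T2': 100, 'T3': 150, 'T4': 0}
print(f"Pool size: {pool} KB")  # Pool size: 230 KB
```

Processing tensors in order of live-range start reproduces the hand-derived placement: T4 drops into T1's freed slot at offset 0, and the pool peaks at 230 KB.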
Static Planning", "xaxis": {"title": "Allocation Strategy"}, "yaxis": {"title": "Peak Memory (KB)"}, "barmode": "group"}, "data": [{"type": "bar", "name": "Naive Allocation", "x": ["Peak Memory"], "y": [330], "marker": {"color": "#f03e3e"}}, {"type": "bar", "name": "Static Planning", "x": ["Peak Memory"], "y": [230], "marker": {"color": "#1c7ed6"}}]}Comparison of peak memory usage between a naive strategy (allocating unique buffers) and static planning with memory reuse for the example scenario.Practical NotesAlignment: Hardware often requires memory accesses to be aligned to specific boundaries (e.g., 32, 64, or 128 bytes). The memory planner must ensure that the assigned offset for each tensor respects its required alignment, potentially leaving small unused gaps (internal fragmentation).Interaction with Fusion: Operator fusion reduces the number of intermediate tensors, simplifying the interference graph and often significantly reducing peak memory requirements even before explicit planning. Memory planning typically runs after fusion.Control Flow: Graphs with conditional branches or loops complicate static liveness analysis, as the exact execution path may not be known at compile time. Conservative analysis (assuming a tensor is live if it's live on any possible path) is often used, potentially overestimating memory needs. More advanced techniques might involve planning for specific common paths or using runtime information.Recomputation vs. Storage: In extremely memory-constrained situations, it might be better to recompute a tensor later rather than keeping it live in memory. This represents a trade-off between computation and memory, usually handled by a separate optimization pass.Static memory planning is a critical graph-level optimization in ML compilers. By analyzing tensor lifetimes and carefully orchestrating buffer reuse through techniques grounded in liveness analysis and interference graphs, compilers can drastically reduce the peak memory consumption of ML models. This enables the deployment of larger, more complex models on hardware with limited memory resources and can sometimes improve performance by increasing the likelihood that reused buffers remain in faster cache levels.