An Out of Memory (OOM) error is often interpreted as a capacity problem: the model parameters, gradients, and optimizer states simply exceed the available VRAM. However, in high-performance distributed training, a significant class of OOM errors occurs even when the GPU reports gigabytes of free memory. This phenomenon is memory fragmentation.

When training with FSDP, the system aggressively allocates and frees memory. As shards are gathered via AllGather for computation and subsequently released, the PyTorch caching allocator manages a highly dynamic memory heap. Over thousands of iterations, the memory space can become non-contiguous, resembling a block of Swiss cheese. When the allocator attempts to reserve a contiguous block for a large tensor, such as a full set of billion-parameter gradients, it may fail to find a single gap large enough, even though the total free memory is sufficient.

## The Caching Allocator Mechanism

To understand fragmentation, we must look at the c10::cuda::CUDACachingAllocator. PyTorch avoids the high latency of native cudaMalloc and cudaFree calls by maintaining its own cache of GPU memory. When a tensor is freed, the memory is not returned to the CUDA driver; instead, it is marked as available within the PyTorch cache.

If a subsequent allocation request is smaller than an available cached block, the allocator may split a large block into two smaller ones: one for the immediate request and one remaining as a free "splinter." Over time, these splinters accumulate. FSDP is particularly prone to this because it deals with variable-sized allocations: small shards for storage and large, fully materialized layers for computation.

```dot
digraph MemoryFragmentation {
  rankdir=TB;
  node [shape=record, fontname="Sans-Serif", style=filled, color="#dee2e6"];
  edge [color="#868e96"];
  bgcolor="transparent";
  subgraph cluster_0 {
    label="GPU Memory Heap State";
    fontname="Sans-Serif";
    color="#adb5bd";
    struct1 [label="<f0> 200MB Alloc|<f1> 50MB Free|<f2> 300MB Alloc|<f3> 50MB Free|<f4> 100MB Alloc", fillcolor="#e9ecef"];
  }
  request [label="Request: 80MB Contiguous Tensor", shape=box, fillcolor="#74c0fc", fontcolor="white"];
  request -> struct1:f1 [label="Too Large", color="#fa5252", fontcolor="#fa5252"];
  request -> struct1:f3 [label="Too Large", color="#fa5252", fontcolor="#fa5252"];
  note [label="Total Free: 100MB\nMax Contiguous: 50MB\nResult: OOM", shape=note, fillcolor="#ffec99"];
  struct1 -> note [style=dotted];
}
```

A visualization of a fragmented heap: total free memory (100MB) exceeds the request size (80MB), yet the allocation fails because no single free gap is large enough.

## Diagnosing Fragmentation

The standard nvidia-smi tool is insufficient for this analysis: it only reports the memory the PyTorch process has reserved from the CUDA driver (which, because the cache is rarely returned, behaves like a high-water mark), not the internal fragmentation state. To diagnose fragmentation, you need to query the allocator directly.

The primary metric is the gap between reserved_memory and allocated_memory. If reserved_memory is close to the GPU capacity but allocated_memory is significantly lower (e.g., only 60-70% of it), fragmentation is the likely culprit. We can quantify this with a fragmentation ratio:

$$ \text{Fragmentation Ratio} = 1 - \frac{\text{allocated\_bytes}}{\text{reserved\_bytes}} $$

A more precise diagnostic approach is to capture a memory snapshot with torch.cuda.memory._dump_snapshot("snapshot.pickle").
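As a minimal sketch of that workflow, the snippet below reads the allocator's counters, computes the fragmentation ratio, and dumps a snapshot when the ratio looks suspicious. The 0.25 threshold and the function name are illustrative assumptions, and _record_memory_history / _dump_snapshot are private, underscore-prefixed APIs (present in recent PyTorch releases) whose signatures may change between versions.

```python
import torch

def report_fragmentation(device: int = 0) -> float:
    """Return the fragmentation ratio: 1 - allocated / reserved."""
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    if reserved == 0:
        return 0.0
    ratio = 1.0 - allocated / reserved
    print(
        f"allocated={allocated / 2**30:.2f} GiB, "
        f"reserved={reserved / 2**30:.2f} GiB, "
        f"fragmentation={ratio:.1%}"
    )
    return ratio

# Start recording allocator events early in training
# (private API; this keyword form is from PyTorch >= 2.1).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run a few training steps, then inspect the heap ...
if report_fragmentation() > 0.25:  # 0.25 is an illustrative threshold, not a rule
    # Serialize the allocator state for offline visualization.
    torch.cuda.memory._dump_snapshot("snapshot.pickle")
```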
This generates a serialized dump of the allocator's state, which can be visualized to see exactly where the "holes" in memory are located.

In a healthy FSDP run, you expect to see a sawtooth pattern in memory usage: memory spikes during the forward pass as layers are gathered and drops as they are released. If the "troughs" (minimum memory usage) gradually rise over time, or if the peak reserved memory grows without corresponding growth in allocated memory, the allocator is failing to recombine split blocks.

```json
{"layout": {"title": "Memory Profile: Healthy vs. Fragmented FSDP Run", "xaxis": {"title": "Training Steps", "showgrid": false}, "yaxis": {"title": "Memory (GB)", "showgrid": true}, "plot_bgcolor": "#f8f9fa", "paper_bgcolor": "#ffffff", "legend": {"orientation": "h", "y": -0.2}},
 "data": [
  {"x": [1, 2, 3, 4, 5, 6, 7, 8], "y": [20, 60, 20, 60, 20, 60, 20, 60], "type": "scatter", "mode": "lines", "name": "Allocated (Healthy)", "line": {"color": "#228be6"}},
  {"x": [1, 2, 3, 4, 5, 6, 7, 8], "y": [25, 70, 28, 72, 32, 75, 35, 78], "type": "scatter", "mode": "lines", "name": "Reserved (Fragmented)", "line": {"color": "#fa5252", "dash": "dot"}},
  {"x": [1, 2, 3, 4, 5, 6, 7, 8], "y": [22, 62, 22, 62, 22, 62, 22, 62], "type": "scatter", "mode": "lines", "name": "Reserved (Ideal)", "line": {"color": "#40c057", "dash": "dash"}}
 ]}
```

The divergence between the allocated memory (blue) and the reserved memory (red) indicates increasing fragmentation, whereas an ideal setup (green) keeps reserved memory tight to the actual allocation requirements.

## Tuning the Allocator

Once fragmentation is identified, the most effective lever is the environment variable PYTORCH_CUDA_ALLOC_CONF, which controls the behavior of the caching allocator.

The most important parameter is max_split_size_mb. By default, the allocator will split any cached block to satisfy a request. Setting max_split_size_mb prohibits the allocator from splitting blocks larger than this threshold: if a request needs only a small chunk and the only available cached blocks exceed max_split_size_mb, the allocator leaves those large blocks intact and instead requests a new allocation from the CUDA driver. This strategy preserves large contiguous blocks for the massive all-gather operations required by FSDP.

For a model with hidden sizes in the range of 4096 to 8192, a starting configuration usually looks like this:

```bash
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
```

This tells PyTorch: "If you have a cached block larger than 512MB, do not chop it up for small variable-sized requests." When FSDP needs to materialize a large layer, a 512MB+ contiguous block is then far more likely to be available.

## Handling Garbage Collection

Another useful parameter is garbage_collection_threshold. In some edge cases, the allocator does not reclaim unused cached blocks aggressively enough during the backward pass of a complex graph. Setting a threshold ratio ensures that fragmentation arising from temporary buffers is cleaned up before critical allocation spikes.

For example, garbage_collection_threshold:0.8 triggers proactive reclamation of unused cached blocks once memory usage reaches 80% of capacity, rather than waiting for an allocation to fail. While this adds a minor CPU overhead for memory management, it often prevents the fatal OOM spikes seen in the middle of an epoch.
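Both options can be combined in a single comma-separated value. A sketch of how this might look in a launch script is shown below; the specific values (512 and 0.8), the torchrun arguments, and the train_fsdp.py script name are illustrative assumptions, not recommendations from PyTorch itself.

```bash
# Combined allocator tuning for an FSDP run.
# The values below are starting points; tune them against your own
# memory snapshots rather than treating them as defaults.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,garbage_collection_threshold:0.8

# Hypothetical launch command: 8 GPUs on one node, script name assumed.
torchrun --nproc_per_node=8 train_fsdp.py
```

Because the allocator reads this variable when it initializes, set it in the environment before the training process starts rather than from inside an already-running job.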
Analyzing memory fragmentation transforms the OOM error from a hard stop into an optimization problem. By aligning the allocator's splitting logic with the allocation patterns of your specific FSDP configuration, you can reclaim gigabytes of effective VRAM without altering the model architecture.