Performance regression in distributed systems is rarely caused by a single catastrophic failure. Instead, it emerges from microsecond-scale delays in kernel scheduling, inefficient memory allocation patterns, or subtle synchronization barriers between the host CPU and the GPU device. In this case study, we analyze a training run of a 7B-parameter transformer model distributed across 4 nodes (32 A100 GPUs) that is functionally correct but achieves only 28% Model FLOPs Utilization (MFU). The objective is to diagnose the bottleneck using the PyTorch Profiler, identify the specific resource contention, and implement configuration changes that approach the 45-50% MFU range typical of well-tuned FSDP setups.

## Phase 1: Baseline Trace Acquisition

Before applying optimizations, we must establish a baseline performance profile. Relying solely on iterations per second is insufficient because it aggregates computation, communication, and data loading into a single scalar. We require a timeline view of the execution.

We instrument the training loop to capture a few steps after a warmup phase. Capturing the very first steps is often misleading due to compilation overhead and allocator initialization.

```python
import torch.profiler

# Context manager setup for profiling
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log_dir'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)
        prof.step()
```

When we open the generated trace in the Chrome Trace Viewer, the timeline reveals distinct gaps between GPU kernels. In an ideal FSDP execution, the GPU compute stream remains saturated with GEMM (General Matrix Multiplication) operations while NCCL communication proceeds on a separate CUDA stream.

The baseline trace shows the following pattern:

1. Compute stream: GEMM for layer N executes.
2. Gap: the GPU sits idle for 15 ms.
3. Compute stream: GEMM for layer N+1 executes.

This gap indicates that the parameters for layer N+1 were not available when layer N completed. The computation was blocked waiting for the AllGather collective to finish. This is a failure of communication-computation overlap.

## Phase 2: Diagnosing Overlap Failures

FSDP attempts to hide communication latency by prefetching the next set of sharded parameters while the current layer computes. When we observe gaps, it implies either that the prefetch was triggered too late or that the network bandwidth was insufficient to complete the transfer in the time the computation took.
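The prefetching behavior under discussion is configured when the model is wrapped. The following is a minimal sketch of how such a wrapping might look with the FSDP1 API (`torch.distributed.fsdp.FullyShardedDataParallel`); it is illustrative rather than the exact configuration of this run, and `model` and `TransformerBlock` are placeholders for the actual module and its per-layer class.

```python
import functools

import torch
from torch.distributed.fsdp import (
    BackwardPrefetch,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Wrap each transformer block as its own FSDP unit; `TransformerBlock` stands in
# for the model's decoder-layer class.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},
)

model = FSDP(
    model,  # the unwrapped 7B module (placeholder)
    auto_wrap_policy=wrap_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # prefetch the next block's params during backward
    limit_all_gathers=True,  # rate-limits concurrent AllGathers; revisited in Phase 3
    device_id=torch.cuda.current_device(),
)
```

With per-block wrapping, each AllGather covers one transformer block, which is what makes layer-by-layer overlap between the compute and NCCL streams possible in the first place.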
We can visualize the difference between serial execution and optimal overlap. The following diagram details the stream interactions we aim to achieve versus what the baseline trace exposes.

```dot
digraph G {
  rankdir=TB;
  node [style=filled, fontname="Helvetica", shape=box, color="#dee2e6"];
  edge [fontname="Helvetica", color="#868e96"];

  subgraph cluster_serial {
    label = "Baseline: Serial Execution (Blocking)";
    style=filled;
    color="#f8f9fa";
    node [fillcolor="#ffc9c9"];

    start_s [label="Start Step", shape=circle, width=0.5, fixedsize=true];
    fetch_n [label="AllGather Layer N\n(Communication)"];
    compute_n [label="Forward Layer N\n(Computation)", fillcolor="#a5d8ff"];
    wait_n [label="Stream Sync\n(Idle Gap)", fillcolor="#dee2e6", style=dashed];
    fetch_n1 [label="AllGather Layer N+1\n(Communication)"];
    compute_n1 [label="Forward Layer N+1\n(Computation)", fillcolor="#a5d8ff"];

    start_s -> fetch_n;
    fetch_n -> compute_n;
    compute_n -> wait_n;
    wait_n -> fetch_n1;
    fetch_n1 -> compute_n1;
  }

  subgraph cluster_overlap {
    label = "Target: Overlapped Execution";
    style=filled;
    color="#f8f9fa";

    start_o [label="Start Step", shape=circle, width=0.5, fixedsize=true];

    subgraph cluster_stream1 {
      label = "Compute Stream";
      color="#e9ecef";
      node [fillcolor="#a5d8ff"];
      c_n [label="Forward Layer N"];
      c_n1 [label="Forward Layer N+1"];
    }

    subgraph cluster_stream2 {
      label = "NCCL Stream";
      color="#e9ecef";
      node [fillcolor="#ffc9c9"];
      g_n [label="AllGather Layer N"];
      g_n1 [label="AllGather Layer N+1"];
    }

    start_o -> g_n;
    g_n -> c_n [constraint=false];
    c_n -> g_n1 [style=dotted, label="Prefetch Trigger"];
    g_n1 -> c_n1 [constraint=false];
  }
}
```

*The diagram compares the serialized execution flow observed in the baseline against the target parallel-stream execution, where communication happens concurrently with computation.*

The trace analysis confirms that while `backward_prefetch=BackwardPrefetch.BACKWARD_PRE` was enabled, the `limit_all_gathers` setting was too restrictive. FSDP limits the number of in-flight AllGathers to prevent out-of-memory (OOM) errors. When the limit is set to `True` (the default in some versions), FSDP waits for the current block to finish its computation and release memory before fetching the next block. This saves memory but enforces serialization.

## Phase 3: Tuning Throughput and Memory

To close the gap, we must relax the memory constraints so the network scheduler can fetch ahead. However, simply enabling aggressive prefetching can lead to memory fragmentation: the PyTorch caching allocator may fail to find a contiguous block for the incoming parameters if the heap is fragmented, triggering expensive `cudaFree` and `cudaMalloc` calls.

In the trace, scan the "Memory" timeline. If you see frequent vertical spikes in allocated memory coinciding with short CPU-side gaps, the allocator is thrashing.

**Optimization actions** (sketched in code below):

1. **Adjust the FSDP strategy:** explicitly set `limit_all_gathers=False` if VRAM permits. This allows the next layer's parameters to be materialized in memory before the current layer finishes.
2. **Tune the allocator:** set `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512`. This prevents the allocator from splitting large blocks into unusable small fragments, reducing memory-management overhead on the critical path.
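Taken together, the two actions amount to roughly the following sketch rather than an exact production script; `model` and `wrap_policy` carry over as placeholders from the wrapping sketch in Phase 2, and the allocator setting is normally exported in the launch command rather than set from Python.

```python
import os

import torch
from torch.distributed.fsdp import BackwardPrefetch, FullyShardedDataParallel as FSDP

# Allocator tuning: this must take effect before the first CUDA allocation, so in
# practice it is exported in the launch script rather than set here.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

model = FSDP(
    model,  # unwrapped module; `model` and `wrap_policy` are placeholders from the Phase 2 sketch
    auto_wrap_policy=wrap_policy,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    limit_all_gathers=False,  # let AllGathers run ahead, trading VRAM headroom for overlap
    device_id=torch.cuda.current_device(),
)
```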
We apply these changes and re-profile.

## Phase 4: Calculating MFU and Final Verification

After applying the configuration changes, we execute the training run again and recalculate the Model FLOPs Utilization (MFU) to verify the improvement. The formula for MFU depends on the achieved throughput (tokens/sec) and the model architecture. For a transformer model, the approximate FLOPs per token is given by:

$$ \text{FLOPs}_{\text{per token}} \approx 6 \times P $$

where $P$ is the number of parameters. For a 7B model ($P = 7 \times 10^9$):

$$ \text{FLOPs}_{\text{per token}} \approx 42 \times 10^9 $$

If our optimized run achieves a throughput of 2200 tokens/sec/GPU, the calculation follows.

Achieved TFLOPS:

$$ \text{Achieved} = \frac{2200 \times 42 \times 10^9}{10^{12}} \approx 92.4 \ \text{TFLOPS} $$

MFU, using an NVIDIA A100 (SXM4) with a peak BF16 tensor-core performance of approximately 312 TFLOPS (without sparsity):

$$ \text{MFU} = \frac{92.4}{312} \approx 29.6\% $$

Note: while 29.6% is an improvement over the baseline, reaching 50% often requires activation checkpointing (to relieve activation memory pressure) and kernel fusion (via `torch.compile`).

The chart below illustrates the performance progression from the untuned baseline through the optimization steps.

```json
{
  "layout": {
    "title": "Impact of Optimizations on Throughput (A100-80GB)",
    "xaxis": { "title": "Optimization Stage" },
    "yaxis": { "title": "Throughput (Tokens/sec/GPU)" },
    "barmode": "group",
    "template": "simple_white",
    "width": 600,
    "height": 400
  },
  "data": [
    {
      "type": "bar",
      "x": ["Baseline (Serial)", "Overlap Tuned", "Mem + Compile"],
      "y": [1850, 2400, 3100],
      "marker": { "color": ["#adb5bd", "#4dabf7", "#228be6"] },
      "text": ["1850", "2400", "3100"],
      "textposition": "auto"
    }
  ]
}
```

*Improvements in throughput measured across three stages: the initial baseline, after fixing communication overlap, and after resolving memory fragmentation and enabling compilation.*

## Summary of Findings

In this case study, the initial low MFU was not caused by slow compute kernels but by the idle time imposed by serialized communication. By using the profiler to inspect the timeline, we identified that the GPU was starved of data. Adjusting the prefetching policy allowed the NCCL stream to operate concurrently with the compute stream, recovering the lost cycles. Finally, stabilizing the memory allocator ensured that this overlap was not interrupted by memory-management overhead. These adjustments are specific to the relationship between model size and available VRAM; larger models may require re-enabling `limit_all_gathers` and accepting a slight performance penalty to avoid OOM.
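As a closing reference, the MFU arithmetic from Phase 4 can be reproduced with a short helper. This is a sketch: the $6 \times P$ FLOPs-per-token approximation and the 312 TFLOPS A100 BF16 peak are the figures used above, and `model_flops_utilization` is a hypothetical name, not a library function.

```python
def model_flops_utilization(
    tokens_per_sec_per_gpu: float,
    num_params: float,
    peak_tflops: float = 312.0,  # A100 SXM4 BF16 dense peak, as used in Phase 4
) -> float:
    """Approximate MFU using the 6 * P FLOPs-per-token rule of thumb."""
    achieved_tflops = tokens_per_sec_per_gpu * 6 * num_params / 1e12
    return achieved_tflops / peak_tflops


# Reproduces the worked example: 2200 tok/s/GPU on a 7B model -> ~29.6% MFU.
print(f"{model_flops_utilization(2200, 7e9):.1%}")
```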