The network interconnect often dictates the upper bound of throughput in high-performance distributed training. While modern GPUs possess massive arithmetic capability, that potential remains untapped if the device spends significant cycles waiting for data synchronization. Minimizing "exposed communication" is the primary objective of performance engineering in Fully Sharded Data Parallel (FSDP) training.
Exposed communication occurs when the GPU's compute engines are idle while waiting for the network to complete a collective operation, such as AllGather or ReduceScatter. Ideally, these network operations should execute in the background, fully masked by the computation of the previous or subsequent layer. Achieving this requires a precise orchestration of CUDA streams and careful tuning of FSDP's prefetching policies.
PyTorch manages concurrency on NVIDIA GPUs using CUDA streams. A stream is a sequence of operations that execute in order, but operations in different streams can run concurrently. In a standard FSDP configuration, there are at least two critical streams active:

- The Compute stream (the default stream), which runs the forward and backward kernels: GEMMs, element-wise operations, and so on.
- The NCCL (communication) stream, which runs collective kernels such as AllGather and ReduceScatter.
For effective overlap, the scheduler must dispatch a network kernel on the NCCL stream immediately after the dependencies for that operation are resolved, without blocking the Compute stream.
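The mechanics are easiest to see with plain CUDA streams, before any FSDP machinery is involved. Below is a minimal sketch, assuming a CUDA-capable device; the side stream stands in for the NCCL stream, and an asynchronous copy stands in for an AllGather.

```python
import torch

# Minimal sketch of stream-level overlap (assumes a CUDA-capable device).
compute_stream = torch.cuda.current_stream()  # default stream: GEMMs, element-wise ops
comm_stream = torch.cuda.Stream()             # stand-in for the NCCL communication stream

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
staging = torch.empty_like(a)

# The side stream must not start before `a` is materialized on the compute stream.
comm_stream.wait_stream(compute_stream)
with torch.cuda.stream(comm_stream):
    # Placeholder for an AllGather kernel: an asynchronous copy on the side stream.
    staging.copy_(a, non_blocking=True)

# The matmul is enqueued on the compute stream without waiting for the copy,
# so both operations can execute concurrently on the device.
c = a @ b

# Before consuming `staging`, the compute stream waits for the side stream.
compute_stream.wait_stream(comm_stream)
d = c + staging
torch.cuda.synchronize()
```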
The following diagram illustrates the difference between sequential execution and optimized overlapping during the backward pass. In the optimized flow, the communication for Layer N−1 occurs simultaneously with the computation of Layer N.
Comparison of sequential versus pipelined execution flows in the backward pass. The optimized flow allows the compute stream to proceed without waiting for the network, provided the communication finishes before the data is strictly required.
When analyzing a trace in the PyTorch Profiler, you typically view the timeline with the "GPU Kernel" view. To identify communication bottlenecks, locate the NCCL kernels (often labeled ncclDevKernel_AllGather_... or ncclDevKernel_ReduceScatter_...).
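If you do not yet have a trace, a minimal capture sketch looks like the following; `model`, `optimizer`, and `get_batch` are hypothetical placeholders for your own training objects.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

# Sketch of capturing a trace that includes CUDA (and NCCL) kernels.
# `model`, `optimizer`, and `get_batch` are hypothetical placeholders.
def profile_training(model, optimizer, get_batch, steps=8):
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=2, warmup=2, active=4),
        on_trace_ready=torch.profiler.tensorboard_trace_handler("./traces"),
    ) as prof:
        for _ in range(steps):
            loss = model(get_batch()).sum()   # placeholder loss
            loss.backward()                   # AllGather/ReduceScatter fire here under FSDP
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
            prof.step()                       # advance the profiler schedule
```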
You must verify what is happening on the Compute Stream vertically aligned with these NCCL kernels.
- Good overlap: GEMM or element-wise kernels on the Compute Stream sit directly above the NCCL kernels. GPU utilization remains at 100% because the arithmetic units are busy while the data moves over the interconnect.
- Exposed communication: the Compute Stream shows gaps (idle time) while the NCCL kernels run, because the arithmetic units are stalled waiting for the network.

The magnitude of this inefficiency can be calculated by summing the duration of these gaps. The effective step time $T_{\text{step}}$ is defined as:
$$T_{\text{step}} = T_{\text{compute}} + T_{\text{exposed\_comm}}$$

where $T_{\text{exposed\_comm}}$ is the portion of network time not hidden by computation. Your goal is to drive $T_{\text{exposed\_comm}}$ to zero.
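As a quick sanity check, you can plug trace measurements into this formula; the numbers below are hypothetical.

```python
# Hypothetical durations pulled from a profiler trace, in milliseconds.
t_compute = 420.0        # total busy time on the compute stream per step
t_comm_total = 180.0     # total NCCL kernel time per step
t_comm_hidden = 150.0    # NCCL time that ran concurrently with compute kernels

t_exposed = t_comm_total - t_comm_hidden
t_step = t_compute + t_exposed
print(f"Exposed communication: {t_exposed:.1f} ms "
      f"({100 * t_exposed / t_step:.1f}% of the step)")
```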
The primary knob for controlling overlap in FSDP is the backward_prefetch policy. This setting determines when the system requests the parameters for the next layer (in the backward sequence).
- BACKWARD_PRE: issues the AllGather for layer N−1 immediately when the computation for layer N begins. This maximizes the overlap window.
- BACKWARD_POST: delays the prefetch until the current layer's gradient computation has finished, which reduces memory pressure but shrinks the overlap window.

However, BACKWARD_PRE increases peak memory pressure because two sets of unsharded parameters (current and next) must be materialized in VRAM simultaneously. If you observe OOM errors, you might be forced to revert to BACKWARD_POST or enable CPU offloading, accepting the performance penalty.
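A minimal configuration sketch, assuming the process group is already initialized with the NCCL backend and `model` is a constructed `nn.Module`:

```python
from torch.distributed.fsdp import BackwardPrefetch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Sketch: enable aggressive prefetching to maximize overlap in the backward pass.
# `model` is assumed to be an already-constructed nn.Module and torch.distributed
# to be initialized with a NCCL backend.
fsdp_model = FSDP(
    model,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,    # prefetch next shard early
    # backward_prefetch=BackwardPrefetch.BACKWARD_POST, # lower memory, less overlap
)
```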
The ability to hide communication is strictly limited by the ratio of computation to communication for a given layer. This relationship depends on the model architecture and the batch size.
For a Transformer block, the communication volume is proportional to the parameter size P, while the computation is proportional to P×B (where B is the batch size). As you increase the batch size, the computation time grows linearly, but the communication time remains roughly constant (assuming fixed bandwidth).
$$\text{Compute Time} \propto \frac{B \cdot P}{\text{FLOPS}} \qquad \text{Comm Time} \propto \frac{P}{\text{Bandwidth}}$$
Therefore, larger batch sizes make it easier to hide communication. If your profiling reveals significant exposed communication, and you cannot optimize the network further, increasing the batch size (or gradient accumulation steps) is often the most effective software-level fix.
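A back-of-the-envelope sketch makes the crossover concrete; the parameter count, FLOPS, and bandwidth figures below are assumptions for illustration only.

```python
# Illustrative estimate of compute vs. communication time for one FSDP unit.
# All hardware numbers are assumptions, not measurements.
P = 500e6                 # parameters in the unit (e.g., one transformer block)
bytes_per_param = 2       # bf16 weights
peak_flops = 300e12       # sustained matmul throughput per GPU (assumed)
bandwidth = 50e9          # effective interconnect bandwidth, bytes/s (assumed)

comm_time = (P * bytes_per_param) / bandwidth        # AllGather of the unit's weights

for tokens in (1_000, 4_000, 16_000):                # local tokens processed per step
    compute_time = (2 * P * tokens) / peak_flops     # ~2 FLOPs per parameter per token
    status = "hidden" if compute_time >= comm_time else "exposed"
    print(f"{tokens:>6} tokens: compute {compute_time * 1e3:6.1f} ms, "
          f"comm {comm_time * 1e3:6.1f} ms -> communication {status}")
```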
The following chart demonstrates how the proportion of exposed communication decreases as the local batch size increases, eventually plateauing when the network is fully hidden.
As batch size increases, compute time eventually exceeds communication time, eliminating the exposed communication stall (yellow bars).
Sometimes, profiling reveals exposed communication even when the math suggests overlap should be possible. This is often due to CPU-side inefficiencies.
FSDP relies on the CPU to dispatch kernels. If the CPU falls behind (CPU-bound), it cannot issue the AllGather instruction fast enough to fill the pipeline. This appears in the trace as "gaps" between kernels on the GPU timeline, often accompanied by long cudaStreamSynchronize calls on the CPU thread.
To mitigate this:
- Ensure the DataLoader is not bottlenecking the main process.
- Reduce Python-side overhead in your nn.Module code.
- Use torch.compile (in PyTorch 2.x) to fuse kernels and reduce Python interpreter overhead, freeing up CPU cycles to manage the NCCL streams more effectively (a sketch follows at the end of this section).

Analyzing overlap is an iterative process. You identify the largest gap, tune the prefetch depth or batch size, and re-profile until the GPU compute stream shows a solid, unbroken block of execution.
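As a starting point for the torch.compile mitigation above, here is a minimal sketch; it reuses the hypothetical `fsdp_model` from the earlier prefetch example, and whether compiling before or after FSDP wrapping works best depends on your PyTorch version.

```python
import torch

# Sketch (PyTorch 2.x): compile the FSDP-wrapped model to reduce Python
# dispatch overhead on the CPU. `fsdp_model` is the hypothetical module
# from the earlier prefetch example.
compiled_model = torch.compile(fsdp_model)

def train_step(batch, optimizer):
    loss = compiled_model(batch).sum()   # placeholder loss for illustration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss
```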