Distributed training strategies fundamentally differ in their communication patterns. While Distributed Data Parallel (DDP) relies heavily on AllReduce to average gradients across replicas, Fully Sharded Data Parallel (FSDP) decomposes this operation to optimize memory usage. In operation, FSDP relies almost exclusively on two NCCL (NVIDIA Collective Communications Library) primitives: AllGather and ReduceScatter. Understanding the mechanics and cost models of these primitives is necessary for diagnosing bottlenecks in multi-node clusters, where network bandwidth, not GPU compute, often dictates training throughput.
In a sharded environment, no single GPU holds the complete model weights. Before a forward or backward pass can occur for a specific layer (or group of layers), the distributed shards must be assembled into full parameters. Once computation is complete, the results must be synchronized and the temporary full parameters discarded to free memory.
This creates a distinctive "accordion" memory pattern, driven by the following cycle during a training step:

1. AllGather the sharded parameters of the current layer (or group of layers) into full parameters.
2. Run the forward or backward computation with the materialized parameters.
3. Free the full parameters, keeping only the local shard resident.
4. During the backward pass, ReduceScatter the full gradients so each rank retains only the gradient shard it owns.
The following diagram illustrates the state transitions of a model block during these operations.
Transition of data states on a single GPU during FSDP execution. The system oscillates between low-memory sharded states and high-memory materialized states via collective communications.
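The same cycle can be written out as a minimal sketch. The `layer_shards` list and the `compute` callable below are hypothetical placeholders standing in for FSDP's FlatParameters and the real layer math; the collective calls are the standard `torch.distributed` primitives. Real FSDP also overlaps these collectives with computation, which this sketch omits.

```python
import torch
import torch.distributed as dist

def fsdp_like_step(layer_shards, activations, compute, world_size):
    """Illustrative forward-pass accordion: gather -> compute -> free, per layer.

    layer_shards: hypothetical list of 1D parameter shards (one per layer) owned
    by this rank. compute: caller-supplied stand-in for the actual layer math.
    """
    for shard in layer_shards:
        # 1. AllGather: materialize the full flat parameter from all ranks.
        full_param = torch.empty(shard.numel() * world_size,
                                 dtype=shard.dtype, device=shard.device)
        dist.all_gather_into_tensor(full_param, shard)

        # 2. Compute with the temporarily materialized parameter.
        activations = compute(full_param, activations)

        # 3. Free the full parameter; only the local shard stays resident.
        del full_param

    # The backward pass repeats the gather, then calls dist.reduce_scatter_tensor
    # on the full gradient so each rank keeps only its own gradient shard.
    return activations
```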
AllGather is the bandwidth-heavy operation responsible for reconstructing the full model weights. Given $N$ GPUs, where each GPU $i$ holds a parameter shard $P_i$, the goal is for every GPU to possess the set $\{P_0, P_1, \ldots, P_{N-1}\}$.
In a standard ring-based implementation used by NCCL, the communication cost is significant. For a model size M (in bytes), each rank must receive the data it does not possess. The volume of data received by each rank is:
$$V_{\text{rx}} = M \times \frac{N-1}{N}$$
As N increases, the transfer volume approaches the total model size M. This scaling characteristic is important; adding more nodes does not reduce the AllGather communication volume per GPU for the forward pass. It merely fragments the source data more finely.
The theoretical time to complete an AllGather operation on a ring topology with bus bandwidth B is:
$$T_{\text{AllGather}} = \frac{M(N-1)}{B \cdot N}$$
In multi-node clusters, B effectively becomes the bottleneck bandwidth of the inter-node interconnect (e.g., InfiniBand HDR/NDR or RoCEv2), which is typically an order of magnitude slower than intra-node NVLink.
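A quick back-of-envelope helper makes the bandwidth dependence concrete. The numbers below are assumptions chosen for illustration (a 7B-parameter model in bf16 over a 400 Gb/s inter-node link), not measurements:

```python
def allgather_time_s(model_bytes: float, n_gpus: int, bus_bandwidth_gbps: float) -> float:
    """Theoretical ring AllGather time: T = M * (N - 1) / (B * N)."""
    bandwidth_bytes_per_s = bus_bandwidth_gbps * 1e9 / 8  # Gb/s -> bytes/s
    return model_bytes * (n_gpus - 1) / (bandwidth_bytes_per_s * n_gpus)

# Assumed example: 7B parameters in bf16 (~14 GB) over a 400 Gb/s fabric.
model_bytes = 7e9 * 2
for n in (8, 64, 512):
    t = allgather_time_s(model_bytes, n, bus_bandwidth_gbps=400)
    print(f"N={n:4d}: ~{t:.2f} s per full-model AllGather")
```

The per-GPU time barely changes as $N$ grows, which is exactly the saturation behavior discussed below.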
When running Hybrid Sharded Data Parallel (HSDP), the communication pattern changes. PyTorch can utilize hierarchical collectives if supported by the backend, or explicit two-step operations:

1. Intra-node: AllGather parameters and ReduceScatter gradients among the GPUs of a single node over NVLink.
2. Inter-node: AllReduce the already-reduced gradient shards across node groups over the slower network fabric.
This uses the high bandwidth of NVLink for the bulk of the aggregation, reducing the stress on the slower TCP/IP or InfiniBand network.
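A sketch of enabling this hybrid pattern with PyTorch FSDP is shown below. The 4 nodes x 8 GPUs mesh shape and the `build_model()` constructor are assumptions; the exact `device_mesh` integration also varies by PyTorch version, so treat this as a starting point rather than a canonical recipe.

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Assumption: 4 nodes x 8 GPUs. Shard within each node (NVLink),
# replicate across nodes (InfiniBand/Ethernet).
mesh = init_device_mesh("cuda", mesh_shape=(4, 8),
                        mesh_dim_names=("replicate", "shard"))

model = build_model().cuda()  # hypothetical model constructor
model = FSDP(
    model,
    device_mesh=mesh,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```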
ReduceScatter acts as the functional inverse of AllGather combined with a reduction operation. During the backward pass, each GPU computes gradients for the full parameters (since it just materialized them). However, each GPU is only responsible for updating its specific shard of the optimizer state.
Therefore, the system must:

1. Reduce (sum or average) the gradients element-wise across all N ranks.
2. Scatter the result so that each rank retains only the gradient shard corresponding to the parameter shard it owns.
NCCL implements this as a fused operation. A standard AllReduce (used in DDP) would result in every rank holding a full copy of the global gradients. ReduceScatter avoids this redundancy, ensuring that the memory footprint for gradients remains proportional to 1/N.
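Conceptually, ReduceScatter produces the same values as an AllReduce followed by each rank slicing out its own chunk, without ever materializing the full reduced tensor everywhere. A hedged sketch using the standard `torch.distributed` API (the gradient is assumed to already be a flattened 1-D tensor):

```python
import torch
import torch.distributed as dist

def shard_gradients(full_grad: torch.Tensor, world_size: int) -> torch.Tensor:
    """Return only the globally reduced gradient shard this rank owns.
    Assumes full_grad is flattened and its length is divisible by world_size."""
    shard = torch.empty(full_grad.numel() // world_size,
                        dtype=full_grad.dtype, device=full_grad.device)
    # Fused reduce + scatter: sums corresponding chunks across ranks and leaves
    # rank i holding only chunk i of the summed result.
    dist.reduce_scatter_tensor(shard, full_grad, op=dist.ReduceOp.SUM)
    return shard

# Memory-wasteful equivalent using AllReduce (what DDP effectively does):
#   dist.all_reduce(full_grad, op=dist.ReduceOp.SUM)
#   shard = full_grad.chunk(world_size)[dist.get_rank()]
```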
The communication cost of ReduceScatter is identical to that of AllGather:
$$T_{\text{ReduceScatter}} = \frac{M(N-1)}{B \cdot N}$$
Because the backward pass involves both computation and this communication, FSDP implementations aggressively overlap ReduceScatter with the backward computation of the previous layer.
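The overlap itself is handled inside FSDP, but its aggressiveness can be tuned through prefetching flags. A minimal sketch, assuming `model` is an already-constructed module; availability and defaults of these flags vary across PyTorch versions:

```python
from torch.distributed.fsdp import BackwardPrefetch, FullyShardedDataParallel as FSDP

model = FSDP(
    model,
    # Prefetch the next layer's parameter AllGather before the current layer's
    # gradient computation finishes, so communication hides behind backward compute.
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    # Prefetch the next forward AllGather before the current forward completes.
    forward_prefetch=True,
)
```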
While we often model costs using Ring algorithms, NCCL dynamically selects algorithms based on topology and message size.
In FSDP training of Large Language Models, the tensors are sufficiently large that Ring (or segmented Ring) algorithms dominate. However, on extremely large clusters (e.g., >256 GPUs), the linear latency of the ring can prevent effective communication overlap.
The chart below demonstrates the theoretical efficiency drop-off as node count increases, assuming fixed model size and fixed interconnect bandwidth.
As $N$ increases, the term $\frac{N-1}{N}$ rapidly approaches 1. This indicates that after a small number of GPUs, the bandwidth requirement saturates. Adding more nodes does not decrease the communication time per GB of model data; it only distributes the compute and memory capacity.
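A short calculation makes the saturation concrete:

```python
# Fraction of the full model M each GPU must receive per AllGather: (N - 1) / N
for n in (2, 4, 8, 16, 64, 256, 1024):
    print(f"N={n:5d}: {(n - 1) / n:.4f} of M per rank")
```

Beyond roughly 8 to 16 GPUs, the fraction is already within a few percent of 1, so each additional rank changes the per-GPU transfer volume only marginally.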
PyTorch and NCCL do not transmit tensors one by one. Doing so would incur massive kernel launch overheads and fail to saturate the PCIe or NVLink bus. Instead, FSDP utilizes a bucketing strategy.
Tensors are flattened and concatenated into a single contiguous buffer (a "bucket") up to a configured size limit (e.g., 25MB or 50MB). The collective primitive is executed on this bucket.
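Conceptually, the bucketing step looks like the sketch below. This is a simplification: real FSDP builds its FlatParameters once at wrap time rather than re-flattening every step, and `bucket_cap_bytes` here is an illustrative knob, not a library parameter.

```python
import torch
import torch.distributed as dist

def bucketed_all_gather(shards, world_size, bucket_cap_bytes=50 * 1024**2):
    """Flatten shards into contiguous buckets and issue one collective per
    bucket instead of one per tensor. Simplified sketch of the idea."""
    bucket, size = [], 0
    for t in shards:
        bucket.append(t.flatten())
        size += t.numel() * t.element_size()
        if size >= bucket_cap_bytes:
            _gather_bucket(bucket, world_size)
            bucket, size = [], 0
    if bucket:
        _gather_bucket(bucket, world_size)

def _gather_bucket(bucket, world_size):
    flat = torch.cat(bucket)                       # one contiguous buffer
    out = torch.empty(flat.numel() * world_size,
                      dtype=flat.dtype, device=flat.device)
    dist.all_gather_into_tensor(out, flat)         # single kernel launch per bucket
```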
The effective bucket size is an optimization vector: in DDP it is set directly via `bucket_cap_mb`, while in FSDP it is determined by the wrapping granularity, which controls how large each FlatParameter becomes. For multi-node training over Ethernet or InfiniBand, larger buckets (e.g., 100MB+) often yield better throughput by amortizing the protocol overhead, whereas smaller buckets might be preferred on low-latency NVLink-only setups to tighten the overlap window.
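For FSDP, one way to steer this granularity is the size-based auto-wrap policy. The `min_num_params` value below (~50M parameters, roughly 100 MB in bf16) is an assumed starting point for bandwidth-bound multi-node runs, to be tuned against profiler traces:

```python
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# The wrap policy sets the FlatParameter granularity, which plays the role that
# bucket_cap_mb plays in DDP: larger units -> fewer, bigger collectives.
wrap_policy = functools.partial(size_based_auto_wrap_policy,
                                min_num_params=int(5e7))

model = FSDP(model, auto_wrap_policy=wrap_policy)
```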
In a multi-node cluster, AllGather and ReduceScatter must traverse the inter-node fabric. If the cluster uses GPU Direct RDMA (GDR), NCCL can transfer data directly from GPU memory to the Network Interface Card (NIC), bypassing the CPU.
Without GDR, data follows the path: GPU -> CPU RAM -> NIC -> Network -> NIC -> CPU RAM -> GPU. This introduces significant latency and consumes CPU memory bandwidth.
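Whether GDR is in play can be inspected by raising NCCL's log verbosity before the process group is initialized. The environment variables below are standard NCCL settings, but the appropriate `NCCL_NET_GDR_LEVEL` value depends on your PCIe topology, so treat the value shown as an assumption:

```python
import os

# Must be set before torch.distributed.init_process_group("nccl") runs.
os.environ["NCCL_DEBUG"] = "INFO"             # print transport selection at init
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus on network-related messages
# Allow GPU Direct RDMA when the GPU and NIC share a PCIe host bridge (assumed
# topology); consult the NCCL docs for the level matching your hardware.
os.environ["NCCL_NET_GDR_LEVEL"] = "PHB"
```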
When profiling FSDP with the PyTorch Profiler, distinct "wait" states in the NCCL streams usually indicate that the collective operations are blocked waiting for data from the slowest link in the ring. In heterogeneous clusters, a single node with a degraded NIC negotiation (e.g., running at 100Gbps instead of 400Gbps) will throttle the AllGather speed of the entire cluster to that lowest speed, regardless of how fast the other GPUs can compute.
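A minimal profiling setup that exposes those NCCL streams is sketched below. The `dataloader`, `model`, and `optimizer` objects, the loss expression, and the schedule values are assumptions to be adapted to the actual training loop:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

prof_schedule = schedule(wait=1, warmup=1, active=3, repeat=1)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=prof_schedule,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./fsdp_trace"),
) as prof:
    for step, batch in enumerate(dataloader):   # hypothetical dataloader
        loss = model(batch).sum()               # hypothetical loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()                             # advance the profiler schedule
        if step >= 5:
            break

# Inspect the trace in TensorBoard/Perfetto: long gaps on the rows of NCCL
# collective kernels indicate ranks waiting on the slowest link in the ring.
```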