The limitation of Distributed Data Parallel (DDP) is its fundamental redundancy. In a DDP setup with $N$ GPUs, the system maintains $N$ identical copies of the model parameters, gradients, and optimizer states. While this allows the backward pass to be computed in parallel, it creates a memory wall: the maximum model size is strictly capped by the VRAM of a single GPU, regardless of the total cluster capacity.

To break this barrier, we employ the Zero Redundancy Optimizer (ZeRO) family of strategies. ZeRO eliminates this redundancy by partitioning the model state across the data parallel processes. Instead of replicating the full state, each device owns a distinct shard of it. The optimization proceeds in three progressive stages, each trading additional communication complexity for substantial memory savings.

## The Memory Hierarchy of Training

Before partitioning anything, we must quantify what consumes GPU memory. For a model with $\Psi$ parameters trained with mixed precision (FP16/BF16) and the Adam optimizer, the memory footprint is dominated by three components:

- **Optimizer states ($O$):** The most significant consumer. Adam maintains an FP32 copy of the parameters (the master weights), plus momentum and variance buffers. This accounts for approximately 12 bytes per parameter ($4+4+4$).
- **Gradients ($G$):** Stored in FP16/BF16 during the backward pass, accounting for 2 bytes per parameter.
- **Parameters ($P$):** The model weights themselves, used for the forward and backward passes, accounting for 2 bytes per parameter.

In a standard DDP configuration, every GPU holds all $16\Psi$ bytes. ZeRO targets these components sequentially.

## ZeRO Stage 1: Optimizer State Sharding

Stage 1 ($P_{os}$) targets the largest memory consumer: the optimizer states. In this configuration, the parameters ($P$) and gradients ($G$) remain replicated across all devices, preserving the communication pattern of DDP for the forward and backward passes. Only the optimizer step is sharded.

With $N_d$ devices, the optimizer state is split into $N_d$ equal partitions, and the $i$-th device updates only its own shard of the parameters. At the end of the step, an AllGather operation synchronizes the updated parameters across all devices.

Memory consumption per device drops from $2\Psi + 2\Psi + 12\Psi$ to approximately:

$$ \text{Mem}_{\text{Stage1}} = 2\Psi + 2\Psi + \frac{12\Psi}{N_d} $$

For large clusters, this reduces memory usage by nearly 75% compared to DDP, as the optimizer state term approaches zero.

## ZeRO Stage 2: Gradient Sharding

Stage 2 ($P_{os+g}$) extends sharding to the gradients. In standard DDP, gradients are computed locally and then synchronized using an AllReduce operation. AllReduce is logically equivalent to a ReduceScatter followed by an AllGather.

ZeRO Stage 2 modifies this flow. After the backward pass, the system performs only the ReduceScatter: each GPU receives and aggregates the gradients corresponding to the parameter partition it is responsible for updating, and discards the rest. Because the optimizer states are already sharded (from Stage 1), each GPU now holds exactly what it needs to update its parameter shard: the matching optimizer state and the matching accumulated gradients.

Memory consumption becomes:

$$ \text{Mem}_{\text{Stage2}} = 2\Psi + \frac{2\Psi}{N_d} + \frac{12\Psi}{N_d} $$

This stage yields significant gains with minimal communication overhead, since ReduceScatter is already one half of the AllReduce operation that DDP performs anyway.
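To make the decomposition concrete, here is a minimal sketch of the Stage 2 gradient synchronization using raw `torch.distributed` collectives. It assumes one GPU per process launched via `torchrun` and a flattened gradient tensor whose length divides evenly by the world size; the function name and tensor sizes are illustrative, not part of any library.

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> <this_script>.py
import os

import torch
import torch.distributed as dist


def zero2_gradient_sync(flat_grads: torch.Tensor) -> torch.Tensor:
    """Reduce gradients across ranks, keeping only this rank's shard (ZeRO-2 style)."""
    world_size = dist.get_world_size()
    shard = torch.empty(
        flat_grads.numel() // world_size,
        dtype=flat_grads.dtype,
        device=flat_grads.device,
    )
    # ReduceScatter: gradients are summed across all ranks, but each rank
    # receives only the slice matching the parameters it will update.
    dist.reduce_scatter_tensor(shard, flat_grads, op=dist.ReduceOp.SUM)
    # DDP would now AllGather to rebuild the full gradient on every rank
    # (AllReduce == ReduceScatter + AllGather); ZeRO-2 stops here and frees
    # the full-size gradient buffer instead.
    return shard


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # Stand-in for the flattened local gradients produced by the backward pass.
    grads = torch.full((1024,), float(dist.get_rank()), device="cuda")
    shard = zero2_gradient_sync(grads)
    print(f"rank {dist.get_rank()} keeps a reduced shard of {shard.numel()} elements")
    dist.destroy_process_group()
```

DDP would complete the pair with an AllGather to rebuild the full gradient everywhere; skipping that second half is precisely what buys Stage 2 its memory savings at no extra communication cost.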
## ZeRO Stage 3: Parameter Sharding

Stage 3 ($P_{os+g+p}$) is the core mechanism behind what is colloquially termed "full" FSDP. In this stage, the model parameters themselves are sharded: no single GPU holds the complete model weights at rest.

This introduces a new challenge: computing the forward and backward passes requires the full weights of whichever layers are currently being processed. ZeRO-3 solves this through temporal materialization.

- **Forward pass:** Before a layer computes its output, FSDP triggers an AllGather to fetch the missing parameter shards from the other GPUs. The layer computes, and the gathered parameters are immediately freed to save memory.
- **Backward pass:** The system again AllGathers the full parameters to compute gradients, then discards them.

This approach cannot train models larger than the aggregate memory of the entire cluster, but it does allow models as large as the sum of all GPU memory, minus activation overheads.

The memory per device is reduced to the theoretical minimum:

$$ \text{Mem}_{\text{Stage3}} = \frac{2\Psi + 2\Psi + 12\Psi}{N_d} = \frac{16\Psi}{N_d} $$

The following visualization demonstrates the memory distribution across 4 devices under different strategies. Note how Stage 3 (FSDP) distributes the entire stack evenly.

```dot
digraph G {
    rankdir=TB;
    node [shape=record, style=filled, fontname="Helvetica", fontsize=10];
    edge [fontname="Helvetica", fontsize=8];
    bgcolor="transparent";
    subgraph cluster_0 {
        label="DDP (Replicated)"; style=dashed; color="#adb5bd"; fontcolor="#495057";
        struct1 [label="{Params (2)|Grads (2)|Opt State (12)}", color="#a5d8ff", fillcolor="#a5d8ff"];
        struct2 [label="{Params (2)|Grads (2)|Opt State (12)}", color="#a5d8ff", fillcolor="#a5d8ff"];
        struct3 [label="{Params (2)|Grads (2)|Opt State (12)}", color="#a5d8ff", fillcolor="#a5d8ff"];
        struct4 [label="{Params (2)|Grads (2)|Opt State (12)}", color="#a5d8ff", fillcolor="#a5d8ff"];
    }
    subgraph cluster_1 {
        label="ZeRO-3 (Fully Sharded)"; style=dashed; color="#adb5bd"; fontcolor="#495057";
        s1 [label="{P_1|G_1|O_1}", color="#96f2d7", fillcolor="#96f2d7"];
        s2 [label="{P_2|G_2|O_2}", color="#96f2d7", fillcolor="#96f2d7"];
        s3 [label="{P_3|G_3|O_3}", color="#96f2d7", fillcolor="#96f2d7"];
        s4 [label="{P_4|G_4|O_4}", color="#96f2d7", fillcolor="#96f2d7"];
    }
    // Invisible edges for alignment
    struct1 -> s1 [style=invis];
    struct2 -> s2 [style=invis];
    struct3 -> s3 [style=invis];
    struct4 -> s4 [style=invis];
}
```

*Comparison of state allocation between DDP and ZeRO-3 (FSDP) on a 4-GPU cluster. DDP replicates the full state; FSDP shards all components.*
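The three memory formulas can be checked with a few lines of arithmetic. The helper below is purely illustrative (it is not part of DeepSpeed or PyTorch) and reproduces the per-GPU figures charted in the next section for a 10-billion-parameter model on 8 GPUs.

```python
def zero_memory_per_gpu(num_params: float, num_devices: int, stage: int) -> float:
    """Per-GPU model-state memory in GB for mixed-precision Adam training.

    stage: 0 = plain DDP, 1/2/3 = the corresponding ZeRO stage.
    """
    params = 2 * num_params      # FP16/BF16 weights
    grads = 2 * num_params       # FP16/BF16 gradients
    optim = 12 * num_params      # FP32 master weights + momentum + variance
    if stage >= 3:
        params /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 1:
        optim /= num_devices
    return (params + grads + optim) / 1e9


for stage in (0, 1, 2, 3):
    print(f"Stage {stage}: {zero_memory_per_gpu(10e9, 8, stage):.1f} GB per GPU")
# Stage 0: 160.0 GB, Stage 1: 55.0 GB, Stage 2: 37.5 GB, Stage 3: 20.0 GB
```

These totals cover only the model state; activation memory and temporary communication buffers come on top.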
## Quantitative Impact on Memory

The choice of stage dramatically shifts the maximum trainable model size. While Stages 1 and 2 offer substantial reductions, Stage 3 enables linear scaling with the number of GPUs.

The chart below illustrates the memory consumption per GPU for a theoretical 10-billion-parameter model (requiring approximately 160 GB of total state) across an 8-GPU cluster.

```json
{
  "layout": {
    "title": "Memory Footprint per GPU (10B Param Model, 8 GPUs)",
    "barmode": "stack",
    "template": "simple_white",
    "font": {"family": "Helvetica"},
    "xaxis": {"title": "Sharding Strategy"},
    "yaxis": {"title": "Memory (GB)"},
    "showlegend": true,
    "legend": {"orientation": "h", "yanchor": "bottom", "y": 1.02, "xanchor": "right", "x": 1}
  },
  "data": [
    {"type": "bar", "name": "Parameters (FP16)", "x": ["DDP", "ZeRO-1", "ZeRO-2", "ZeRO-3"], "y": [20, 20, 20, 2.5], "marker": {"color": "#339af0"}},
    {"type": "bar", "name": "Gradients (FP16)", "x": ["DDP", "ZeRO-1", "ZeRO-2", "ZeRO-3"], "y": [20, 20, 2.5, 2.5], "marker": {"color": "#fcc2d7"}},
    {"type": "bar", "name": "Optimizer (FP32+)", "x": ["DDP", "ZeRO-1", "ZeRO-2", "ZeRO-3"], "y": [120, 15, 15, 15], "marker": {"color": "#69db7c"}}
  ]
}
```

*Memory usage breakdown per GPU. Note that ZeRO-1 provides the largest single drop in memory usage by sharding the optimizer state, while ZeRO-3 minimizes the footprint of all components.*

## Communication Trade-offs

These memory benefits are not free; the currency is network bandwidth.

- **DDP:** Requires an AllReduce on the gradients. Communication volume is roughly $2\Psi$ per step (a ReduceScatter plus an AllGather, each moving about $\Psi$ elements).
- **ZeRO-3:** Requires an AllGather on the parameters in the forward pass, another AllGather in the backward pass, and a ReduceScatter on the gradients. The total communication volume increases to approximately $3\Psi$.

In bandwidth-restricted environments (such as standard Ethernet), the extra latency of fetching parameters on demand in Stage 3 can throttle compute throughput. This makes high-speed interconnects like NVLink or InfiniBand essential for Stage 3 training, a topic we will address in the Multi-Node Networking chapter.

## Implications for FSDP Implementation

In PyTorch FSDP, these stages are not separate implementations; they are selected through the `sharding_strategy` parameter:

- `ShardingStrategy.FULL_SHARD` maps to ZeRO-3.
- `ShardingStrategy.SHARD_GRAD_OP` maps to ZeRO-2.
- `ShardingStrategy.NO_SHARD` behaves like DDP.

Understanding these stages allows you to select the correct strategy for your hardware constraints. If your model fits in memory with ZeRO-2, it is often preferred over ZeRO-3 due to its lower communication overhead. However, for the terabyte-scale models that define modern AI, ZeRO-3 is often the only viable path forward.
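As a closing illustration, here is a minimal sketch of wrapping a toy model with FSDP and choosing between these strategies. It assumes a single-node `torchrun` launch with one GPU per process; the model, wrap-policy threshold, and hyperparameters are placeholders, not recommendations.

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> <this_script>.py
import functools
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy stack of linear layers standing in for a real transformer.
model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()

model = FSDP(
    model,
    # FULL_SHARD    -> ZeRO-3: parameters, gradients, and optimizer state sharded.
    # SHARD_GRAD_OP -> ZeRO-2: gradients and optimizer state sharded.
    # NO_SHARD      -> behaves like DDP (everything replicated).
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,   # compute in BF16
        reduce_dtype=torch.bfloat16,  # gradient reduction in BF16
        buffer_dtype=torch.bfloat16,
    ),
    # Wrap submodules above ~1M parameters so each becomes its own AllGather unit.
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=1_000_000
    ),
    device_id=local_rank,
)

# The optimizer only sees this rank's flattened parameter shards,
# so its state is sharded automatically.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One training step: parameters are AllGathered on demand in forward/backward,
# gradients are ReduceScattered, and each rank updates only its own shard.
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

dist.destroy_process_group()
```

Switching `sharding_strategy` to `SHARD_GRAD_OP` or `NO_SHARD` moves the same script down the ZeRO hierarchy without any other code changes.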