Training large language models introduces a memory challenge that linear scaling of hardware cannot solve alone. When using standard Distributed Data Parallel (DDP), the training process hits a hard ceiling defined by the VRAM of a single GPU. Understanding the exact composition of this memory footprint is necessary for engineering systems capable of training models with billions or trillions of parameters.
To optimize memory usage, we must first quantify the cost of training a single parameter. It is a common misconception that a model with parameter count Ψ requires 4Ψ bytes (assuming 32-bit floats) or 2Ψ bytes (assuming 16-bit floats) of memory. In practice, the memory consumption during training with the Adam optimizer and mixed precision is significantly higher.
In a standard mixed-precision training pipeline (using FP16 or BF16 for computation and FP32 for weight updates), the system must maintain several copies of the model state. For every individual parameter in the model, the memory allocation consists of:
- 2 bytes for the FP16/BF16 working copy of the weight
- 2 bytes for the FP16/BF16 gradient
- 4 bytes for the FP32 master copy of the weight
- 4 bytes for the FP32 Adam momentum (first moment)
- 4 bytes for the FP32 Adam variance (second moment)
Summing these components yields the memory constant for mixed-precision training:
$$M_{\text{param}} = 2 + 2 + 4 + 4 + 4 = 16 \text{ bytes}$$
Therefore, a model with Ψ parameters requires 16Ψ bytes of static memory. A 7 billion parameter model, often considered "small" in modern LLM contexts, requires $7 \times 10^9 \times 16$ bytes, or approximately 112 GB of VRAM, merely to hold the weights and optimizer states. This exceeds the capacity of an NVIDIA A100 (80 GB) before a single token is processed.
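To make this accounting concrete, here is a small framework-agnostic sketch that reproduces the 16-byte-per-parameter figure; the constant names are introduced here purely for illustration.

```python
# Static model-state memory under mixed-precision Adam training.
BYTES_FP16_WEIGHT = 2    # FP16/BF16 working copy of the weight
BYTES_FP16_GRAD = 2      # FP16/BF16 gradient
BYTES_FP32_MASTER = 4    # FP32 master copy of the weight
BYTES_FP32_MOMENTUM = 4  # Adam first moment
BYTES_FP32_VARIANCE = 4  # Adam second moment

BYTES_PER_PARAM = (BYTES_FP16_WEIGHT + BYTES_FP16_GRAD + BYTES_FP32_MASTER
                   + BYTES_FP32_MOMENTUM + BYTES_FP32_VARIANCE)  # 16 bytes


def static_memory_gb(num_params: float) -> float:
    """Static model-state memory in GB; ignores activations and fragmentation."""
    return num_params * BYTES_PER_PARAM / 1e9


print(static_memory_gb(7e9))  # 112.0 GB for a 7B model, more than one 80 GB A100
```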
DDP functions by replicating the entire model state across every worker in the cluster. If you deploy a cluster of N GPUs, DDP creates N identical copies of the model parameters, gradients, and optimizer states. The communication step in DDP (AllReduce) synchronizes gradients across workers, but it does not reduce the memory footprint on any individual device.
The efficiency of DDP degrades as model size increases. While DDP allows you to scale batch size by adding GPUs, it does not allow you to scale model size. The memory requirement per GPU remains constant regardless of the cluster size:
$$M_{\text{DDP}} = 16\Psi + \text{Activation Memory} + \text{Fragmentation}$$
This architecture results in massive memory redundancy. In a cluster of 16 GPUs training a 1 billion parameter model (a 16 GB footprint), the total cluster memory devoted to model state is $16 \times 16\text{ GB} = 256$ GB. However, the unique information stored is only 16 GB; the remaining 240 GB is duplicate data.
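A minimal PyTorch sketch of this setup is shown below, assuming a torchrun launch; the small Sequential model is a toy stand-in for a real LLM. The point to notice is that the full, unsharded model must fit on each device before DDP can wrap it.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a torchrun launch, which sets RANK, LOCAL_RANK, and WORLD_SIZE.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy stand-in for a real LLM: with DDP, the FULL model must fit on this one GPU.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda(local_rank)

ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters())

# Every rank now holds a complete replica of weights, gradients, and optimizer
# states. The AllReduce during backward() synchronizes gradients across ranks
# but does not reduce per-device memory.
```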
Chart: Memory allocation breakdown per parameter in mixed-precision training.
The Zero Redundancy Optimizer (ZeRO) addresses this inefficiency by acknowledging that while all GPUs need access to all weights during the forward and backward passes, they do not need to persist all weights, gradients, and optimizer states simultaneously.
ZeRO partitions (shards) the model states across the available data-parallel processes. If there are $N_d$ GPUs, ZeRO splits the states such that each GPU owns $1/N_d$ of the total. This sharding can be applied in three progressive stages, each offering greater memory savings at the cost of increased communication complexity.
The optimizer states (Master Weights, Momentum, Variance) constitute the bulk of the memory footprint (12 bytes out of 16). In Stage 1, these states are sharded across $N_d$ GPUs. Each GPU updates only its assigned partition of the optimizer states.
$$M_{\text{Stage 1}} = 2\Psi\ (\text{Weights}) + 2\Psi\ (\text{Gradients}) + \frac{12\Psi}{N_d}\ (\text{Optimizer States})$$
Stage 2 extends sharding to the gradients. As gradients are computed during the backward pass, they are reduce-scattered immediately, so each GPU retains only the gradient shard that matches its optimizer partition rather than holding the full gradient tensor.
$$M_{\text{Stage 2}} = 2\Psi\ (\text{Weights}) + \frac{2\Psi + 12\Psi}{N_d}$$
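In DeepSpeed, where ZeRO originated, the stage is selected through the zero_optimization block of the training config. Below is a minimal illustrative sketch of such a config as a Python dict; the batch size and optimizer settings are placeholder values.

```python
# Minimal, illustrative DeepSpeed configuration expressed as a Python dict.
# The values are placeholders; the dict is supplied to deepspeed.initialize
# as the training config.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,                 # 1: optimizer states, 2: + gradients, 3: + parameters
        "overlap_comm": True,       # overlap gradient reduction with backward compute
        "contiguous_gradients": True,
    },
}
```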
Stage 3 is the core of FSDP. It shards the model parameters themselves. At this stage, a GPU only persists a fraction of the model. When a specific layer is needed for computation, the parameters are gathered from other GPUs, used, and then immediately discarded to free memory.
$$M_{\text{Stage 3}} = \frac{16\Psi}{N_d}$$
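The three formulas can be collected into a single back-of-the-envelope calculator. The sketch below treats stage 0 as plain DDP and, like the formulas above, ignores activations and fragmentation.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory (GB) under ZeRO stages 0-3.

    Stage 0 corresponds to plain DDP (no sharding).
    """
    weights, grads, opt_states = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        opt_states /= num_gpus   # shard optimizer states
    if stage >= 2:
        grads /= num_gpus        # shard gradients
    if stage >= 3:
        weights /= num_gpus      # shard parameters (FSDP)
    return (weights + grads + opt_states) / 1e9


for stage in range(4):
    print(f"Stage {stage}: {zero_model_state_gb(7e9, 64, stage):.1f} GB per GPU")
# 7B params on 64 GPUs -> Stage 0: 112.0, Stage 1: 29.3, Stage 2: 15.5, Stage 3: 1.8
```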
The theoretical limit of ZeRO Stage 3 allows the memory footprint per device to approach zero as the number of devices $N_d$ increases, leaving the majority of VRAM available for activations and larger batch sizes.
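In PyTorch, this corresponds to wrapping the model with FullyShardedDataParallel. The sketch below assumes a torchrun launch and uses a toy model as a stand-in for a real LLM; production setups typically add an auto-wrap policy per transformer block and offload options on top of this.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy stand-in for a real LLM; FSDP shards its parameters across all ranks at wrap time.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
)

fsdp_model = FSDP(
    model,
    device_id=local_rank,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
)
optimizer = torch.optim.AdamW(fsdp_model.parameters())

# Each rank persists only its 1/N_d shard of parameters, gradients, and optimizer
# states; full layers are all-gathered on demand during forward/backward and
# freed again immediately afterwards.
```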
The divergence in memory efficiency becomes pronounced as we scale the number of GPUs. With DDP, adding GPUs yields no reduction in memory pressure per device. With FSDP (ZeRO Stage 3), memory pressure decreases linearly with the addition of hardware.
Consider, for instance, training a large language model with Ψ parameters. The chart below shows the maximum model size trainable on an 80 GB A100 GPU as the cluster size increases.
Chart: Maximum trainable model size (in billions of parameters) per GPU as cluster size scales.
In the DDP configuration (red line), the maximum model size is strictly capped around 3.5 billion parameters per GPU (allowing buffer for activations). Adding 60 more GPUs does not change this limit. In the FSDP configuration (blue line), the capacity scales linearly. With 64 GPUs, the cluster can effectively train a model approaching 175 billion parameters, as the 16Ψ static state is distributed thinly across the cluster.
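The shape of this comparison can be reproduced with the rough calculation below. The 24 GB per-GPU reserve for activations, buffers, and fragmentation is an assumed figure; in practice the reserve grows with model width and sequence length, which is why the 64-GPU FSDP capacity lands nearer 175 billion than the naive linear bound.

```python
def max_trainable_params_b(num_gpus: int, gpu_mem_gb: float = 80.0,
                           reserve_gb: float = 24.0, sharded: bool = False) -> float:
    """Rough upper bound on trainable model size, in billions of parameters.

    Assumes 16 bytes of static state per parameter and a fixed per-GPU reserve
    for activations, buffers, and fragmentation.
    """
    budget_bytes = (gpu_mem_gb - reserve_gb) * 1e9
    if sharded:                  # FSDP / ZeRO-3: static state is spread over the cluster
        budget_bytes *= num_gpus
    return budget_bytes / 16 / 1e9


for n in (1, 8, 16, 64):
    print(f"{n:>2} GPUs | DDP: {max_trainable_params_b(n):6.1f}B | "
          f"FSDP: {max_trainable_params_b(n, sharded=True):6.1f}B")
# DDP stays at 3.5B no matter how many GPUs are added; FSDP grows roughly linearly
# (224B at 64 GPUs here, an optimistic bound, since larger models also need a
# larger activation reserve per GPU).
```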
It is important to note that ZeRO only reduces the memory footprint of the model state. It does not inherently reduce the memory required for activations, the intermediate outputs of layers stored for the backward pass. Activation memory depends on batch size, sequence length, and transformer architecture (e.g., hidden dimension, attention heads).
While FSDP frees up massive amounts of VRAM from model parameters, training terabyte-scale models often requires combining FSDP with Activation Checkpointing (recomputing activations during the backward pass) to keep activation memory within bounds. We will implement this integration in Chapter 3.
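PyTorch exposes this recomputation through torch.utils.checkpoint. The snippet below is a minimal sketch of the pattern for a stack of transformer blocks (the blocks and hidden_states arguments are placeholders); the full integration with FSDP is left to Chapter 3.

```python
from torch.utils.checkpoint import checkpoint


def forward_with_checkpointing(blocks, hidden_states):
    """Run a stack of transformer blocks, recomputing activations in backward."""
    for block in blocks:
        # Only the block inputs are stored; intermediate activations inside the
        # block are recomputed during the backward pass.
        hidden_states = checkpoint(block, hidden_states, use_reentrant=False)
    return hidden_states
```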
By shifting from DDP to FSDP, we move from a regime where model architecture is constrained by single-device limits to a regime where model size is constrained only by total cluster capacity and network bandwidth.