When training deep learning models, the memory capacity of a single GPU acts as a strict physical limit. Distributed Data Parallel (DDP) has long been the standard method for scaling training across multiple devices. However, DDP operates by replicating the entire model state on every GPU. While effective for smaller architectures, this redundancy becomes a bottleneck for large language models. If a model's parameters, gradients, and optimizer states collectively exceed the VRAM of a single device, DDP cannot function, regardless of the total number of GPUs in your cluster.
This chapter examines the architectural shift from model replication to model sharding. We will analyze the specific memory costs of training, breaking down the consumption of 32-bit optimizer states versus 16-bit weights and gradients. For a model with Ψ parameters, we will demonstrate why the memory footprint during mixed-precision training with Adam reaches approximately 16Ψ bytes, rather than the 2Ψ or 4Ψ suggested by counting the weights alone.
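As a quick sanity check on that figure, the short breakdown below tallies the per-parameter byte counts; the variable names are illustrative, and activation memory and temporary buffers are ignored.

```python
# Per-parameter memory during mixed-precision Adam training (activations excluded).
PSI = 1  # bytes are counted per parameter; multiply by the parameter count Ψ

fp16_weights   = 2 * PSI  # model weights kept in fp16 for the forward/backward passes
fp16_gradients = 2 * PSI  # gradients produced in fp16
fp32_master    = 4 * PSI  # fp32 master copy of the weights held by the optimizer
fp32_momentum  = 4 * PSI  # Adam first-moment estimate (fp32)
fp32_variance  = 4 * PSI  # Adam second-moment estimate (fp32)

total = fp16_weights + fp16_gradients + fp32_master + fp32_momentum + fp32_variance
print(total)  # 16 -> roughly 16Ψ bytes overall, e.g. ~112 GB for a 7B-parameter model
```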
You will learn the theoretical and practical foundations of the Zero Redundancy Optimizer (ZeRO), which powers PyTorch FSDP. The content details how ZeRO partitions the model states (parameters, gradients, and optimizer states) across the available devices to eliminate this redundancy. We will cover the three stages of ZeRO optimization: Stage 1 partitions the optimizer states, Stage 2 additionally partitions the gradients, and Stage 3 partitions the parameters themselves.
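To see how each stage changes the per-GPU footprint, here is a small sketch that applies the same 2Ψ + 2Ψ + 12Ψ accounting and divides the sharded portions across the data-parallel group; the function name and the 7B-parameter, 8-GPU example are assumptions for illustration.

```python
def per_gpu_bytes(psi: float, n_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory under each ZeRO stage (activations excluded)."""
    weights, grads, opt = 2 * psi, 2 * psi, 12 * psi  # same breakdown as above
    if stage == 0:  # DDP-style replication: every GPU holds everything
        return weights + grads + opt
    if stage == 1:  # ZeRO-1: shard the optimizer states
        return weights + grads + opt / n_gpus
    if stage == 2:  # ZeRO-2: shard optimizer states and gradients
        return weights + (grads + opt) / n_gpus
    if stage == 3:  # ZeRO-3: shard optimizer states, gradients, and parameters
        return (weights + grads + opt) / n_gpus
    raise ValueError("stage must be 0, 1, 2, or 3")

# A 7B-parameter model across 8 GPUs:
for stage in range(4):
    print(f"stage {stage}: {per_gpu_bytes(7e9, 8, stage) / 1e9:.2f} GB per GPU")
# stage 0: 112.00 GB, stage 1: 38.50 GB, stage 2: 26.25 GB, stage 3: 14.00 GB
```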
By understanding these mechanics, you will be able to calculate the precise memory requirements for your hardware and configure the necessary communication primitives. The chapter concludes with a practical exercise where you will wrap a standard PyTorch module with FSDP, establishing a baseline for the advanced optimizations covered later in the course.
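A minimal sketch of that baseline exercise follows, assuming a single-node launch with torchrun (which sets RANK, LOCAL_RANK, and WORLD_SIZE); the toy module, dimensions, and hyperparameters are placeholders rather than prescriptions.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy


def main() -> None:
    # Process group setup; torchrun supplies the rank and world-size environment variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A stand-in model; any nn.Module can be wrapped the same way.
    model = nn.Sequential(
        nn.Linear(4096, 4096),
        nn.GELU(),
        nn.Linear(4096, 4096),
    ).cuda()

    # FULL_SHARD corresponds to ZeRO-3: parameters, gradients, and optimizer
    # states are all partitioned across the ranks.
    fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

    # Build the optimizer after wrapping so it operates on the sharded parameters.
    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

    # One illustrative training step with random data.
    inputs = torch.randn(8, 4096, device="cuda")
    loss = fsdp_model(inputs).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nproc_per_node=8 train_fsdp.py (a hypothetical script name), each rank holds only its shard of the model state outside of the forward and backward passes.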
1.1 Memory Consumption in DDP vs FSDP
1.2 ZeRO Stages and Sharding Strategies
1.3 Communication Volume Analysis
1.4 Implementing Basic FSDP Wrappers