Distributed Training of Large Models with PyTorch FSDP
Chapter 1: Limits of Data Parallelism and ZeRO Fundamentals
Memory Consumption in DDP vs FSDP
ZeRO Stages and Sharding Strategies
Communication Volume Analysis
Implementing Basic FSDP Wrappers
Chapter 2: Model Wrapping and Initialization Policies
Transformer Wrapping Policies
Custom Wrapping Strategies
Delayed Initialization and the Meta Device
Handling Shared Parameters
Practice: Advanced Wrapping Configuration
Chapter 3: Mixed Precision and Memory Optimization
BFloat16 vs Float16 Configurations
Activation Checkpointing Mechanics
CPU Offloading Implementation
Gradient Accumulation with Sharding
Practice: Tuning Memory Constraints
Chapter 4: Multi-Node Scaling and NCCL Tuning
Initializing Multi-Node Process Groups
NCCL Collective Communication Primitives
Rate Limiting and Backward Prefetching
Hybrid Sharding Strategies
Practice: Multi-Node Cluster Setup
Chapter 5: Distributed Checkpointing and Fault Tolerance
Sharded vs Full State Dictionaries
PyTorch Distributed Checkpointing API
Elastic Training Integration
Practice: Implementing Resumable Training
Chapter 6: Profiling and Performance Engineering
Interpreting PyTorch Profiler Traces
Analyzing Communication Overlap
Memory Fragmentation Analysis
Throughput Optimization Techniques
Practice: Optimization Case Studies