Scaling deep learning models beyond a single GPU requires sophisticated parallelization strategies. This course examines Fully Sharded Data Parallel (FSDP) in PyTorch, a technique essential for training Large Language Models (LLMs) and other parameter-heavy architectures. It addresses the limitations of Distributed Data Parallel (DDP) and shows how FSDP implements the Zero Redundancy Optimizer (ZeRO) algorithms natively in PyTorch. Topics cover sharding strategies, mixed-precision training with BFloat16, activation checkpointing, and CPU offloading. The curriculum extends to multi-node cluster configuration, analyzing network bottlenecks with NCCL, and managing distributed state dicts for fault tolerance. Throughout, the focus is on performance tuning and memory efficiency for terabyte-scale models.
Prerequisites: Advanced PyTorch, distributed training concepts
Level: Advanced
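As a preview of the patterns the course builds on, the following is a minimal sketch of an FSDP training step with a BFloat16 mixed-precision policy. It is illustrative only, not course code: ToyModel, the layer sizes, and the file name are placeholders, and the script assumes a torchrun launch on NCCL-capable GPUs.

```python
# Minimal FSDP sketch: shard a placeholder model and train one step in BFloat16.
# Assumes a torchrun launch (torchrun --nproc_per_node=8 fsdp_minimal.py) on CUDA GPUs.
import functools
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


class ToyModel(nn.Module):
    """Placeholder model; a real workload would be a transformer or similar."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)])

    def forward(self, x):
        return self.layers(x)


def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Keep parameters, gradient reductions, and buffers in BFloat16.
    bf16_policy = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )

    model = FSDP(
        ToyModel(),
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
        mixed_precision=bf16_policy,
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=100_000
        ),
        device_id=local_rank,  # FSDP moves the module to this GPU
    )

    # The optimizer is built after wrapping so it sees the sharded parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(4, 1024, device="cuda")
    loss = model(x).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```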
FSDP Architecture
Architect scaling solutions using ZeRO stages to partition parameters, gradients, and optimizer states.
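As a rough illustration of how the ZeRO stages surface in the FSDP API, the sketch below maps stages to ShardingStrategy values. The dictionary and the wrap_for_zero_stage helper are illustrative names only, and an initialized NCCL process group is assumed.

```python
# Sketch: mapping ZeRO stages onto FSDP sharding strategies.
# Assumes dist.init_process_group("nccl") and torch.cuda.set_device(...) have already run.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# ZeRO-1 (optimizer-state sharding only) has no direct FSDP equivalent; the
# closest strategies are listed below. HYBRID_SHARD additionally shards within
# a node while replicating across nodes.
ZERO_STAGE_TO_STRATEGY = {
    0: ShardingStrategy.NO_SHARD,       # plain replication, comparable to DDP
    2: ShardingStrategy.SHARD_GRAD_OP,  # shard gradients + optimizer state, replicate params
    3: ShardingStrategy.FULL_SHARD,     # shard params, gradients, and optimizer state
}


def wrap_for_zero_stage(model: torch.nn.Module, stage: int) -> FSDP:
    """Wrap `model` with the sharding strategy matching a ZeRO stage (illustrative helper)."""
    return FSDP(
        model,
        sharding_strategy=ZERO_STAGE_TO_STRATEGY[stage],
        device_id=torch.cuda.current_device(),
    )
```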
Memory Optimization
Implement activation checkpointing and CPU offloading to reduce per-GPU memory pressure and fit larger models and batch sizes.
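A minimal sketch of combining the two techniques, assuming an already-initialized process group; TransformerBlock is a placeholder module, and checkpointing is applied after FSDP wrapping, following the order used in PyTorch's FSDP tutorials.

```python
# Sketch: FSDP CPU offloading plus non-reentrant activation checkpointing.
# Assumes dist.init_process_group("nccl") and torch.cuda.set_device(...) have already run.
import functools

import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP


class TransformerBlock(nn.Module):
    """Placeholder block standing in for whichever submodule you choose to checkpoint."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)


model = nn.Sequential(*[TransformerBlock() for _ in range(4)])

# Shard first; offloaded parameters sit in host RAM between uses and stream back
# to the GPU on demand (lower memory footprint, more PCIe traffic).
sharded_model = FSDP(
    model,
    cpu_offload=CPUOffload(offload_params=True),
    device_id=torch.cuda.current_device(),
)

# Then recompute each block's activations in the backward pass instead of storing them.
non_reentrant_wrapper = functools.partial(
    checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
)
apply_activation_checkpointing(
    sharded_model,
    checkpoint_wrapper_fn=non_reentrant_wrapper,
    check_fn=lambda m: isinstance(m, TransformerBlock),
)
```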
Multi-Node Networking
Configure and tune NCCL communications for efficient cross-node scaling.
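The sketch below shows one way to pin NCCL to a specific network interface and initialize the process group across nodes. The interface name eth0, the hostnames, and the two-node topology are placeholders for whatever your cluster actually uses, and in practice the NCCL variables are usually exported in the job script rather than set in Python.

```python
# Sketch: multi-node NCCL setup driven by torchrun-provided environment variables.
#
# Typical launch on each node (node_rank differs per node; endpoint is a placeholder):
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=$NODE_RANK \
#            --rdzv_backend=c10d --rdzv_endpoint=node0.example.com:29500 train.py
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# NCCL reads its tuning knobs from the environment; shown in-process for illustration only.
os.environ.setdefault("NCCL_DEBUG", "INFO")          # log transport and ring/tree selection
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # pin NCCL to the intended NIC (placeholder)
# os.environ["NCCL_IB_DISABLE"] = "1"                # fall back to TCP if InfiniBand misbehaves


def init_distributed() -> int:
    """Initialize the NCCL process group using RANK/WORLD_SIZE/LOCAL_RANK set by torchrun."""
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    if dist.get_rank() == 0:
        print(f"world_size={dist.get_world_size()} initialized over NCCL")
    return local_rank
```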
Performance Profiling
Analyze communication-computation overlap and resolve memory fragmentation issues.
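A sketch of wrapping a few training steps in torch.profiler to inspect overlap in the trace, together with the allocator setting commonly used against fragmentation. The model, optimizer, and batches arguments are placeholders; on the FSDP side, overlap is also influenced by constructor options such as backward_prefetch and limit_all_gathers.

```python
# Sketch: profile FSDP training steps and check allocator health.
# Assumes an initialized process group and an FSDP-wrapped `model` (placeholders here).
import os

import torch
import torch.distributed as dist
from torch.profiler import ProfilerActivity, profile, schedule

# Must take effect before the first CUDA allocation; reduces fragmentation from
# FSDP's varying all-gather buffer sizes. Usually exported in the launch script.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


def profile_training(model, optimizer, batches):
    """Record a short trace; inspect it in TensorBoard or chrome://tracing."""
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3),
        on_trace_ready=torch.profiler.tensorboard_trace_handler("./fsdp_trace"),
    ) as prof:
        for batch in batches:
            loss = model(batch).sum()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            prof.step()  # advance the profiler schedule each iteration

    # In the trace, NCCL all-gather/reduce-scatter kernels should overlap with
    # compute kernels; idle gaps indicate exposed communication.
    if dist.get_rank() == 0:
        # Large gaps between reserved and allocated memory hint at fragmentation.
        print(torch.cuda.memory_summary())
```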