Leveraging Microsoft DeepSpeed for ZeRO and Offloading
ZeRO: Memory Optimizations Toward Training Trillion-Parameter Models, Samyam Rajbhandari, Cong Guo, Jeff Rasley, Shaden Smith, Yuxiong He, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (ACM), DOI: 10.1145/3418856.3418915 - Presents the Zero Redundancy Optimizer (ZeRO) and its three stages for memory-efficient scaling of deep learning model training.
DeepSpeed Official Documentation, Microsoft DeepSpeed Team, 2024 - Provides comprehensive guides and API references for the DeepSpeed library, including ZeRO configuration and offloading features (a minimal configuration sketch follows this list).
torch.nn.parallel.DistributedDataParallel, PyTorch Core Team, 2024 - Official documentation for PyTorch's standard distributed data parallelism, useful for understanding the baseline memory replication problem.
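To make the ZeRO configuration and offloading features referenced in the DeepSpeed documentation entry above concrete, here is a minimal sketch of enabling ZeRO stage 3 with CPU offloading of optimizer state and parameters. The configuration keys follow DeepSpeed's documented JSON schema; the model, batch size, and learning rate are illustrative placeholders, not values taken from the sources above.

```python
# Minimal sketch (assumed setup): ZeRO stage 3 with CPU offloading via
# DeepSpeed's config schema. Model and hyperparameters are placeholders.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # placeholder batch size
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,  # partition optimizer state, gradients, and parameters
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
}

model = torch.nn.Linear(4096, 4096)  # stand-in for a real model

# deepspeed.initialize wraps the model in an engine that applies the
# ZeRO partitioning and offloading described in the config.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

Run under the `deepspeed` launcher, each rank holds only a shard of the parameters and optimizer state, with the offloaded portions resident in pinned CPU memory rather than replicated on every GPU as in the DistributedDataParallel baseline.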