ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He, 2020. SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (IEEE). DOI: 10.1109/SC45798.2020.00078 - Describes the Zero Redundancy Optimizer (ZeRO), which partitions model states (parameters, gradients, and optimizer states) across devices to reduce the memory footprint of training large models.
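The paper's authors released their implementation in Microsoft's DeepSpeed library, where the three ZeRO stages (1: optimizer states, 2: + gradients, 3: + parameters) are selected via a config dict. A minimal sketch of how that maps onto code, assuming a DeepSpeed install and a `deepspeed train.py` launch; the model and hyperparameters below are placeholders, not values from the paper:

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in model for illustration

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    # stage 1 shards optimizer states; stage 2 adds gradients; stage 3 adds parameters
    "zero_optimization": {"stage": 2},
}

# deepspeed.initialize wraps the model in an engine that owns the partitioned
# states and issues the collectives needed to reassemble them on demand.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(8, 1024, device=model_engine.device)
loss = model_engine(x).pow(2).mean()
model_engine.backward(loss)  # engine-managed backward handles gradient partitioning
model_engine.step()
```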
DistributedDataParallel (DDP), PyTorch Contributors, 2024 (PyTorch documentation) - Official documentation for DistributedDataParallel, PyTorch's core module for data-parallel training.
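For orientation, a minimal training sketch against the documented DDP API. It assumes a `torchrun --nproc_per_node=N train.py` launch (which sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`); the model and hyperparameters are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Each process holds a full model replica; DDP all-reduces gradients during
# backward so the replicas stay in sync after every optimizer step.
dist.init_process_group(backend="nccl")  # use "gloo" on CPU-only hosts
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).to(local_rank)  # stand-in model
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device=local_rank)
loss = ddp_model(x).pow(2).mean()
loss.backward()  # gradient all-reduce overlaps with the backward pass
optimizer.step()
optimizer.zero_grad()

dist.destroy_process_group()
```

Note the contrast with ZeRO above: plain DDP replicates all model states on every device, which is what ZeRO's partitioning is designed to avoid.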