GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen, 2019, Advances in Neural Information Processing Systems, Vol. 32 (NeurIPS) - Introduces pipeline parallelism with micro-batching: each mini-batch is split into smaller micro-batches so that model partitions on different accelerators can work concurrently, improving hardware utilization in large model training.
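The heart of GPipe's schedule is the micro-batch split. A minimal single-process sketch of that split, with two toy linear layers standing in for model partitions (the stage names and sizes are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

# Two sequential "stages" standing in for model partitions that GPipe
# would place on different accelerators.
stage1 = nn.Linear(16, 32)
stage2 = nn.Linear(32, 8)

def pipelined_forward(x, num_microbatches=4):
    # GPipe's key idea: split the mini-batch into micro-batches. With
    # stages on separate devices, stage1 could start micro-batch i+1
    # while stage2 is still processing micro-batch i; this CPU loop
    # only demonstrates the split itself, not the concurrent execution.
    outputs = []
    for mb in x.chunk(num_microbatches, dim=0):
        outputs.append(stage2(torch.relu(stage1(mb))))
    return torch.cat(outputs, dim=0)

x = torch.randn(32, 16)
print(pipelined_forward(x).shape)  # torch.Size([32, 8])
```

In an actual pipeline the overlap across devices is what recovers utilization; more micro-batches shrink the idle "bubble" at the start and end of each pipeline flush.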
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He, 2020, SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (IEEE), DOI: 10.1109/SC41405.2020.00024 - Describes the ZeRO memory optimization strategy, which partitions optimizer states, gradients, and parameters across data-parallel workers; it is central to scaling in DeepSpeed and is often combined with model parallelism.
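A hedged sketch of how ZeRO is typically enabled through DeepSpeed: the config below requests ZeRO stage 2 (sharding optimizer states and gradients across data-parallel ranks). The batch size, optimizer choice, and learning rate are arbitrary illustrations, not values from the paper.

```python
import torch.nn as nn
import deepspeed

# Illustrative DeepSpeed config enabling ZeRO stage 2; all numeric values
# here are assumptions chosen for the example.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},
}

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))

# deepspeed.initialize returns an engine that applies the ZeRO partitioning
# during training. Scripts like this are normally launched with the
# `deepspeed` CLI (one process per GPU), which sets up the distributed
# environment the engine expects.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```

Stage 1 shards only optimizer states, stage 2 adds gradients, and stage 3 additionally shards the parameters themselves, trading communication for memory at each step.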
PyTorch FSDP: Fully Sharded Data Parallel, PyTorch Documentation, 2022 (PyTorch Foundation) - Official documentation detailing the implementation and usage of Fully Sharded Data Parallel (FSDP) in PyTorch for large model training.
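A minimal sketch of wrapping a model with FSDP as the documentation describes; the single-process gloo group here just makes the example self-contained on CPU, whereas real training launches one process per GPU (e.g. via torchrun) over NCCL.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Single-process group for illustration only; a real job would be
# launched with torchrun and use the "nccl" backend on GPUs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
fsdp_model = FSDP(model)  # parameters are sharded across the process group

optim = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-3)
loss = fsdp_model(torch.randn(4, 16)).sum()
loss.backward()  # FSDP all-gathers shards for compute, then reduce-scatters grads
optim.step()

dist.destroy_process_group()
```

With more than one rank, each process holds only its shard of the parameters, gradients, and optimizer state between layers' forward/backward passes, which is what lets FSDP fit models far larger than plain data parallelism allows.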