ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He, 2020. SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis - Introduces the Zero Redundancy Optimizer (ZeRO), which shards optimizer states, gradients, and optionally parameters across data-parallel workers to significantly reduce the memory footprint of distributed training. A minimal sketch of the stage-1 idea follows.
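The sketch below is a single-process Python illustration of optimizer-state sharding in the spirit of ZeRO stage 1, not the DeepSpeed implementation: the ranks are simulated in a loop, the "all-gather" is a plain concatenation, and the Adam update omits bias correction for brevity. All names (WORLD_SIZE, m_shards, etc.) are hypothetical.

```python
# Illustrative sketch: each simulated rank keeps Adam moments only for its
# own 1/WORLD_SIZE slice of the parameters, updates that slice, and the full
# parameter vector is reassembled afterwards (the "all-gather" step).
import numpy as np

WORLD_SIZE = 4          # simulated data-parallel ranks
DIM = 16                # parameter count, divisible by WORLD_SIZE for simplicity
LR, BETA1, BETA2, EPS = 1e-2, 0.9, 0.999, 1e-8

rng = np.random.default_rng(0)
params = rng.normal(size=DIM)   # full parameters, replicated on every rank
grads = rng.normal(size=DIM)    # gradients, already averaged across ranks

shard = DIM // WORLD_SIZE
m_shards = [np.zeros(shard) for _ in range(WORLD_SIZE)]  # first-moment shards
v_shards = [np.zeros(shard) for _ in range(WORLD_SIZE)]  # second-moment shards

updated_shards = []
for rank in range(WORLD_SIZE):
    sl = slice(rank * shard, (rank + 1) * shard)
    g = grads[sl]
    m_shards[rank] = BETA1 * m_shards[rank] + (1 - BETA1) * g
    v_shards[rank] = BETA2 * v_shards[rank] + (1 - BETA2) * g * g
    step = LR * m_shards[rank] / (np.sqrt(v_shards[rank]) + EPS)
    updated_shards.append(params[sl] - step)

# Simulated all-gather: every rank ends up with the full updated parameters,
# while having stored only 1/WORLD_SIZE of the optimizer state.
params = np.concatenate(updated_shards)
print(params.shape)  # (16,)
```

The memory saving comes from the moment buffers: each rank holds DIM / WORLD_SIZE entries per buffer instead of DIM, which is where ZeRO-1's roughly linear reduction in optimizer-state memory originates.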
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen, 2019. Advances in Neural Information Processing Systems, Vol. 32 (NeurIPS) - Presents GPipe, an inter-layer (pipeline) model-parallelism method that splits each mini-batch into micro-batches so that pipeline stages work on different micro-batches concurrently, shrinking idle "bubble" time and improving device utilization. A schedule sketch follows.
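The following Python sketch illustrates the fill-drain micro-batch schedule that GPipe-style pipelining uses, under assumed values for the number of stages and micro-batches; it only prints which stage processes which micro-batch at each clock tick and the resulting bubble fraction, and does not partition an actual model.

```python
# Illustrative fill-drain schedule: stage s starts micro-batch m at tick s + m,
# so the forward sweep over one mini-batch takes STAGES + MICRO_BATCHES - 1 ticks.
STAGES = 4         # pipeline stages (devices), assumed for illustration
MICRO_BATCHES = 8  # micro-batches per mini-batch, assumed for illustration

total_ticks = STAGES + MICRO_BATCHES - 1
for t in range(total_ticks):
    active = [f"stage{s}:mb{t - s}"
              for s in range(STAGES) if 0 <= t - s < MICRO_BATCHES]
    print(f"tick {t:2d}: " + ", ".join(active))

# The pipeline "bubble" (idle) fraction shrinks as the micro-batch count grows:
bubble = (STAGES - 1) / (MICRO_BATCHES + STAGES - 1)
print(f"bubble fraction ~= {bubble:.2f}")  # 3/11 ~= 0.27 here
```

The bubble-fraction formula shows why GPipe relies on many micro-batches per mini-batch: with MICRO_BATCHES much larger than STAGES, the idle fraction approaches zero and the stages stay nearly fully utilized.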