Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. 2017. arXiv preprint arXiv:1706.02677. DOI: 10.48550/arXiv.1706.02677 - This paper demonstrates how to scale deep learning training to large minibatch sizes across many GPUs, using a linear learning-rate scaling rule with gradual warmup and efficient gradient aggregation (AllReduce); a sketch of the scaling rule follows below.
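A minimal sketch of the linear scaling rule with gradual warmup in PyTorch. The reference configuration (lr 0.1 at batch size 256) is from the paper; the toy model, random data, and `warmup_iters` value are illustrative assumptions, not the paper's setup.

```python
import torch

base_lr, base_batch = 0.1, 256
batch_size = 8192                                # large minibatch across GPUs
scaled_lr = base_lr * batch_size / base_batch    # linear scaling rule

model = torch.nn.Linear(10, 1)                   # toy model for illustration
opt = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)

warmup_iters = 500  # ramp the lr linearly from base_lr up to scaled_lr
for it in range(warmup_iters):
    for g in opt.param_groups:
        g["lr"] = base_lr + (scaled_lr - base_lr) * it / warmup_iters
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # stand-in data
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    # In multi-GPU training, gradients would be averaged across workers
    # at this point via an AllReduce (e.g. torch.distributed.all_reduce,
    # or implicitly by DistributedDataParallel).
    opt.step()
```

The warmup phase exists because the linearly scaled learning rate is too aggressive for the randomly initialized network; ramping up over the first few epochs avoids early divergence.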
DeepSpeed: Extreme-scale distributed training for DL models. Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He. 2021. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). DOI: 10.5555/3472061.3472251 - Presents a system for training deep learning models at extreme scale, including techniques such as pipeline parallelism and ZeRO memory partitioning, which build on PyTorch primitives; a usage sketch follows below.
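A minimal sketch of wrapping a PyTorch model with the DeepSpeed engine, using the public `deepspeed.initialize` API. The config values and toy model below are illustrative assumptions, not taken from the paper, and a real run would launch under DeepSpeed's distributed launcher.

```python
import torch
import deepspeed

model = torch.nn.Linear(10, 1)  # toy model for illustration
ds_config = {
    "train_batch_size": 32,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    # ZeRO stage 2 shards optimizer state and gradients across workers,
    # cutting per-GPU memory relative to plain data parallelism.
    "zero_optimization": {"stage": 2},
}

# DeepSpeed returns an engine that owns the forward/backward/step cycle.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x, y = torch.randn(4, 10), torch.randn(4, 1)     # stand-in data
loss = torch.nn.functional.mse_loss(engine(x), y)
engine.backward(loss)   # handles gradient partitioning and AllReduce
engine.step()
```

Note that the training loop calls `engine.backward` and `engine.step` rather than the usual optimizer methods; this is what lets DeepSpeed interpose its ZeRO partitioning and communication scheduling on top of standard PyTorch modules.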