ZeRO: Memory Optimizations Toward Training Trillion-Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He, 2020, SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (IEEE), DOI: 10.1109/SC41405.2020.00078 - Describes the ZeRO memory optimization strategy, essential for training large models: optimizer states, gradients, and parameters are sharded across devices so that per-device memory shrinks with the degree of data parallelism.
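As a rough illustration of how ZeRO is used in practice, the sketch below enables ZeRO stage 2 (optimizer-state and gradient partitioning) through DeepSpeed, the library the authors released alongside the paper. The config keys follow DeepSpeed's documented JSON schema; the model, batch size, and learning rate are placeholders, and the script would normally be launched with the `deepspeed` launcher on multiple GPUs.

```python
# Minimal sketch: enabling ZeRO stage 2 via DeepSpeed. Assumes `deepspeed`
# is installed and distributed launch is handled by the deepspeed launcher.
import deepspeed
import torch.nn as nn

ds_config = {
    "train_micro_batch_size_per_gpu": 8,          # placeholder batch size
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    # Stage 1 shards optimizer states, stage 2 additionally shards
    # gradients, and stage 3 also shards the parameters themselves.
    "zero_optimization": {"stage": 2},
}

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```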
Automatic Mixed Precision for Deep Learning, PyTorch Documentation, 2024 (PyTorch Foundation) - Provides guidelines and examples for using automatic mixed precision training in PyTorch, reducing memory footprint and speeding up computation.
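The documented usage pattern is short enough to reproduce; the training loop below follows the PyTorch docs' recommended combination of `autocast` (runs eligible ops in half precision) and `GradScaler` (scales the loss to avoid fp16 gradient underflow). The model, data, and hyperparameters are placeholders, and a CUDA device is assumed.

```python
# Typical PyTorch automatic mixed precision loop, per the official docs.
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16 stability

for _ in range(10):
    data = torch.randn(32, 512, device="cuda")
    target = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    # Ops inside autocast run in reduced precision where it is safe.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(data), target)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales gradients, then steps
    scaler.update()                 # adapts the scale factor over time
```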
Training Deep Nets with Sublinear Memory Cost, Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin, 2016, arXiv preprint arXiv:1604.06174, DOI: 10.48550/arXiv.1604.06174 - Describes how to reduce memory consumption during training by recomputing intermediate activations in the backward pass, a technique known as gradient checkpointing.
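PyTorch ships this technique in `torch.utils.checkpoint`; the sketch below checkpoints a stack of layers so that activations inside each segment are discarded on the forward pass and recomputed during backward, trading extra compute for memory. The toy model and segment count are placeholders.

```python
# Gradient checkpointing sketch using PyTorch's built-in utility.
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)])
x = torch.randn(16, 1024, requires_grad=True)

# Split the 8 layers into 2 segments: only segment-boundary activations
# are kept; everything inside a segment is recomputed in backward.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
```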
Training Generative Adversarial Networks with Limited Data, Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, Timo Aila, 2020, Advances in Neural Information Processing Systems (NeurIPS) 33 - Introduces adaptive discriminator augmentation (ADA), which significantly improves GAN training when data is limited, though training still requires substantial computational resources.
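The core of ADA is a feedback controller on the augmentation probability p: when the discriminator looks overfit on real images, p is increased, otherwise decreased. The sketch below is a minimal, hedged reading of that controller; the overfitting heuristic r_t = E[sign(D(x_real))] and the target value 0.6 follow the paper, while `AdaController`, `d_real_logits`, and the adjustment speed are illustrative placeholders rather than the authors' implementation.

```python
# Minimal sketch of an ADA-style augmentation-probability controller.
import torch

class AdaController:
    def __init__(self, target=0.6, speed_imgs=500_000, batch_size=64):
        self.p = 0.0                  # current augmentation probability
        self.target = target          # target value for the heuristic r_t
        # Step size chosen so p can traverse [0, 1] over `speed_imgs` images.
        self.step = batch_size / speed_imgs

    def update(self, d_real_logits: torch.Tensor) -> float:
        # Overfitting heuristic r_t = E[sign(D(x_real))]: near 1 means the
        # discriminator confidently separates reals, i.e. likely overfitting.
        r_t = torch.sign(d_real_logits).mean().item()
        self.p += self.step if r_t > self.target else -self.step
        self.p = min(max(self.p, 0.0), 1.0)
        return self.p  # probability with which each augmentation is applied
```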