Adam: A Method for Stochastic Optimization, Diederik P. Kingma, Jimmy Ba, 2015, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1412.6980 - Introduces the Adam optimizer, detailing its design and the first- and second-moment estimates it keeps per parameter, which explain its memory footprint.
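Since the annotation cites Adam's per-parameter moment estimates as the source of its memory footprint, a minimal NumPy sketch of the update rule (following Algorithm 1 of the paper; the function and variable names here are our own) makes that concrete: the buffers m and v must match the parameter shape, so Adam holds two extra values for every trained parameter.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, following Algorithm 1 of Kingma & Ba (2015)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias-correct both moments (t counts from 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v                        # m and v persist across steps

# m and v are allocated with theta's shape, which is where the overhead comes from:
theta = np.zeros(10)
m, v = np.zeros_like(theta), np.zeros_like(theta)
theta, m, v = adam_step(theta, grad=np.ones(10), m=m, v=v, t=1)
```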
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - A foundational textbook covering neural network training, backpropagation, and memory considerations for activations and parameters.
NVIDIA Deep Learning Performance Guide, NVIDIA Corporation, 2023 - Official guide detailing best practices for optimizing deep learning performance on NVIDIA GPUs, including memory management considerations.
ZeRO: Memory Optimizations Toward Training Trillion-Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He, 2020, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20), DOI: 10.1145/3410464.3410714 - Presents ZeRO, a family of memory optimizations essential for training models with billions of parameters, which partitions the model states (optimizer states, gradients, and parameters) across data-parallel GPUs instead of replicating them.
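Because the annotation turns on how ZeRO partitions model states, a back-of-the-envelope sketch of the paper's memory accounting helps: under mixed-precision Adam the paper counts 2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and K = 12 bytes of fp32 optimizer states per parameter. The function below is our own shorthand; stages 1-3 correspond to the paper's P_os, P_os+g, and P_os+g+p.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Per-GPU model-state memory (GB) for mixed-precision Adam,
    using the ZeRO paper's per-parameter byte counts."""
    params = 2 * num_params          # fp16 parameters
    grads = 2 * num_params           # fp16 gradients
    opt = 12 * num_params            # fp32 params + momentum + variance (K = 12)
    if stage == 0:                   # plain data parallelism: full replica on every GPU
        total = params + grads + opt
    elif stage == 1:                 # ZeRO-1 (P_os): shard optimizer states
        total = params + grads + opt / num_gpus
    elif stage == 2:                 # ZeRO-2 (P_os+g): also shard gradients
        total = params + (grads + opt) / num_gpus
    elif stage == 3:                 # ZeRO-3 (P_os+g+p): also shard parameters
        total = (params + grads + opt) / num_gpus
    else:
        raise ValueError("stage must be 0, 1, 2, or 3")
    return total / 1e9

# The paper's running example: a 7.5B-parameter model on 64 GPUs.
print(zero_model_state_gb(7.5e9, 64, stage=0))  # 120.0 GB per GPU -- exceeds any single GPU
print(zero_model_state_gb(7.5e9, 64, stage=3))  # 1.875 GB per GPU
```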