Adam: A Method for Stochastic Optimization. Diederik P. Kingma, Jimmy Ba, 2015. International Conference on Learning Representations (ICLR 2015). DOI: 10.48550/arXiv.1412.6980 - Introduces the Adam optimizer, which is widely used in deep learning and is discussed in this section for the memory footprint of its optimizer states.
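For quick reference, a compact restatement of the Adam update from the cited paper (notation follows the paper; this summary is not taken from the surrounding text). The per-parameter first- and second-moment estimates m_t and v_t are the optimizer states that account for the extra memory noted above.

```latex
% Adam update for parameters \theta with gradient g_t (Kingma & Ba, Algorithm 1).
% m_t and v_t are persistent per-parameter states: two extra tensors the size of
% the model, which is the memory footprint referred to in the annotation.
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2} \\
\hat{m}_t &= m_t / (1-\beta_1^{t}), \qquad \hat{v}_t = v_t / (1-\beta_2^{t}) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t \big/ \!\left(\sqrt{\hat{v}_t} + \epsilon\right)
\end{aligned}
```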
Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Curran Associates, Inc. DOI: 10.48550/arXiv.1706.03762 - The foundational paper for the Transformer architecture, which forms the basis of modern LLMs and whose attention mechanism contributes significantly to their computational and memory challenges.
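The cost referred to above comes largely from scaled dot-product attention, restated here from the cited paper for convenience; the observation about quadratic scaling in sequence length is a standard one, not a claim made in this document.

```latex
% Scaled dot-product attention (Vaswani et al., Eq. 1). For sequence length n and
% key dimension d_k, the QK^T term is an n x n matrix, so compute and activation
% memory grow roughly as O(n^2) in the sequence length.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```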
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Mohammad Shoeybi, Mostafa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019. arXiv preprint arXiv:1909.08053. DOI: 10.48550/arXiv.1909.08053 - Presents a framework and techniques for efficiently training very large transformer-based language models using intra-layer model parallelism, directly addressing the memory and compute challenges discussed in this section.
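As a sketch of the paper's intra-layer (tensor) parallelism, here is its MLP partitioning, restated from the paper rather than from the surrounding text: the first weight matrix is split by columns and the second by rows, so each device holds only a fraction of the weights and a single all-reduce recombines the output.

```latex
% Megatron-style partitioning of an MLP block Z = GeLU(X A) B across two devices.
% A is split column-wise and B row-wise; Z is recovered with one all-reduce of
% the partial products Y_1 B_1 and Y_2 B_2.
A = [\,A_1 \;\; A_2\,], \qquad Y_i = \mathrm{GeLU}(X A_i), \qquad
B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}, \qquad Z = Y_1 B_1 + Y_2 B_2
```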
Scaling Laws for Neural Language Models. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei, 2020. arXiv preprint arXiv:2001.08361. DOI: 10.48550/arXiv.2001.08361 - Investigates empirical scaling laws for language models, demonstrating how model size, dataset size, and compute budget govern performance and why training strong models demands massive computation.
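The fits in the paper take a power-law form, summarized here without the fitted constants (the exact exponents and scales are reported in the paper itself):

```latex
% Test loss as a power law in model size N, dataset size D, and compute C
% (Kaplan et al.); N_c, D_c, C_c and the exponents are empirical constants.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```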