Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.). DOI: 10.48550/arXiv.1706.03762 - Foundational paper introducing the Transformer architecture, which has significantly contributed to the growth of large language models and their associated training challenges.
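To make the cited mechanism concrete, below is a minimal sketch of the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The tensor shapes, batch/head sizes, and PyTorch usage are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_k) -- shapes are illustrative assumptions
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / (d_k ** 0.5)  # query-key similarities, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)               # attention distribution over key positions
    return weights @ v                                # weighted sum of value vectors

# Example usage with random tensors (batch=2, heads=8, seq_len=16, d_k=64)
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)
out = scaled_dot_product_attention(q, k, v)  # shape (2, 8, 16, 64)
```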
Mixed-Precision Training, Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu, 2017. ICLR 2018. DOI: 10.48550/arXiv.1710.03740 - Describes mixed-precision training techniques, which reduce memory footprint and speed up computation by using lower-precision floating-point formats like FP16.
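The sketch below illustrates the idea of FP16 computation with loss scaling using PyTorch's AMP utilities rather than the paper's own implementation; the model, optimizer, data, and loop length are placeholder assumptions.

```python
import torch

model = torch.nn.Linear(512, 10).cuda()                 # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                    # maintains the dynamic loss scale

for _ in range(100):                                    # placeholder training loop
    x = torch.randn(32, 512, device="cuda")             # placeholder batch
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                     # runs eligible ops in FP16, others in FP32
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                       # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                              # unscales gradients; skips the step on inf/NaN
    scaler.update()                                     # adjusts the loss scale for the next iteration
```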