Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.). DOI: 10.48550/arXiv.1706.03762 - The foundational paper proposing the Transformer architecture, which greatly accelerated the growth of large language models and the training challenges that came with them.
Mixed-Precision Training, Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu, 2017. ICLR 2018. DOI: 10.48550/arXiv.1710.03740 - Describes the mixed-precision training technique, which reduces memory footprint and speeds up computation by using reduced-precision floating-point formats such as FP16.
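Two ideas from the mixed-precision training paper above can be illustrated in a few lines: FP16 tensors occupy half the memory of FP32, and loss scaling keeps small gradients from underflowing to zero in FP16. This is a minimal NumPy sketch of those two effects, not the paper's full training recipe (which also keeps an FP32 master copy of the weights); the array sizes and the scale factor 1024 are illustrative choices.

```python
import numpy as np

# Idea 1: FP16 storage is half the size of FP32 storage.
weights_fp32 = np.zeros((1024, 1024), dtype=np.float32)
weights_fp16 = weights_fp32.astype(np.float16)
assert weights_fp32.nbytes == 2 * weights_fp16.nbytes

# Idea 2: loss scaling. A tiny gradient underflows to zero in FP16,
# because it is below the smallest FP16 subnormal (~6e-8).
grad = np.float16(1e-8)
print(grad)  # 0.0 -- the gradient information is lost

# Scaling the loss (and hence all gradients) by a constant before the
# backward pass shifts small values into FP16's representable range.
scale = 1024.0
scaled_grad = np.float16(1e-8 * scale)
print(scaled_grad)  # nonzero: the gradient survives in FP16

# The update is then unscaled in higher precision before applying it.
unscaled_grad = np.float32(scaled_grad) / scale
```

In practice this is what loss-scaling utilities in deep-learning frameworks automate: scale before backpropagation, check the FP16 gradients for overflow, then unscale before the optimizer step.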