Large Batch Optimization for Deep Learning, Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh, 2020. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1904.00962 - Proposes LAMB, an optimizer designed to train deep learning models stably and efficiently at very large batch sizes.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS), Vol. 30. DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture and its associated learning rate schedule (a warmup phase followed by inverse square root decay), which has been widely adopted for large language models.