Decoupled Weight Decay Regularization, Ilya Loshchilov, Frank Hutter, 2019. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1711.05101 - Introduces AdamW, demonstrating that decoupling weight decay from adaptive gradient updates improves generalization for adaptive optimizers.
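A minimal sketch of the decoupled update the paper describes, assuming plain NumPy and illustrative parameter names (param, grad, m, v, lr, wd); the key point is that the weight-decay term multiplies the parameter directly instead of being folded into the gradient that feeds the adaptive moments:

```python
import numpy as np

def adamw_step(param, grad, m, v, step, lr=1e-3,
               beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW-style update (sketch): decay is applied to the weights,
    not mixed into the gradient as plain L2 regularization would be."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** step)             # bias correction
    v_hat = v / (1 - beta2 ** step)
    adaptive = m_hat / (np.sqrt(v_hat) + eps)   # Adam direction
    param = param - lr * (adaptive + wd * param)  # decoupled weight decay
    return param, m, v
```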
Large Batch Optimization for Deep Learning, Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh, 2020. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1904.00962 - Proposes LAMB, an optimizer designed for stable and efficient training of deep learning models with extremely large batch sizes.
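A hedged sketch of LAMB's layer-wise trust-ratio scaling on top of an Adam-style direction; variable names are illustrative, and clipping and normalization details vary between implementations:

```python
import numpy as np

def lamb_step(param, grad, m, v, step, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-6, wd=0.01):
    """One LAMB-style update (sketch): the per-layer update is rescaled by the
    ratio of the parameter norm to the update norm before being applied."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    update = m_hat / (np.sqrt(v_hat) + eps) + wd * param  # Adam direction + decay
    w_norm = np.linalg.norm(param)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    param = param - lr * trust_ratio * update             # layer-wise scaled step
    return param, m, v
```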
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS), Vol. 30. DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture and its associated learning rate schedule (warm-up followed by inverse square root decay), widely adopted for large language models.
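A short sketch of the schedule mentioned above, following the formula lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5) from the paper; the default values for d_model and warmup_steps here are the ones commonly quoted and are assumptions, not requirements:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate rises linearly for warmup_steps, then decays as 1/sqrt(step)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: peak is reached at step == warmup_steps, then the rate falls off.
rates = [transformer_lr(s) for s in (1, 4000, 16000)]
```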