Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017Advances in Neural Information Processing Systems (NeurIPS), Vol. 30DOI: 10.48550/arXiv.1706.03762 - 介绍了Transformer架构,并展示了包含预热阶段的学习率调度策略的有效性。