Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Neural Information Processing Systems (NeurIPS)). DOI: 10.5555/3295222.3295232 - Introduces the Transformer architecture and its specific learning rate schedule, which includes a warm-up phase; the paper is foundational for large language models.