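Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.1706.03762 - This paper introduces the Transformer architecture and its learning-rate schedule, which includes a linear warmup phase. It had a profound influence on the widespread adoption of warmup strategies in complex deep learning models.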