Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (Curran Associates, Inc.). DOI: 10.55988/neurips-2017-1011 - This paper introduces the Transformer architecture and the scaled dot-product attention mechanism, gives the original formulation, and discusses the benefits of parallelization.
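Dive into Deep Learning, Aston Zhang, Zack C. Lipton, Mu Li, Alex J. Smola, et al., 2023 (Cambridge University Press) - An online textbook that explains deep learning models and provides practical implementations, including chapters on attention mechanisms and their matrix operations.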