Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NIPS 2017). DOI: 10.48550/arXiv.1706.03762 - This foundational paper introduces the Transformer architecture and the multi-head self-attention mechanism, detailing their mathematical formulation and the advantages of parallel processing.
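Dive into Deep Learning, Aston Zhang, Zachary C. Lipton, Mu Li, Alex Smola, 2024 (Cambridge University Press) - An accessible online textbook chapter that explains multi-head attention clearly and practically, including its parallel computation and implementation details.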