Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NIPS 2017). DOI: 10.48550/arXiv.1706.03762 - This foundational paper introduces the Transformer architecture and multi-head self-attention, detailing its mathematical formulation (restated in the sketch after this list) and the parallel processing it enables.
Dive into Deep Learning, Aston Zhang, Zachary C. Lipton, Mu Li, Alex Smola, 2024. Cambridge University Press. - An accessible online textbook chapter giving a clear, practical explanation of multi-head attention, including its parallel computation and implementation details.
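For quick reference, the formulation both entries describe is compact enough to restate here. The equations below follow Vaswani et al. (2017): scaled dot-product attention over queries Q, keys K, and values V with key dimension d_k, and h attention heads computed in parallel through learned projection matrices W_i^Q, W_i^K, W_i^V, and W^O.

    % Scaled dot-product attention (Vaswani et al., 2017, Eq. 1)
    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

    % Multi-head attention: h heads, each with its own learned projections,
    % concatenated and projected by W^O
    \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
    \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)

Because each head_i depends only on its own projections, all h heads can be evaluated simultaneously; this independence is the parallel-processing advantage both sources emphasize.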