Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017, NeurIPS. DOI: 10.48550/arXiv.1706.03762 - The original paper introducing the Transformer model and defining multi-head attention as a central component.
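For reference, the paper defines scaled dot-product attention and multi-head attention as follows (using the paper's notation, with h heads and learned projection matrices W_i^Q, W_i^K, W_i^V, W^O):

```latex
\begin{align*}
\mathrm{Attention}(Q,K,V) &= \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \\
\mathrm{MultiHead}(Q,K,V) &= \mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)\,W^{O},
\quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\,KW_i^{K},\,VW_i^{V})
\end{align*}
```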
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 - Offers a detailed explanation of multi-head attention within the Transformer architecture, including its motivation and operation. Refer to the chapter on Transformers.
torch.nn.MultiheadAttention, PyTorch Development Team, 2024 (PyTorch Foundation) - Official documentation for the PyTorch implementation of multi-head attention, showing its parameters and use.
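A minimal usage sketch of this module in a self-attention setting; the dimensions (embed_dim=16, num_heads=4) and tensor shapes below are illustrative choices, not values taken from the cited documentation:

```python
import torch
import torch.nn as nn

# Illustrative sizes: 16-dimensional embeddings split across 4 heads.
embed_dim, num_heads = 16, 4
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# A batch of 2 sequences, each 5 tokens long, with batch_first=True
# so inputs are shaped (batch, seq_len, embed_dim).
x = torch.randn(2, 5, embed_dim)

# Self-attention: query, key, and value are all the same tensor.
attn_output, attn_weights = mha(x, x, x)

print(attn_output.shape)   # torch.Size([2, 5, 16])
print(attn_weights.shape)  # torch.Size([2, 5, 5]), averaged over heads by default
```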
Transformers and Large Language Models (CS224N Lecture Notes), Christopher Manning, Abigail See, John Hewitt, Misha Smelyanskiy, and others, 2023 - Stanford CS224N course material providing academic explanations of the Transformer architecture, with sections covering multi-head attention.