Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, 2017 (Advances in Neural Information Processing Systems 30, NeurIPS 2017; DOI: 10.48550/arXiv.1706.03762) - Introduces the Transformer model and the multi-head attention mechanism, detailing the linear projections for the Q, K, and V vectors (see the sketch after this list).
Transformers and Pretrained Language Models (Lecture Notes, CS224N, Winter 2023), Abigail See, Chris Manning, and the Stanford CS224N Staff, 2023 (Stanford University) - Lecture notes from Stanford's course on deep learning for NLP that clearly explain the Transformer's components, including the role of the linear projections in multi-head attention.
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 (Stanford University) - Offers a comprehensive treatment of natural language processing, including a detailed chapter on the Transformer architecture, multi-head attention, and QKV projections.
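
To make the mechanism these sources describe concrete, here is a minimal NumPy sketch of the Q, K, and V linear projections and scaled dot-product attention from Vaswani et al. (2017). The dimension sizes, random weights, and function names are illustrative assumptions for this sketch, not values or code from any of the works above.

```python
# Minimal sketch of multi-head attention with Q/K/V linear projections,
# following the structure in Vaswani et al. (2017). All sizes and weights
# below are illustrative assumptions, not trained or published values.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_k = d_model // n_heads  # per-head dimension

    # Apply a linear projection, then split into heads: (n_heads, seq_len, d_k)
    def split(H):
        return H.reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)

    # Scaled dot-product attention, computed independently per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (n_heads, seq, seq)
    heads = softmax(scores) @ V                       # (n_heads, seq, d_k)

    # Concatenate the heads and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 64, 8, 10
X = rng.standard_normal((seq_len, d_model))
Ws = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads=n_heads)
print(out.shape)  # (10, 64)
```

Each head attends over its own d_k-dimensional projection of the input, and the final projection W_o mixes the concatenated heads back into d_model dimensions, matching the formulation in the Vaswani et al. paper.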