Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems 30 (NIPS 2017), DOI: 10.48550/arXiv.1706.03762 - Presents the Transformer architecture and explains the structure of the multi-head attention mechanism, including the concatenation of head outputs and the final linear projection.
torch.nn.Linear, PyTorch Development Team, 2024, PyTorch Documentation (PyTorch Foundation) - Describes the linear transformation layer used for both the input projections ($W^Q$, $W^K$, $W^V$) and the output projection ($W^O$) in the multi-head attention module.
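To show how the two references fit together, the sketch below is a minimal multi-head attention module that realizes the $W^Q$, $W^K$, $W^V$, and $W^O$ projections as `torch.nn.Linear` layers, then concatenates the per-head outputs and applies the final linear projection, as described in the paper. The class name `MultiHeadAttention` is illustrative; the defaults `d_model=512` and `num_heads=8` match the paper's base configuration, and masking and dropout are omitted for brevity.

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Sketch of multi-head attention (Vaswani et al., 2017, Section 3.2.2)."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # per-head dimension (d_k = d_v = d_model / h)

        # Input projections W^Q, W^K, W^V and output projection W^O,
        # each implemented as a torch.nn.Linear layer.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        # query, key, value: (batch, seq_len, d_model)
        batch, seq_len, d_model = query.shape

        # Project inputs and split into heads: (batch, heads, seq_len, d_k)
        def split(x):
            return x.view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)

        q = split(self.w_q(query))
        k = split(self.w_k(key))
        v = split(self.w_v(value))

        # Scaled dot-product attention per head: softmax(QK^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        heads = scores.softmax(dim=-1) @ v  # (batch, heads, seq_len, d_k)

        # Concatenate head outputs and apply the final linear projection W^O.
        concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.w_o(concat)


# Quick self-attention check: output shape matches the input shape.
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)
out = mha(x, x, x)  # (2, 10, 512)
```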