Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS) 30 (Curran Associates, Inc.). DOI: 10.5555/3295222.3295349 - Introduces the Transformer architecture and the scaled dot-product attention, multi-head attention, and self-attention mechanisms; the origin of many modern attention models.
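For quick reference, the paper defines scaled dot-product attention and its multi-head extension as:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})
```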
Neural Machine Translation by Jointly Learning to Align and Translate, Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, 2014. International Conference on Learning Representations (ICLR) 2015. DOI: 10.48550/arXiv.1409.0473 - Presents the additive attention mechanism, often referred to as Bahdanau attention, which was a significant advancement in sequence-to-sequence models.
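For reference, the additive score and the resulting context vector take the following form (notation follows the paper: $s_{i-1}$ is the previous decoder state, $h_j$ the $j$-th encoder annotation):

```latex
e_{ij} = v_a^{\top} \tanh\!\left(W_a s_{i-1} + U_a h_j\right), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \qquad
c_i = \sum_{j} \alpha_{ij}\, h_j
```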
tf.keras.layers.MultiHeadAttention, TensorFlow Developers, 2024 - Official Keras documentation for the MultiHeadAttention layer, offering details on its parameters and usage in TensorFlow.
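A minimal usage sketch of the layer, assuming TensorFlow 2.x; the tensor shapes are illustrative only:

```python
import tensorflow as tf

# Illustrative shapes: batch of 2, sequence length 5, model dimension 64.
query = tf.random.normal((2, 5, 64))
value = tf.random.normal((2, 5, 64))

# num_heads and key_dim are the layer's two required arguments.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

# When `key` is omitted, the layer reuses `value` as the keys.
output, scores = mha(query, value, return_attention_scores=True)
print(output.shape)   # (2, 5, 64)
print(scores.shape)   # (2, 8, 5, 5) -- per-head attention weights
```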
Attention Mechanisms, Aston Zhang, Zachary C. Lipton, Mu Li, Alex Smola, 2023. In Dive into Deep Learning (Cambridge University Press) - A chapter from a widely used online textbook providing a pedagogical explanation of various attention mechanisms, including scaled dot-product and additive attention.
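As a companion to that chapter, here is a minimal NumPy sketch of the two scoring functions it compares; this is a self-contained illustration rather than code from the textbook, and all variable names are my own:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Dot-product scores scaled by sqrt(d_k), as in Vaswani et al. (2017).
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

def additive_attention(q, H, W_q, W_h, v):
    # Bahdanau-style scores: v^T tanh(W_q q + W_h h_j) for each row h_j of H.
    scores = np.tanh(q @ W_q + H @ W_h) @ v   # shape (n,)
    weights = softmax(scores)
    return weights @ H                        # context vector over annotations

# Toy check with random inputs.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
q, H = rng.normal(size=4), rng.normal(size=(5, 6))
W_q, W_h, v = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=8)
print(additive_attention(q, H, W_q, W_h, v).shape)  # (6,)
```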