Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30. DOI: 10.48550/arXiv.1706.03762 - This paper introduces the Transformer architecture and multi-head attention, showing how multiple heads let the model jointly attend to information from different representation subspaces at different positions (a minimal sketch of the mechanism follows these references).
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned, Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov, 2019. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics). DOI: 10.18653/v1/P19-1580 - This research empirically demonstrates that different attention heads specialize in distinct roles (e.g., positional heads, heads tracking syntactic dependencies), providing evidence that no single head captures all of these functions and motivating the use of multiple heads.
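To make the mechanism described in the first reference concrete, below is a minimal NumPy sketch of multi-head scaled dot-product attention in the spirit of Vaswani et al. (2017). The head count, dimensions, weight layout, and random initialization are illustrative assumptions for the sketch, not the paper's actual configuration or code.

```python
# Minimal multi-head scaled dot-product attention sketch (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # Project once, then split the model dimension into independent heads,
    # so each head attends over its own learned subspace in parallel.
    Q = (X @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention per head: softmax(Q K^T / sqrt(d_head)) V
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores) @ V                          # (heads, seq, d_head)

    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Toy usage with random weights (hypothetical sizes, for illustration).
rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 4, 2
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)  # (4, 8)
```

Because each head computes its own attention weights over a separate projection, the heads can attend to different positions or relations simultaneously, which is the property the second reference examines empirically.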