Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NeurIPS 2017). DOI: 10.5555/3295222.3295349 - The foundational paper introducing the Transformer architecture, including its encoder and decoder layers, multi-head attention, residual connections, and layer normalization (a minimal sketch of its scaled dot-product attention follows this list).
Layer Normalization, Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, 2016. arXiv preprint arXiv:1607.06450. DOI: 10.48550/arXiv.1607.06450 - Introduces layer normalization, a technique used in Transformer layers to stabilize training (see the sketch after this list).
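
For orientation on the first reference, here is a minimal NumPy sketch of the scaled dot-product attention that Vaswani et al. (2017) define as softmax(QK^T / sqrt(d_k))V. The function name, toy shapes, and random inputs are illustrative assumptions, not taken from the paper, and the single-head version shown here omits the projections that multi-head attention adds.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    # Numerically stable softmax over the key dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Illustrative example: 3 query positions attending over 3 key/value
# positions, with d_k = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```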
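And for the second reference, a minimal sketch of layer normalization as Ba et al. (2016) define it: each input vector is normalized by the mean and variance of its own features, then scaled and shifted by learned parameters (gamma, beta). The function name and the epsilon value are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each example over its feature dimension (last axis),
    # then apply a learned scale (gamma) and shift (beta).
    # eps is a small constant (assumed value) to avoid division by zero.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Illustrative example: a batch of 2 token vectors with 4 features each.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 2.0, 2.0, 2.0]])
gamma, beta = np.ones(4), np.zeros(4)
print(layer_norm(x, gamma, beta))
```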