Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017, NeurIPS. DOI: 10.48550/arXiv.1706.03762 - This foundational paper introduces the Transformer architecture, detailing its encoder-decoder stack, multi-head attention, residual connections, and layer normalization.
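As a minimal sketch of the scaled dot-product attention at the core of the paper's multi-head attention (Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V), the following NumPy snippet is illustrative only; the function name, shapes, and toy data are assumptions, not the reference implementation.

```python
# Sketch of scaled dot-product attention from the Transformer paper (illustrative, not the authors' code).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Shapes are illustrative assumptions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # pairwise similarity, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax over keys
    return weights @ V                                     # weighted sum of value vectors

# Toy usage with random queries, keys, and values.
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
out = scaled_dot_product_attention(Q, K, V)                # shape (4, 16)
```

Multi-head attention in the paper runs several such attention operations in parallel on learned linear projections of Q, K, and V, then concatenates and projects the results.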
Layer Normalization, Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, 2016, arXiv. DOI: 10.48550/arXiv.1607.06450 - This paper introduces Layer Normalization, a technique that stabilizes and accelerates the training of deep neural networks by normalizing activations across the features of each layer.
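A minimal sketch of the normalization step the paper describes, assuming per-feature scale (gamma) and shift (beta) parameters; the function and parameter names are illustrative, not the authors' code.

```python
# Sketch of layer normalization: normalize each input vector over its feature
# dimension, then apply a learned scale (gamma) and shift (beta). Illustrative only.
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (..., features); gamma, beta: (features,). Names are illustrative assumptions."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance per example
    return gamma * x_hat + beta                # learned affine transform

# Toy usage on a batch of two 6-dimensional activations.
x = np.random.default_rng(1).normal(size=(2, 6))
out = layer_norm(x, gamma=np.ones(6), beta=np.zeros(6))
```

Unlike batch normalization, these statistics are computed per example rather than per mini-batch, which is why the Transformer can apply it independently at every position in a sequence.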