Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30. DOI: 10.48550/arXiv.1706.03762 - The original paper introducing the Transformer architecture, detailing the encoder-decoder structure, multi-head attention, position-wise feed-forward networks, residual connections, and layer normalization.
On Layer Normalization in the Transformer Architecture, Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu, 2020. International Conference on Machine Learning (ICML). DOI: 10.48550/arXiv.2002.04745 - This paper analyzes the two common placements of layer normalization in the Transformer architecture, Post-LN and Pre-LN, and discusses their impact on training stability and performance.
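The Post-LN/Pre-LN distinction referenced above comes down to where layer normalization sits relative to the residual connection around each sublayer. The following is a minimal sketch, not code from either paper: the function and variable names (`layer_norm`, `post_ln_block`, `pre_ln_block`, `sublayer`) are illustrative, and a simple linear map stands in for the attention or feed-forward sublayer.

```python
# Minimal sketch contrasting the two residual-sublayer orderings.
# Assumptions: names are illustrative; `sublayer` is a toy stand-in for
# multi-head attention or the position-wise feed-forward network.
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN (original Transformer): normalize *after* the residual addition.
    # x <- LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path stays unnormalized.
    # x <- x + Sublayer(LayerNorm(x))
    return x + sublayer(layer_norm(x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 4, 8))      # (batch, sequence length, model dim)
    w = rng.normal(size=(8, 8)) * 0.1
    sublayer = lambda h: h @ w          # toy linear sublayer
    print("Post-LN output std:", post_ln_block(x, sublayer).std())
    print("Pre-LN  output std:", pre_ln_block(x, sublayer).std())
```

In the Post-LN ordering every residual sum passes through a normalization, whereas in the Pre-LN ordering the residual path is an uninterrupted identity, which is the structural difference Xiong et al. connect to training stability.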