Pre-Normalization vs Post-Normalization (Pre-LN vs Post-LN)
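In a Post-LN block, as in the original Transformer, layer normalization is applied after the residual addition; in a Pre-LN block it is applied to the input of each sublayer, leaving an identity skip path that tends to make very deep models easier to train. The sketch below illustrates the two orderings, assuming PyTorch-style modules; the class names and the feed-forward sublayer are illustrative, not taken from either paper.

```python
# Minimal sketch of the two residual-block orderings.
# PostLNBlock / PreLNBlock and d_model are illustrative names, not from the source.
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Post-LN (original Transformer): normalize AFTER the residual addition."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x -> sublayer -> add residual -> LayerNorm
        return self.norm(x + self.sublayer(x))


class PreLNBlock(nn.Module):
    """Pre-LN: normalize BEFORE the sublayer; the skip path stays unnormalized."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x -> LayerNorm -> sublayer -> add residual (identity skip path)
        return x + self.sublayer(self.norm(x))


if __name__ == "__main__":
    d_model = 16
    # A toy feed-forward sublayer standing in for attention or an FFN.
    ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                        nn.Linear(4 * d_model, d_model))
    x = torch.randn(2, 8, d_model)  # (batch, sequence, d_model)
    print(PostLNBlock(d_model, ffn)(x).shape)
    print(PreLNBlock(d_model, ffn)(x).shape)
```

The only difference between the two classes is where `self.norm` sits relative to the residual addition, which is exactly the Pre-LN vs Post-LN distinction.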
Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). Advances in Neural Information Processing Systems 30 (NIPS 2017). DOI: 10.48550/arXiv.1706.03762. Introduces the original Transformer architecture, which uses Post-Normalization.
A Survey of Transformer Architectures and Applications. Hafiz Tayyab, Muhammad Umair Khan, Asif Ali Laghari, Abdullah Khan (2022). IEEE Access, Vol. 10 (IEEE). DOI: 10.1109/ACCESS.2022.3195092. Provides a broad overview of Transformer architectural variants, including normalization placement and its impact on training large models.