Layer Normalization. Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. 2016. Advances in Neural Information Processing Systems, Vol. 29 (Curran Associates, Inc.). DOI: 10.5555/3045607.3045657 - Introduces Layer Normalization as a method to stabilize hidden-state dynamics and accelerate training in deep neural networks, particularly recurrent networks.
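The core operation from Ba et al. can be sketched in a few lines: each sample is normalized across its feature dimension, then rescaled by learned gain and bias. This is a minimal NumPy illustration, not the paper's implementation; the parameter names `gamma`, `beta`, and `eps` are conventional choices.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each sample over its last (feature) axis,
    # unlike batch norm, which normalizes over the batch axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Learned per-feature gain and bias restore representational capacity.
    return gamma * x_hat + beta
```

Because the statistics are computed per sample, the operation is independent of batch size, which is what makes it well suited to recurrent networks.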
On Layer Normalization in the Transformer Architecture. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu. 2020. ICML 2020. DOI: 10.48550/arXiv.2002.04745 - Provides a comprehensive analysis of Layer Normalization's placement (Pre-LN vs. Post-LN) within the Transformer, demonstrating Pre-LN's superior training stability.
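The Pre-LN vs. Post-LN distinction analyzed by Xiong et al. comes down to where the normalization sits relative to the residual connection. A minimal sketch, with `sublayer` and `ln` as stand-in callables for any attention/feed-forward sublayer and layer-norm function (these names are illustrative, not from the paper):

```python
import numpy as np

def post_ln_block(x, sublayer, ln):
    # Post-LN (original Transformer): normalize AFTER the residual add.
    # The normalization sits on the residual path itself.
    return ln(x + sublayer(x))

def pre_ln_block(x, sublayer, ln):
    # Pre-LN: normalize only the sublayer input.
    # The residual path stays an identity, which keeps gradients
    # well-scaled at initialization and stabilizes training.
    return x + sublayer(ln(x))
```

In the Pre-LN form the input `x` flows to the output unchanged through the skip connection, which is the structural property the paper ties to stable training without a learning-rate warm-up stage.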