Understanding the difficulty of training deep feedforward neural networks, Xavier Glorot, Yoshua Bengio, 2010, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 9 - Introduces the Xavier (Glorot) initialization method, which aims to maintain activation and gradient variance across layers, particularly useful for symmetric activation functions like tanh.
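For concreteness, here is a minimal PyTorch sketch of the Glorot uniform rule described in that paper: weights are drawn from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)). The layer sizes below are arbitrary placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

# Arbitrary example sizes (not from the paper).
fan_in, fan_out = 512, 256
layer = nn.Linear(fan_in, fan_out)

# Manual Glorot uniform bound: a = sqrt(6 / (fan_in + fan_out)).
# Sampling from U(-a, a) keeps activation and gradient variance
# roughly constant across layers for symmetric activations like tanh.
bound = (6.0 / (fan_in + fan_out)) ** 0.5
with torch.no_grad():
    layer.weight.uniform_(-bound, bound)

# Equivalent built-in helper:
nn.init.xavier_uniform_(layer.weight)
```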
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems (NeurIPS), Vol. 30 - The foundational paper introducing the Transformer architecture, included here for its use of Glorot uniform initialization for weight matrices, a detail relevant to Transformer implementations.
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016, MIT Press - A comprehensive textbook covering fundamental concepts in deep learning, including detailed theoretical and practical treatment of weight initialization strategies.
Transformers Library Documentation: PreTrainedModel, Hugging Face, 2024 - Documents the default weight-initialization parameters, such as initializer_range, used by the Hugging Face Transformers library; relevant for real-world implementations.
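As a rough illustration of how initializer_range is typically applied, the sketch below assumes a BERT-style config, where the default is 0.02 and linear weights are drawn from N(0, initializer_range^2); the exact _init_weights logic in PreTrainedModel subclasses varies by model, so this mirrors the common pattern rather than any single model's implementation.

```python
import torch.nn as nn
from transformers import BertConfig

# BERT-family configs carry initializer_range (0.02 by default),
# used as the standard deviation of a zero-mean normal distribution.
config = BertConfig()
print(config.initializer_range)  # 0.02

# Sketch of the common pattern: normal(0, initializer_range) weights,
# zero biases. Details differ across model classes.
linear = nn.Linear(config.hidden_size, config.hidden_size)
nn.init.normal_(linear.weight, mean=0.0, std=config.initializer_range)
nn.init.zeros_(linear.bias)
```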