Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - This book provides foundational knowledge on neural network training, including numerical stability, optimization, and issues like vanishing/exploding gradients.
On the Difficulty of Training Recurrent Neural Networks, Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, 2013, International Conference on Machine Learning (ICML), Vol. 28 (PMLR), DOI: 10.55982/pascanu13 - This paper analyzes the vanishing/exploding gradient problem and proposes gradient clipping, a technique now widely used to prevent training instability (a minimal clipping sketch follows this reference list).
Mixed Precision Training, Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu, 2018, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1710.03740 - This paper introduces techniques for mixed-precision training, including loss scaling and FP32 master weights, which are important for managing numerical stability and preventing NaN loss in large models.
Automatic Mixed Precision (AMP), PyTorch Developers, Accessed 2024 (PyTorch Documentation) - Provides official guidance and best practices for using mixed precision in PyTorch, covering gradient scaling to prevent numerical issues and NaN loss (a minimal AMP sketch follows this reference list).
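To make the clipping technique from Pascanu et al. concrete, here is a minimal sketch of gradient norm clipping in a PyTorch training step; the model, dummy batch, and max_norm value are illustrative assumptions, not anything prescribed by the paper.

```python
# Illustrative sketch of gradient norm clipping in a PyTorch training step.
# The model, dummy batch, and max_norm=1.0 are assumptions for demonstration.
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
x, y = torch.randn(32, 16), torch.randn(32, 1)            # dummy batch

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Rescale all gradients so their global L2 norm does not exceed max_norm,
# keeping a single exploding gradient from destabilizing the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```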
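Similarly, the gradient scaling described in the PyTorch AMP documentation can be sketched as follows; the model, optimizer, and batch are illustrative, and the scaler is only enabled when a CUDA device is available.

```python
# Minimal sketch of mixed-precision training with gradient (loss) scaling,
# based on the PyTorch AMP docs; model, data, and hyperparameters are assumed.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Linear(16, 1).to(device)                       # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(32, 16, device=device)
y = torch.randn(32, 1, device=device)

optimizer.zero_grad()
# autocast runs the forward pass in float16 where it is safe to do so and
# keeps numerically sensitive ops in float32.
with torch.autocast(device_type=device, enabled=use_amp):
    loss = loss_fn(model(x), y)

# Scale the loss before backward so small float16 gradients do not underflow
# to zero; the scaler unscales gradients before the optimizer step and skips
# the step if inf/NaN gradients are detected.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```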