Techniques for Training Stability (Gradient Clipping, EMA)
On the difficulty of training recurrent neural networks, Razvan Pascanu, Tomas Mikolov, Yoshua Bengio, 2013. Proceedings of the 30th International Conference on Machine Learning (ICML). DOI: 10.48550/arXiv.1211.5063 - This foundational paper introduced gradient clipping to address exploding gradients, a technique now widely used to stabilize training in deep neural networks.
Denoising Diffusion Probabilistic Models, Jonathan Ho, Ajay Jain, Pieter Abbeel, 2020. Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. DOI: 10.48550/arXiv.2006.11239 - A seminal paper that introduced Denoising Diffusion Probabilistic Models (DDPMs) and highlighted the practical benefit of keeping an Exponential Moving Average (EMA) of model weights for high-quality sample generation (see the sketch following this list).
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016. MIT Press - An authoritative textbook covering fundamental concepts in deep learning, including detailed explanations of optimization challenges like exploding gradients and general strategies for stable training.
torch.nn.utils.clip_grad_norm_, PyTorch Contributors, 2024 - The official PyTorch documentation for gradient clipping by norm, providing practical usage and parameters for implementing this training stability technique.
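To tie the two techniques together, here is a minimal PyTorch sketch of how gradient clipping and a weight EMA are commonly combined in one training loop. The torch.nn.utils.clip_grad_norm_ call matches the PyTorch documentation cited above; the WeightEMA helper, the toy model and data, and the hyperparameters (max_norm=1.0, decay=0.999) are illustrative assumptions, not the exact setup used in the cited papers.

```python
# Minimal sketch: gradient clipping + weight EMA in a PyTorch training loop.
# The model, data, and hyperparameters are placeholders for illustration.
import copy
import torch
import torch.nn as nn

class WeightEMA:
    """Keeps an exponential moving average of a model's parameters (illustrative helper)."""
    def __init__(self, model: nn.Module, decay: float = 0.999):
        self.decay = decay
        # Shadow copy that accumulates the averaged weights; never trained directly.
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: nn.Module):
        # shadow = decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ema = WeightEMA(model, decay=0.999)
loss_fn = nn.MSELoss()

for step in range(100):
    x = torch.randn(32, 16)   # placeholder batch
    y = torch.randn(32, 1)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Rescale gradients so their global L2 norm is at most 1.0 (guards against exploding gradients).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    # Update the EMA copy after each optimizer step.
    ema.update(model)
```

In this pattern the EMA copy (ema.shadow here) rather than the raw model is typically used for evaluation or sampling, which is the practice the DDPM paper reports as improving sample quality.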