Decoupled Weight Decay Regularization, Ilya Loshchilov, Frank Hutter, 2019 (International Conference on Learning Representations, ICLR), DOI: 10.48550/arXiv.1711.05101 - Proposes AdamW, a variant of Adam that correctly decouples weight decay from the gradient update, leading to better generalization.
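As a quick illustration of the decoupling idea this entry describes (a sketch, not the paper's reference implementation), the snippet below applies weight decay directly to the parameters alongside the adaptive step, rather than folding an L2 penalty into the gradient before Adam's moment estimates. The function name `adamw_step` and its default hyperparameters are illustrative.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update: a sketch of decoupled weight decay.

    Adam with classic L2 regularization would instead add
    weight_decay * theta to grad, letting the penalty pass through
    the moment estimates m and v; AdamW keeps it out of both.
    """
    # Standard Adam moment estimates, computed on the raw gradient only.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)

    # Adaptive gradient step plus a separate, decoupled decay term.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

Because the decay term never enters m or v, the effective regularization strength no longer depends on the per-parameter adaptive scaling, which is the source of the improved generalization the paper reports.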
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - A comprehensive textbook covering foundational deep learning optimization techniques, including optimizers, learning rate strategies, and gradient clipping.
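Of the techniques this textbook covers, gradient clipping is compact enough to sketch here; the snippet below shows the common clip-by-global-norm variant, where the gradient is rescaled whenever its overall L2 norm exceeds a threshold. The function name and the `max_norm` default are illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm
    does not exceed max_norm (a common gradient-clipping variant)."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```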