Decoupled Weight Decay Regularization, Ilya Loshchilov, Frank Hutter, 2019, International Conference on Learning Representations (ICLR 2019), DOI: 10.48550/arXiv.1711.05101 - This paper introduces AdamW, a modification to Adam that decouples weight decay from the gradient-based update, improving regularization.
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - A foundational textbook offering a comprehensive overview of deep learning, including detailed explanations of optimization algorithms.