Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - Provides a comprehensive theoretical and practical treatment of L1 and L2 regularization within deep learning models, explaining their role in combating overfitting and improving generalization.
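To make the role of these penalties concrete, here is a minimal PyTorch sketch (PyTorch is not part of the book; the model, data shapes, and the coefficients `l1_lambda` and `l2_lambda` are illustrative) of adding L1 and L2 terms to a training loss:

```python
import torch
import torch.nn as nn

# Toy model and data; shapes chosen purely for illustration.
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

criterion = nn.MSELoss()
l1_lambda, l2_lambda = 1e-4, 1e-4  # illustrative penalty strengths

loss = criterion(model(x), y)
for param in model.parameters():
    loss = loss + l1_lambda * param.abs().sum()   # L1 penalty encourages sparse weights
    loss = loss + l2_lambda * param.pow(2).sum()  # L2 penalty shrinks weights toward zero
loss.backward()  # gradients now include both regularization terms
```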
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, and Jerome Friedman, 2009 (Springer) - A classic text that details the mathematical foundations of L1 (Lasso) and L2 (Ridge) regularization, explaining their statistical properties and connection to linear models.
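The statistical contrast the book develops can be seen in a small scikit-learn sketch (scikit-learn is not part of the text; the synthetic data and `alpha` values are illustrative): Lasso drives most coefficients exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data; only the first two features are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: many coefficients become exactly zero
ridge = Ridge(alpha=0.1).fit(X, y)  # L2: coefficients shrink but stay nonzero

print("nonzero Lasso coefficients:", np.sum(lasso.coef_ != 0))
print("nonzero Ridge coefficients:", np.sum(ridge.coef_ != 0))
```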
Decoupled Weight Decay Regularization, Ilya Loshchilov and Frank Hutter, 2019 (International Conference on Learning Representations, ICLR 2019), DOI: 10.48550/arXiv.1711.05101 - Introduces AdamW, a variant of the Adam optimizer that decouples weight decay from the gradient-based update. The paper shows that L2 regularization and weight decay are not equivalent for adaptive optimizers, a distinction that matters for practical deep learning training.
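PyTorch ships an implementation of the paper's method as `torch.optim.AdamW`; the sketch below contrasts it with the coupled L2-style decay in plain `Adam` (the model and hyperparameter values are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# Adam with weight_decay folds an L2 term into the gradient, so the decay is
# rescaled by the adaptive per-parameter step sizes (the coupling the paper
# criticizes).
coupled = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AdamW applies weight decay directly to the weights, separately from the
# adaptive gradient update, as the paper proposes.
decoupled = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```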