BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2019. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (Association for Computational Linguistics). DOI: 10.18653/v1/N19-1423 - A seminal paper introducing the BERT model and popularizing the pre-train and fine-tune paradigm for transformer-based language models, directly illustrating the full fine-tuning approach.
Decoupled Weight Decay Regularization, Ilya Loshchilov, Frank Hutter, 2018. International Conference on Learning Representations. DOI: 10.48550/arXiv.1711.05101 - This paper introduces the AdamW optimizer and explains its mechanism and benefits, particularly for transformer models, as discussed in this section.
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - A comprehensive textbook covering fundamental principles of deep learning, including transfer learning, optimization algorithms like SGD and Adam, and the backpropagation mechanism.