Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - This book provides a comprehensive theoretical foundation for deep learning, covering concepts such as neural networks, backpropagation, optimization algorithms, and loss functions, which are fundamental to understanding the training loop.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017, Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.) - Introduces the Transformer architecture, which forms the basis of Large Language Models. Understanding this architecture helps grasp how LLMs process input and generate logits within the forward pass.
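For concreteness, the sketch below shows where those logits surface in practice during a Transformer forward pass. It uses the Hugging Face transformers library (cited later in this list); the gpt2 checkpoint and the example sentence are illustrative assumptions, not part of the cited paper.

```python
# Minimal sketch: run a causal-LM forward pass and inspect the logits.
# The "gpt2" checkpoint and example text are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The training loop updates the model", return_tensors="pt")
with torch.no_grad():        # inference only; no gradients tracked
    outputs = model(**inputs)

# logits has shape (batch_size, sequence_length, vocab_size);
# each position holds unnormalized scores over the vocabulary.
print(outputs.logits.shape)
```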
Decoupled Weight Decay Regularization, Ilya Loshchilov and Frank Hutter, 2019, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1711.05101 - Presents AdamW, the optimization algorithm named in this section's optimizer step and a standard choice for training transformer models.
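As a quick illustration of how AdamW typically enters the optimizer step, the sketch below instantiates torch.optim.AdamW; the stand-in model and hyperparameter values are placeholder assumptions, not recommendations from the paper.

```python
# Sketch: configuring AdamW for a model's parameters.
# The stand-in module and hyperparameter values are illustrative placeholders.
import torch

model = torch.nn.Linear(768, 768)   # stand-in for a transformer module
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-5,                        # learning rate
    weight_decay=0.01,              # decoupled weight decay, the paper's key change
    betas=(0.9, 0.999),
)
```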
PyTorch Documentation: Tutorials, Autograd, and Optim, PyTorch Core Developers, Ongoing (PyTorch Foundation) - The official resource for practical implementation of deep learning training loops using PyTorch, including details on automatic differentiation, optimizers, and data handling.
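To connect those documentation topics to the training loop itself, here is a minimal sketch of one pass over the data in plain PyTorch; the model, synthetic data, and loss function are toy assumptions chosen only to keep the example self-contained.

```python
# Minimal sketch of a PyTorch training loop: forward, loss, backward, optimizer step.
# Model, data, and loss are toy assumptions to keep the example self-contained.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

for batch_inputs, batch_labels in loader:
    optimizer.zero_grad()                  # clear gradients from the previous step
    logits = model(batch_inputs)           # forward pass
    loss = loss_fn(logits, batch_labels)   # compute the loss
    loss.backward()                        # autograd computes gradients
    optimizer.step()                       # optimizer updates the parameters
```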
Hugging Face Transformers: Trainer Class Documentation, Hugging Face, Ongoing (Hugging Face) - Describes the Trainer class, a high-level API for fine-tuning Transformer models, which abstracts many details of the training loop for efficient LLM adaptation.
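To show how the Trainer class abstracts the loop sketched above, the following is a minimal fine-tuning sketch; the gpt2 checkpoint, the tiny in-memory text dataset, and the argument values are assumptions made purely for illustration.

```python
# Sketch: fine-tuning a causal LM with the high-level Trainer API.
# The checkpoint, toy dataset, and argument values are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token      # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

texts = ["The training loop updates the weights.", "Logits are turned into a loss."]
train_dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=32),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="./finetune-out",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    # mlm=False makes the collator build labels for causal (next-token) LM training
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```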