Training large language models presents unique optimization challenges due to their immense scale and computational requirements. Standard optimization techniques often need adaptation to work well at this scale. This chapter focuses on the optimization algorithms and strategies commonly used to train these models.
We will begin with a brief review of gradient descent methods before concentrating on adaptive optimizers like Adam and AdamW, explaining the rationale behind decoupled weight decay. You will learn to implement common learning rate scheduling strategies involving warmup and decay phases (e.g., ηt = schedule(step)). We will also cover gradient clipping, a technique often used to prevent exploding gradients and improve training stability, typically by rescaling any gradient whose norm exceeds a threshold c: g ← (c / ∥g∥) g if ∥g∥ > c. Finally, we'll offer practical guidance for selecting key optimizer hyperparameters, such as the learning rate η, Adam's momentum terms (β1, β2), the numerical stability term ϵ, and the weight decay coefficient λ.
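As a preview of how these pieces fit together, here is a minimal PyTorch sketch combining AdamW (decoupled weight decay), a learning rate schedule with warmup and decay phases, and gradient clipping by global norm. The specific hyperparameter values and the linear-warmup-plus-cosine-decay shape are illustrative assumptions, not the chapter's recommendations; later sections discuss how to choose them.

```python
import math
import torch

# Stand-in model; in practice this would be a transformer.
model = torch.nn.Linear(512, 512)

# AdamW applies weight decay directly to the parameters (decoupled from the
# adaptive gradient update). Values here are placeholders for illustration.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,             # peak learning rate η
    betas=(0.9, 0.95),   # momentum terms (β1, β2)
    eps=1e-8,            # numerical stability term ϵ
    weight_decay=0.1,    # weight decay coefficient λ
)

warmup_steps, total_steps = 100, 1000  # small values for a quick demo

def lr_schedule(step: int) -> float:
    """ηt = schedule(step): linear warmup, then cosine decay toward zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_schedule)

for step in range(total_steps):
    x = torch.randn(8, 512)              # dummy batch
    loss = model(x).pow(2).mean()        # dummy loss
    loss.backward()
    # Rescale the gradient to (c / ∥g∥) g whenever ∥g∥ > c, with c = 1.0 here.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```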
17.1 Review of Gradient Descent Variants (SGD, Momentum)
17.2 Adaptive Optimizers: Adam and AdamW
17.3 Learning Rate Scheduling Strategies
17.4 Gradient Clipping Techniques
17.5 Choosing Optimizer Hyperparameters (lr, betas, eps, weight_decay)