Standard gradient descent methods often rely on a fixed learning rate, η. Finding an effective η requires careful tuning: a rate that is too small slows convergence, while one that is too large can cause the updates to oscillate or diverge. This chapter introduces algorithms that address this problem by automatically adjusting the learning rate during training.
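For reference, the plain gradient descent update with a fixed learning rate applies the same step size to every parameter at every iteration:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)$$

The adaptive methods in this chapter replace the single constant η with an effective rate that varies per parameter and per step, derived from the gradient history.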
We will examine several widely used adaptive learning rate methods. You will learn the mechanisms behind AdaGrad, which scales each parameter's learning rate based on its accumulated past gradients, and RMSprop, which modifies AdaGrad's accumulation to keep those rates from decaying toward zero. We will then study Adam (Adaptive Moment Estimation), an optimizer that combines adaptive rates with momentum through estimates of the first and second moments of the gradients. We will also analyze the Adamax, Nadam, and AMSGrad variants and the specific modifications each one introduces. The chapter also covers how adaptive methods can be combined with learning rate schedules.
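As a preview of the mechanism covered in Section 3.2, the sketch below illustrates the core AdaGrad idea on a toy two-parameter quadratic loss: a running sum of squared gradients gives each parameter its own effective step size. The toy loss, learning rate, and variable names are illustrative assumptions, not a reference implementation.

```python
import numpy as np

# Minimal sketch of AdaGrad's per-parameter scaling on a toy quadratic loss.
# The loss f(theta) = 0.5 * sum(A * theta**2) is badly scaled, so the two
# coordinates would need very different fixed learning rates.

def grad_fn(theta):
    A = np.array([10.0, 0.1])   # curvature differs by a factor of 100
    return A * theta

theta = np.array([1.0, 1.0])
accum = np.zeros_like(theta)    # running sum of squared gradients
lr, eps = 0.5, 1e-8             # illustrative hyperparameter choices

for step in range(100):
    g = grad_fn(theta)
    accum += g ** 2                            # accumulate squared gradients
    theta -= lr * g / (np.sqrt(accum) + eps)   # per-parameter effective rate

print(theta)  # both coordinates approach 0 despite the 100x scale difference
```

Even though the two coordinates have gradients that differ by a factor of 100, the per-parameter scaling gives them comparable effective steps. Sections 3.2 and 3.3 examine why the accumulated sum also makes those steps shrink over time, and how RMSprop counteracts this.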
By the end of this chapter, you will understand the theory and practical application of these adaptive techniques, allowing you to choose and utilize them for more efficient model training.
3.1 Limitations of Fixed Learning Rates
3.2 AdaGrad: Adapting Rates Based on Past Gradients
3.3 RMSprop: Addressing AdaGrad's Diminishing Rates
3.4 Adam: Combining Momentum and RMSprop
3.5 Adamax and Nadam Variants
3.6 AMSGrad: Improving Adam's Convergence
3.7 Understanding Learning Rate Schedules
3.8 Hands-on Practical: Comparing Adaptive Optimizers