Building upon the ideas of adaptive learning rates seen in AdaGrad and RMSprop, and incorporating the concept of momentum, the Adam optimizer has become one of the most widely used and effective optimization algorithms in deep learning. Adam stands for "Adaptive Moment Estimation," and it computes adaptive learning rates for each parameter by storing exponentially decaying averages of past squared gradients (like RMSprop) and past gradients themselves (like momentum).
Think of Adam as combining the benefits of RMSprop (handling non-stationary objectives and per-parameter learning rates) with those of Momentum (helping accelerate progress along consistent gradient directions and dampening oscillations).
Adam maintains two key internal state variables for each parameter $\theta$ being optimized. These are essentially moving averages: the first moment estimate $m_t$, an exponentially decaying average of past gradients (analogous to momentum), and the second moment estimate $v_t$, an exponentially decaying average of past squared gradients (analogous to RMSprop's accumulator).
Let $g_t$ be the gradient of the objective function with respect to the parameters $\theta$ at timestep $t$. The updates for the moving averages $m_t$ and $v_t$ are calculated as follows:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
Here:
- $m_t$ is the first moment estimate (the decaying average of gradients) and $v_t$ is the second moment estimate (the decaying average of squared gradients).
- $\beta_1$ and $\beta_2$ are exponential decay rates, commonly set to 0.9 and 0.999 respectively.
- $g_t^2$ denotes the element-wise square of the gradient.
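As a concrete sketch of these two updates (using NumPy; the variable names and the gradient values are illustrative, not part of any particular library):

```python
import numpy as np

# Sketch of Adam's moving-average updates for a single timestep (not a full optimizer).
beta1, beta2 = 0.9, 0.999          # typical decay rates for the two moments

# State initialized to zero, one entry per parameter (here, 3 parameters).
m = np.zeros(3)                    # first moment: decaying average of gradients
v = np.zeros(3)                    # second moment: decaying average of squared gradients

grad = np.array([0.1, -0.5, 2.0])  # g_t: gradient at the current timestep (made-up values)

m = beta1 * m + (1 - beta1) * grad         # m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v = beta2 * v + (1 - beta2) * grad ** 2    # v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
```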
A potential issue arises because $m_0$ and $v_0$ are initialized to zero. Especially during the initial timesteps and when $\beta_1$ and $\beta_2$ are close to 1, the moment estimates $m_t$ and $v_t$ will be biased towards zero. Adam incorporates a bias-correction step to counteract this initialization bias, particularly important early in training.
The bias-corrected estimates, $\hat{m}_t$ and $\hat{v}_t$, are calculated as:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Notice how the denominators $(1 - \beta_1^t)$ and $(1 - \beta_2^t)$ approach 1 as the timestep $t$ increases. This means the bias correction has a larger effect initially and gradually diminishes over time.
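Continuing the earlier sketch, the bias correction is just a division by a factor that depends on the timestep counter `t` (starting at 1). The moment values below are illustrative, corresponding to what the previous sketch produces after one update:

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
t = 1                                   # timestep counter, starting at 1

# Moment estimates after the first update (illustrative values).
m = np.array([0.01, -0.05, 0.2])        # equals (1 - beta1) * g_1 at t = 1
v = np.array([1e-5, 2.5e-4, 4e-3])      # equals (1 - beta2) * g_1**2 at t = 1

# Bias-corrected estimates: the denominators are small at first
# (0.1 and 0.001 at t = 1), so the correction scales the moments up early on.
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)
```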
Finally, the parameters $\theta$ are updated using the bias-corrected moment estimates. The update rule resembles that of RMSprop, but uses the bias-corrected first moment estimate $\hat{m}_t$ instead of the raw gradient $g_t$:
$$\theta_{t+1} = \theta_t - \frac{\alpha \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
Where:
- $\alpha$ is the base learning rate (step size).
- $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates.
- $\epsilon$ is a small constant (e.g., $10^{-8}$) added for numerical stability, preventing division by zero.
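Putting the pieces together, a minimal single-array Adam step in NumPy might look like the following. The function name `adam_step`, the toy objective, and the learning rate used in the loop are illustrative choices; the default hyperparameters match the commonly cited values:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array `theta`. `t` must start at 1."""
    m = beta1 * m + (1 - beta1) * grad          # update biased first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # update biased second moment
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return theta, m, v

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -3.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 501):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)  # theta moves much closer to the minimum at [0, 0]
```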
Effectively, Adam calculates an update based on the momentum-like term $\hat{m}_t$, but scales the learning rate individually for each parameter using the information from the squared gradients stored in $\hat{v}_t$. Parameters with larger recent gradients (higher variance) will have their effective learning rate reduced, while parameters with smaller recent gradients (lower variance) will have their effective learning rate increased, relative to the base learning rate $\alpha$.
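This scaling can be seen with a small numerical sketch: for two parameters that consistently receive gradients of very different magnitudes, the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ ends up with roughly the same magnitude for both, so each parameter's step is governed mainly by $\alpha$ rather than by the raw gradient size. The gradient values below are made up for illustration:

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
m = np.zeros(2)
v = np.zeros(2)

# Parameter 0 keeps seeing large gradients, parameter 1 keeps seeing tiny ones.
grad = np.array([100.0, 0.001])

for t in range(1, 101):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

update_direction = m_hat / (np.sqrt(v_hat) + eps)
print(update_direction)  # both entries are close to 1.0, so the actual step
                         # taken for each parameter is roughly alpha in size
```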
Adam often performs well in practice across a wide range of deep learning tasks. Its main advantages include per-parameter adaptive learning rates combined with the acceleration benefits of momentum, computational efficiency with only modest memory overhead (two extra values per parameter), and default hyperparameters that work reasonably well on many problems with little tuning.
However, some studies suggest that in certain situations, Adam might converge to less optimal solutions compared to finely tuned SGD with Momentum, although it often converges faster initially. It's still a very strong baseline and often the default choice for many practitioners.
Adam represents a significant step in the development of optimization algorithms, providing a robust and generally effective method for training deep neural networks by adapting the learning process based on the history of gradients.