The Adam optimizer is one of the most widely used and effective optimization algorithms in deep learning. It incorporates principles of adaptive learning rates and momentum. Adam, which stands for "Adaptive Moment Estimation," computes adaptive learning rates for each parameter. This is achieved by storing exponentially decaying averages of both past squared gradients (a technique often used for adaptive learning rates) and past gradients themselves (a method for incorporating momentum).
Think of Adam as combining the benefits of RMSprop (handling non-stationary objectives and per-parameter learning rates) with those of Momentum (helping accelerate progress along consistent gradient directions and dampening oscillations).
Adam maintains two main internal state variables for each parameter being optimized. These are essentially moving averages:

- The first moment estimate $m_t$: an exponentially decaying average of past gradients, analogous to momentum.
- The second moment estimate $v_t$: an exponentially decaying average of past squared gradients, analogous to the scaling term in RMSprop.
Let $g_t$ be the gradient of the objective function with respect to the parameters at timestep $t$. The updates for the moving averages $m_t$ and $v_t$ are calculated as follows:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
Here:

- $\beta_1$ is the exponential decay rate for the first moment estimate (commonly around 0.9).
- $\beta_2$ is the exponential decay rate for the second moment estimate (commonly around 0.999).
- $m_t$ and $v_t$ are the first and second moment estimates at timestep $t$, initialized as $m_0 = v_0 = 0$.
- $g_t^2$ denotes the element-wise square of the gradient.
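As a concrete illustration, here is a minimal sketch of these two updates applied to a small NumPy array. The gradient values and the decay rates $\beta_1 = 0.9$, $\beta_2 = 0.999$ are placeholder choices for the example, not values prescribed by this section.

```python
import numpy as np

beta1, beta2 = 0.9, 0.999          # assumed decay rates (common defaults)

# Both moment estimates start at zero, matching Adam's initialization.
m = np.zeros(3)                    # first moment: running average of gradients
v = np.zeros(3)                    # second moment: running average of squared gradients

g = np.array([0.1, -0.2, 0.05])    # hypothetical gradient at the current timestep

# Exponentially decaying averages of the gradient and its element-wise square.
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g**2
```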
A potential issue arises because $m_t$ and $v_t$ are initialized to zero. Especially during the initial timesteps, and when $\beta_1$ and $\beta_2$ are close to 1, the moment estimates $m_t$ and $v_t$ will be biased towards zero. Adam incorporates a bias-correction step to counteract this initialization bias, which is particularly important early in training.
The bias-corrected estimates, $\hat{m}_t$ and $\hat{v}_t$, are calculated as:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Notice how the denominators $1 - \beta_1^t$ and $1 - \beta_2^t$ approach 1 as the timestep $t$ increases. This means the bias correction has a larger effect initially and gradually diminishes over time.
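To see this numerically, the short snippet below prints the two denominators for a few timesteps, assuming the commonly used defaults $\beta_1 = 0.9$ and $\beta_2 = 0.999$:

```python
# How quickly the bias-correction denominators approach 1, assuming common defaults.
beta1, beta2 = 0.9, 0.999

for t in (1, 10, 100, 1000, 10000):
    print(f"t={t:5d}  1-beta1^t={1 - beta1**t:.4f}  1-beta2^t={1 - beta2**t:.4f}")
```

At $t = 1$ the denominators are 0.1 and 0.001, so the division strongly inflates the zero-initialized estimates; as $t$ grows they approach 1 and the correction fades, more slowly for $v_t$ because $\beta_2$ is closer to 1.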
Finally, the parameters are updated using the bias-corrected moment estimates. The update rule resembles that of RMSprop, but uses the bias-corrected first moment estimate $\hat{m}_t$ instead of the raw gradient $g_t$:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
Where:

- $\theta_t$ are the parameters at timestep $t$.
- $\eta$ is the base learning rate (step size).
- $\epsilon$ is a small constant (e.g., $10^{-8}$) added for numerical stability, preventing division by zero.
Effectively, Adam calculates an update based on the momentum-like term $\hat{m}_t$, but scales the learning rate individually for each parameter using the information from the squared gradients stored in $\hat{v}_t$. Parameters with larger recent gradients (higher variance) will have their effective learning rate reduced, while parameters with smaller recent gradients (lower variance) will have their effective learning rate increased, relative to the base learning rate $\eta$.
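Putting the pieces together, below is a minimal single-step Adam sketch for a NumPy parameter array. The function name `adam_step`, the toy objective, and the hyperparameter defaults are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient g.

    m and v are the running moment estimates carried between calls,
    and t is the 1-based timestep used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * g            # first moment (momentum-like term)
    v = beta2 * v + (1 - beta2) * g**2         # second moment (per-parameter scale)
    m_hat = m / (1 - beta1**t)                 # bias-corrected estimates
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage sketch: minimize the toy objective f(theta) = sum(theta**2).
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 301):
    grad = 2 * theta                           # gradient of the toy objective
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                                   # both entries move toward the minimum at 0
```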
Adam often performs well in practice across a wide range of deep learning tasks. Its main advantages include:

- Per-parameter adaptive learning rates, which reduce the amount of manual learning-rate tuning required.
- The combined benefits of momentum and RMSprop-style scaling, as described above.
- Default hyperparameters that work reasonably well across many problems.
- Low computational and memory overhead: only two extra values are stored per parameter.
However, some studies suggest that in certain situations, Adam might converge to less optimal solutions compared to finely tuned SGD with Momentum, although it often converges faster initially. It's still a very strong baseline and often the default choice for many practitioners.
Adam represents a significant step in the development of optimization algorithms, providing an effective method for training deep neural networks by adapting the learning process based on the history of gradients.