Adam (Adaptive Moment Estimation) is an optimizer that blends the advantages of momentum with the adaptive per-parameter scaling characteristic of methods like RMSprop. Its update mechanism maintains two separate exponentially decaying moving averages: one of past gradients (the first moment estimate) and one of past squared gradients (the second moment estimate).
At each timestep $t$, after computing the gradient $g_t$ of the loss with respect to the parameters $\theta$, Adam updates these two moving averages:
First Moment Estimate (Mean): This is similar to the momentum term. It accumulates an exponentially decaying average of past gradients:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

Here, $m_t$ is the first moment vector, $g_t$ is the gradient at the current timestep, and $\beta_1$ is the exponential decay rate for the first moment estimates (typically close to 1, e.g., 0.9). $m_0$ is initialized as a vector of zeros.
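The following NumPy sketch shows this update in isolation; the array size and gradient values are made up purely for illustration.

```python
import numpy as np

beta1 = 0.9
m = np.zeros(3)                    # first moment estimate, initialized to zeros
g_t = np.array([0.2, -0.5, 1.0])   # hypothetical gradient at the current timestep

# Exponentially decaying average of past gradients
m = beta1 * m + (1 - beta1) * g_t
print(m)                           # [ 0.02 -0.05  0.1 ]
```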
Second Moment Estimate (Uncentered Variance): This is similar to the accumulated squared-gradient term used in RMSprop. It accumulates an exponentially decaying average of past squared gradients:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

Here, $v_t$ is the second moment vector, $g_t^2$ represents the element-wise square of the gradient vector, and $\beta_2$ is the exponential decay rate for the second moment estimates (often set higher, e.g., 0.999). $v_0$ is also initialized as a vector of zeros.
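The corresponding sketch for the second moment is nearly identical, except that the gradient is squared element-wise (again, the values are illustrative):

```python
import numpy as np

beta2 = 0.999
v = np.zeros(3)                    # second moment estimate, initialized to zeros
g_t = np.array([0.2, -0.5, 1.0])   # hypothetical gradient at the current timestep

# Exponentially decaying average of past squared gradients
v = beta2 * v + (1 - beta2) * g_t**2
print(v)                           # [4.0e-05 2.5e-04 1.0e-03]
```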
The terms $m_t$ and $v_t$ are estimates of the mean and the uncentered variance of the gradients, respectively. The hyperparameters $\beta_1$ and $\beta_2$ control the decay rates of these moving averages. Values closer to 1 mean that past gradients have a longer influence.
A potential issue arises because $m_t$ and $v_t$ are initialized to zero vectors. Especially during the initial timesteps of training, when $t$ is small, these estimates are biased towards zero. Imagine the first update with $\beta_1 = 0.9$ and $m_0 = 0$; then $m_1 = 0.9 \cdot 0 + 0.1 \cdot g_1 = 0.1\, g_1$. This is significantly smaller than the actual gradient.
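A few lines of NumPy make this bias concrete; the gradient values are toy numbers chosen to make the effect obvious:

```python
import numpy as np

beta1 = 0.9
m = np.zeros(3)                  # m_0 = 0
g_1 = np.array([1.0, 1.0, 1.0])  # hypothetical first gradient

m = beta1 * m + (1 - beta1) * g_1
print(m)                         # [0.1 0.1 0.1], only 10% of the true gradient magnitude
```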
Adam addresses this initialization bias by computing bias-corrected first and second moment estimates, denoted $\hat{m}_t$ and $\hat{v}_t$:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Notice the terms $1 - \beta_1^t$ and $1 - \beta_2^t$ in the denominators. At the beginning of training (small $t$), $\beta_1^t$ and $\beta_2^t$ are close to 1, making the denominators small. This division effectively counteracts the initial zero bias. As training progresses and $t$ increases, $\beta_1^t$ and $\beta_2^t$ approach zero (since $\beta_1, \beta_2 < 1$), so the correction terms $1 - \beta_1^t$ and $1 - \beta_2^t$ approach 1, and the bias correction has less effect. This ensures that the estimates are more accurate throughout the training process.
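Continuing the toy example, the correction at $t = 1$ recovers the original gradient scale (the moment values below are assumed to come from a single update with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $g_1 = 1$):

```python
import numpy as np

beta1, beta2, t = 0.9, 0.999, 1
m = np.array([0.1, 0.1, 0.1])        # biased first moment after one update
v = np.array([0.001, 0.001, 0.001])  # biased second moment after one update

m_hat = m / (1 - beta1**t)           # divide by 0.1 at t = 1
v_hat = v / (1 - beta2**t)           # divide by 0.001 at t = 1
print(m_hat, v_hat)                  # [1. 1. 1.] [1. 1. 1.], back on the scale of g_1
```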
Finally, Adam uses these bias-corrected estimates to update the model parameters $\theta$. The update rule looks quite similar to RMSprop but uses the corrected momentum estimate $\hat{m}_t$ instead of the raw gradient $g_t$:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
Here:
- $\theta_t$ are the parameters at timestep $t$.
- $\eta$ is the learning rate.
- $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates.
- $\epsilon$ is a small constant (e.g., $10^{-8}$) added for numerical stability to prevent division by zero.
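Written out in NumPy, the parameter update is a single vectorized line; all values below are placeholders:

```python
import numpy as np

eta, eps = 0.001, 1e-8
theta = np.array([0.5, -0.3, 0.8])   # current parameters (toy values)
m_hat = np.array([1.0, -0.5, 0.2])   # bias-corrected first moment (toy values)
v_hat = np.array([1.0, 0.25, 0.04])  # bias-corrected second moment (toy values)

# Step size is scaled per parameter by the square root of the second moment
theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)
```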
To summarize, the Adam update process at each timestep $t$ involves these steps:
1. Compute the gradient $g_t$ of the loss with respect to the current parameters.
2. Update the first moment estimate: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$.
3. Update the second moment estimate: $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$.
4. Compute the bias-corrected estimates $\hat{m}_t$ and $\hat{v}_t$.
5. Update the parameters: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$.
The authors of the original Adam paper suggest default values of $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The learning rate $\eta$ (often suggested around $0.001$) remains a hyperparameter that usually requires tuning.
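Putting the pieces together, here is a minimal, self-contained sketch of one Adam step in NumPy using the default hyperparameters above. It is meant to mirror the equations, not to replace a framework optimizer such as those in PyTorch or TensorFlow.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad      # first moment estimate
    v = beta2 * v + (1 - beta2) * grad**2   # second moment estimate
    m_hat = m / (1 - beta1**t)              # bias-corrected first moment
    v_hat = v / (1 - beta2**t)              # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(theta) = sum(theta**2), whose gradient is 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):    # note: t starts at 1 so the bias correction is defined
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                # parameters shrink steadily toward the minimum at zero
```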
This step-by-step process, combining adaptive scaling based on the second moment and momentum based on the first moment, along with the important bias correction step, makes Adam a widely used and often effective default optimizer for deep learning models.