Having introduced Adam (Adaptive Moment Estimation) as an optimizer that combines the benefits of momentum with adaptive per-parameter scaling in the style of RMSprop, let's look closely at its update mechanism. Adam achieves this by maintaining two separate exponentially decaying moving averages: one of the past gradients (the first moment estimate) and another of the past squared gradients (the second moment estimate).
At each timestep $t$, after computing the gradient $g_t = \nabla_\theta J(\theta_{t-1})$ with respect to the parameters $\theta$, Adam updates these two moving averages:
First Moment Estimate (Mean): This is similar to the momentum term. It accumulates an exponentially decaying average of past gradients.
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
Here, $m_t$ is the first moment vector, $g_t$ is the gradient at the current timestep, and $\beta_1$ is the exponential decay rate for the first moment estimates (typically close to 1, e.g., 0.9). $m_0$ is initialized as a vector of zeros.
Second Moment Estimate (Uncentered Variance): This is similar to the term used in RMSprop. It accumulates an exponentially decaying average of past squared gradients.
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
Here, $v_t$ is the second moment vector, $g_t^2$ represents the element-wise square of the gradient vector, and $\beta_2$ is the exponential decay rate for the second moment estimates (often set higher, e.g., 0.999). $v_0$ is also initialized as a vector of zeros.
The terms $m_t$ and $v_t$ are estimates of the mean and the uncentered variance of the gradients, respectively. The hyperparameters $\beta_1$ and $\beta_2$ control the decay rates of these moving averages. Values closer to 1 mean that past gradients have a longer influence.
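The short sketch below illustrates how these two running averages can be maintained for a small parameter vector using NumPy. The gradient values and variable names are hypothetical; this is only a direct transcription of the two formulas above, not a full optimizer.

```python
import numpy as np

beta1, beta2 = 0.9, 0.999  # decay rates for the first and second moments

# Toy gradient for a 3-parameter model (hypothetical values)
g = np.array([0.2, -0.5, 0.1])

# Both moment vectors start at zero, as described in the text
m = np.zeros_like(g)
v = np.zeros_like(g)

# One update of the exponentially decaying averages
m = beta1 * m + (1 - beta1) * g      # first moment: running mean of gradients
v = beta2 * v + (1 - beta2) * g**2   # second moment: running mean of squared gradients

print(m)  # approximately [0.02, -0.05, 0.01] -> strongly biased toward zero at t = 1
print(v)  # approximately [4e-05, 2.5e-04, 1e-05]
```

Note how, after a single step from zero initialization, both averages are much smaller in magnitude than the gradient itself. This is exactly the initialization bias discussed next.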
A potential issue arises because $m_t$ and $v_t$ are initialized to zero vectors. During the initial timesteps of training, when $t$ is small, these estimates are therefore biased towards zero. For example, if $\beta_1 = 0.9$, then $m_1 = 0.1\, g_1$, which is significantly smaller in magnitude than the actual gradient.
Adam addresses this initialization bias by computing bias-corrected first and second moment estimates, denoted $\hat{m}_t$ and $\hat{v}_t$:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Notice the terms $(1 - \beta_1^t)$ and $(1 - \beta_2^t)$ in the denominators. At the beginning of training (small $t$), $\beta_1^t$ and $\beta_2^t$ are close to 1, making the denominators small. This division effectively counteracts the initial zero bias. As training progresses and $t$ increases, $\beta_1^t$ and $\beta_2^t$ approach zero (since $\beta_1, \beta_2 < 1$), so the correction terms $(1 - \beta_1^t)$ and $(1 - \beta_2^t)$ approach 1, and the bias correction has less effect. This keeps the estimates accurate throughout the training process.
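Continuing the toy example above (all values hypothetical), the snippet below shows how dividing by $(1 - \beta_1^t)$ and $(1 - \beta_2^t)$ recovers the scale of the gradient at the very first step:

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
g = np.array([0.2, -0.5, 0.1])

# First update starting from zero-initialized moments (t = 1)
m = (1 - beta1) * g        # = 0.1 * g, biased toward zero
v = (1 - beta2) * g**2     # = 0.001 * g**2, also biased toward zero

t = 1
m_hat = m / (1 - beta1**t)   # divides by 0.1  -> recovers g exactly at t = 1
v_hat = v / (1 - beta2**t)   # divides by 0.001 -> recovers g**2 exactly at t = 1

print(m_hat)  # [ 0.2 -0.5  0.1], same scale as the raw gradient
print(v_hat)  # [0.04 0.25 0.01], same scale as the squared gradient
```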
Finally, Adam uses these bias-corrected estimates to update the model parameters $\theta$. The update rule looks quite similar to RMSprop's, but it uses the corrected momentum estimate $\hat{m}_t$ in place of the raw gradient $g_t$:
$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
Here:
- $\alpha$ is the learning rate (step size).
- $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates.
- $\epsilon$ is a small constant added to the denominator for numerical stability, preventing division by zero.
To summarize, the Adam update process at each timestep $t$ involves these steps:
1. Compute the gradient $g_t = \nabla_\theta J(\theta_{t-1})$.
2. Update the first moment estimate: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$.
3. Update the second moment estimate: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$.
4. Compute the bias-corrected estimates $\hat{m}_t$ and $\hat{v}_t$.
5. Update the parameters: $\theta_t = \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.
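Putting these steps together, here is a minimal NumPy sketch of a single Adam update. The function name, the toy loss, and the flat parameter array are assumptions for illustration; a real implementation would track separate moment state for each parameter tensor.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array `theta` given its gradient `grad`.

    `m` and `v` are the running first and second moment estimates,
    and `t` is the 1-based timestep count.
    """
    # Steps 2 and 3: update the exponentially decaying moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2

    # Step 4: bias-correct the estimates
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    # Step 5: update the parameters
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on the loss J(theta) = 0.5 * ||theta||^2, whose gradient is theta
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 101):
    grad = theta                      # step 1: gradient of the toy loss
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # parameters move toward the minimum at [0, 0]
```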
The authors of the original Adam paper suggest default values of $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The learning rate $\alpha$ (often suggested around 0.001) remains a hyperparameter that usually requires tuning.
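In practice you will usually rely on a framework implementation rather than writing the update yourself. As one example, in PyTorch these defaults correspond to the arguments of `torch.optim.Adam`; the small model and data below are placeholders for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model

# lr, betas, and eps written out explicitly; they match the defaults above
optimizer = torch.optim.Adam(model.parameters(),
                             lr=0.001,
                             betas=(0.9, 0.999),
                             eps=1e-8)

# A typical training step
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```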
This step-by-step process, combining adaptive scaling based on the second moment and momentum based on the first moment, along with the crucial bias correction step, makes Adam a widely used and often effective default optimizer for deep learning models.