Having explored methods that adapt learning rates based solely on the second moments of gradients (like RMSprop) and methods that use first moments to build velocity (like Momentum), a natural question arises: can we combine the benefits of both? Can we have an algorithm that uses momentum to accelerate convergence along consistent gradient directions while also adapting the step size for each parameter based on the historical magnitude of its gradients? The Adam optimizer provides an affirmative answer and has become one of the most widely used optimization algorithms in deep learning.
Adam, which stands for Adaptive Moment Estimation, computes adaptive learning rates for each parameter by keeping track of exponentially decaying averages of both past gradients (the first moment) and past squared gradients (the second moment).
Let's break down the mechanics of Adam. At each timestep $t$, with parameters $\theta_t$ and loss function $J(\theta)$, Adam performs the following steps:
Compute Gradient: Calculate the gradient of the loss function with respect to the current parameters: $g_t = \nabla_\theta J(\theta_t)$.
Update Biased First Moment Estimate: Update the moving average of the gradient (the first moment estimate). This is similar to the momentum update but uses an exponential moving average controlled by the hyperparameter $\beta_1$: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$. Here, $m_t$ is the estimate of the mean of the gradients. $\beta_1$ is typically set close to 1, with 0.9 being a common choice.
Update Biased Second Moment Estimate: Update the moving average of the squared gradients (the second raw moment estimate). This resembles the update in RMSprop, using an exponential moving average controlled by the hyperparameter $\beta_2$: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$. Here, $v_t$ estimates the uncentered variance of the gradients, and $g_t^2$ denotes the element-wise square. $\beta_2$ is also usually close to 1, for example 0.999.
Compute Bias-Corrected First Moment Estimate: The moment estimates $m_t$ and $v_t$ are initialized as vectors of zeros. This initialization introduces a bias towards zero, especially during the initial timesteps. Adam corrects for this bias by computing a bias-corrected estimate: $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$. As $t$ increases, the bias correction denominator $1 - \beta_1^t$ approaches 1, making the correction less significant.
Compute Bias-Corrected Second Moment Estimate: Similarly, correct the bias in the second moment estimate: $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$. The denominator $1 - \beta_2^t$ corrects the initialization bias of the second moment estimate.
Update Parameters: Finally, update the parameters. The update rule uses the bias-corrected first moment estimate $\hat{m}_t$ (similar to momentum) and divides it by the square root of the bias-corrected second moment estimate $\hat{v}_t$ (similar to RMSprop), effectively providing a per-parameter learning rate scaling: $\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$. Here, $\eta$ is the learning rate (step size), and $\epsilon$ is a small constant (e.g., $10^{-8}$) added for numerical stability, preventing division by zero when $\hat{v}_t$ is very close to zero.
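Putting these steps together, here is a minimal NumPy sketch of a single Adam update. The function name `adam_step` and its exact signature are illustrative rather than taken from any particular library.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based timestep."""
    # Biased first moment estimate: exponential moving average of gradients
    m = beta1 * m + (1 - beta1) * grad
    # Biased second raw moment estimate: EMA of element-wise squared gradients
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction counteracts the zero initialization of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Momentum-style direction, scaled per parameter by an RMSprop-style denominator
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```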
Adam introduces a few hyperparameters:
- $\eta$: the learning rate (step size), setting the global scale of each update.
- $\beta_1$: the decay rate for the first moment estimate, typically 0.9.
- $\beta_2$: the decay rate for the second moment estimate, typically 0.999.
- $\epsilon$: a small constant for numerical stability, typically $10^{-8}$.
One of the appealing aspects of Adam is that its default hyperparameter values often work well across a range of problems, reducing the need for extensive tuning compared to some other optimizers.
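For instance, assuming you are working in PyTorch, `torch.optim.Adam` ships with these defaults (`lr=1e-3`, `betas=(0.9, 0.999)`, `eps=1e-8`), so it is common to instantiate it without overriding anything:

```python
import torch
from torch import nn

# A tiny model purely for illustration
model = nn.Linear(10, 1)

# Uses PyTorch's defaults: lr=1e-3, betas=(0.9, 0.999), eps=1e-8
optimizer = torch.optim.Adam(model.parameters())

# The explicit form, convenient when you do want to tune these values
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8
)
```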
Adam effectively combines the benefits observed in RMSprop and Momentum:
- From Momentum: the first moment estimate $\hat{m}_t$ acts like a velocity, accelerating progress along directions where gradients consistently agree.
- From RMSprop: dividing by $\sqrt{\hat{v}_t} + \epsilon$ adapts the effective step size of each parameter based on the historical magnitude of its gradients.
Adam is computationally efficient, requires little memory overhead (it only needs to store the first and second moment vectors, which are the same size as the parameters), and is generally well-suited for problems with large datasets and high-dimensional parameter spaces, which are common in deep learning.
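As a quick, purely illustrative sanity check of the `adam_step` sketch above, the following snippet minimizes $f(\theta) = \tfrac{1}{2}\lVert\theta\rVert^2$; note that the only extra state carried between steps is the pair of moment vectors, each the same shape as the parameters.

```python
import numpy as np

# Minimize f(theta) = 0.5 * ||theta||^2 with the adam_step function defined above.
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)  # first moment state, same shape as the parameters
v = np.zeros_like(theta)  # second moment state, same shape as the parameters

for t in range(1, 1001):  # 1-based timestep keeps the bias correction well defined
    grad = theta          # gradient of 0.5 * ||theta||^2 is theta itself
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)

print(theta)  # both components are driven close to zero
```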
It's helpful to see how Adam relates to algorithms we've already discussed:
Adam can be viewed as a sophisticated blend of RMSprop (handling adaptive scaling based on second moments) and momentum (using first moments to guide direction), with an added bias correction mechanism to improve stability during the initial phases of optimization. Its widespread adoption stems from its strong empirical performance across various machine learning tasks, particularly in training deep neural networks. While later sections will discuss variations like AMSGrad that address specific potential convergence issues with Adam, the original Adam algorithm remains a foundational and highly effective optimization technique.
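To make this relationship concrete, consider two informal limiting cases of the update rule above (an illustration, not a formal equivalence):

```latex
% With beta_1 = 0, the first moment is just the current gradient
% (m_t = g_t, and the bias correction leaves \hat{m}_t = g_t),
% so the update collapses to an RMSprop-style rule with bias-corrected scaling:
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, g_t
% Removing the second-moment scaling instead leaves a momentum-style update
% driven by the exponential moving average of past gradients:
\theta_{t+1} = \theta_t - \eta\, \hat{m}_t
```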