Building upon the ideas of adaptive learning rates seen in AdaGrad and RMSprop, and incorporating the concept of momentum, the Adam optimizer has become one of the most widely used and effective optimization algorithms in deep learning. Adam stands for "Adaptive Moment Estimation," and it computes adaptive learning rates for each parameter by storing exponentially decaying averages of past squared gradients (like RMSprop) and past gradients themselves (like momentum).
Think of Adam as combining the benefits of RMSprop (handling non-stationary objectives and per-parameter learning rates) with those of Momentum (helping accelerate progress along consistent gradient directions and dampening oscillations).
Adam maintains two key internal state variables for each parameter $\theta$ being optimized. These are essentially moving averages: the first moment estimate $m_t$, an exponentially decaying average of past gradients (analogous to momentum), and the second moment estimate $v_t$, an exponentially decaying average of past squared gradients (analogous to RMSprop's accumulator).
Let $g_t$ be the gradient of the objective function with respect to the parameters $\theta$ at timestep $t$. The updates for the moving averages $m_t$ and $v_t$ are calculated as follows:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
Here:
- $m_t$ is the first moment estimate (the decaying average of gradients) and $v_t$ is the second moment estimate (the decaying average of squared gradients).
- $\beta_1$ and $\beta_2$ are exponential decay rates, commonly set to 0.9 and 0.999 respectively.
- $g_t^2$ denotes the element-wise square of the gradient.
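As a concrete sketch of these two updates (using NumPy; the variable names and the gradient values are illustrative, not part of any particular library):

```python
import numpy as np

# Sketch of Adam's moving-average updates for a single timestep (not a full optimizer).
beta1, beta2 = 0.9, 0.999          # typical decay rates for the two moments

# State initialized to zero, one entry per parameter (here, 3 parameters).
m = np.zeros(3)                    # first moment: decaying average of gradients
v = np.zeros(3)                    # second moment: decaying average of squared gradients

grad = np.array([0.1, -0.5, 2.0])  # g_t: gradient at the current timestep (made-up values)

m = beta1 * m + (1 - beta1) * grad         # m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v = beta2 * v + (1 - beta2) * grad ** 2    # v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
```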
A potential issue arises because $m_0$ and $v_0$ are initialized to zero. Especially during the initial timesteps and when $\beta_1$ and $\beta_2$ are close to 1, the moment estimates $m_t$ and $v_t$ will be biased towards zero. Adam incorporates a bias-correction step to counteract this initialization bias, particularly important early in training.
The bias-corrected estimates, $\hat{m}_t$ and $\hat{v}_t$, are calculated as:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Notice how the denominators $(1 - \beta_1^t)$ and $(1 - \beta_2^t)$ approach 1 as the timestep $t$ increases. This means the bias correction has a larger effect initially and gradually diminishes over time.
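Continuing the earlier sketch, the bias correction is just a division by a factor that depends on the timestep counter `t` (starting at 1). The moment values below are illustrative, corresponding to what the previous sketch produces after one update:

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
t = 1                                   # timestep counter, starting at 1

# Moment estimates after the first update (illustrative values).
m = np.array([0.01, -0.05, 0.2])        # equals (1 - beta1) * g_1 at t = 1
v = np.array([1e-5, 2.5e-4, 4e-3])      # equals (1 - beta2) * g_1**2 at t = 1

# Bias-corrected estimates: the denominators are small at first
# (0.1 and 0.001 at t = 1), so the correction scales the moments up early on.
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)
```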
Finally, the parameters $\theta$ are updated using the bias-corrected moment estimates. The update rule resembles that of RMSprop, but uses the bias-corrected first moment estimate $\hat{m}_t$ instead of the raw gradient $g_t$:
$$\theta_{t+1} = \theta_t - \frac{\alpha \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
Where:
- $\alpha$ is the base learning rate (step size).
- $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates.
- $\epsilon$ is a small constant (e.g., $10^{-8}$) added for numerical stability, preventing division by zero.
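Putting the pieces together, a minimal single-array Adam step in NumPy might look like the following. The function name `adam_step`, the toy objective, and the learning rate used in the loop are illustrative choices; the default hyperparameters match the commonly cited values:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array `theta`. `t` must start at 1."""
    m = beta1 * m + (1 - beta1) * grad          # update biased first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # update biased second moment
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return theta, m, v

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -3.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 501):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)  # theta moves much closer to the minimum at [0, 0]
```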
Effectively, Adam calculates an update based on the momentum-like term $\hat{m}_t$, but scales the learning rate individually for each parameter using the information from the squared gradients stored in $\hat{v}_t$. Parameters with larger recent gradients (higher variance) will have their effective learning rate reduced, while parameters with smaller recent gradients (lower variance) will have their effective learning rate increased, relative to the base learning rate $\alpha$.
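This scaling can be seen with a small numerical sketch: for two parameters that consistently receive gradients of very different magnitudes, the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ ends up with roughly the same magnitude for both, so each parameter's step is governed mainly by $\alpha$ rather than by the raw gradient size. The gradient values below are made up for illustration:

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
m = np.zeros(2)
v = np.zeros(2)

# Parameter 0 keeps seeing large gradients, parameter 1 keeps seeing tiny ones.
grad = np.array([100.0, 0.001])

for t in range(1, 101):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

update_direction = m_hat / (np.sqrt(v_hat) + eps)
print(update_direction)  # both entries are close to 1.0, so the actual step
                         # taken for each parameter is roughly alpha in size
```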
Adam often performs well in practice across a wide range of deep learning tasks. Its main advantages include per-parameter adaptive learning rates combined with the acceleration benefits of momentum, computational efficiency with only modest memory overhead (two extra values per parameter), and default hyperparameters that work reasonably well on many problems with little tuning.
However, some studies suggest that in certain situations, Adam might converge to less optimal solutions compared to finely tuned SGD with Momentum, although it often converges faster initially. It's still a very strong baseline and often the default choice for many practitioners.
Adam represents a significant step in the development of optimization algorithms, providing a robust and generally effective method for training deep neural networks by adapting the learning process based on the history of gradients.