While the Adam optimizer, discussed previously, provides a robust combination of momentum and adaptive scaling based on the second moment of gradients, researchers have proposed several variants aiming for further improvements or addressing specific behaviors. Two notable variants are Adamax and Nadam. These methods modify Adam's core components, specifically how it scales the learning rate or incorporates momentum, offering alternative approaches for certain optimization scenarios.
Adam scales the learning rate for each parameter based on a decaying average of the squared past gradients ($v_t$), which effectively uses the $L_2$ norm. Adamax, introduced in the same paper as Adam by Kingma and Ba, instead explores scaling with the $L_\infty$ norm (infinity norm, or maximum norm).
The motivation stems from generalizing Adam's update rule. Adam's per-parameter update divides the bias-corrected first moment estimate $\hat{m}_t$ by $\sqrt{\hat{v}_t}$, and $v_t$ is roughly proportional to the $L_2$ norm of the current and past gradients. What happens if we generalize this to an $L_p$ norm? The update component $v_t$ would then involve the $p$-th power of the gradients:
$$v_t = \beta_2^p\, v_{t-1} + (1 - \beta_2^p)\, |g_t|^p$$

(here $\beta_2$ is parameterized as $\beta_2^p$, following Kingma and Ba). The parameter update would then involve division by $v_t^{1/p}$. Adam corresponds to $p = 2$. Adamax considers the limit $p \to \infty$. As $p \to \infty$, the $L_p$ norm converges to the $L_\infty$ norm (the maximum absolute value). This leads to a remarkably simple update for the scaling term, often denoted $u_t$ to distinguish it from Adam's $v_t$:
$$u_t = \max(\beta_2\, u_{t-1},\ |g_t|)$$

Notice that this update requires no element-wise squaring or square roots; it relies only on the max operation. Because $u_t$ is built from a max rather than a sum of squares, it can be more stable than Adam's $v_t$ when dealing with very large but sparse gradients, as the max is less affected by infrequent large values than an accumulated sum of squares.
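For readers curious where the max comes from, the limit can be sketched briefly (this follows the argument in the original Adam paper and is included here only for completeness):

$$u_t = \lim_{p \to \infty} \left(v_t\right)^{1/p}
      = \lim_{p \to \infty} \left( (1 - \beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)}\, |g_i|^p \right)^{1/p}
      = \max\!\left(\beta_2^{t-1} |g_1|,\ \beta_2^{t-2} |g_2|,\ \ldots,\ |g_t|\right),$$

which is exactly the quantity computed recursively by $u_t = \max(\beta_2 u_{t-1}, |g_t|)$.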
The full Adamax parameter update rule becomes:
$$\theta_{t+1} = \theta_t - \frac{\eta}{u_t}\, \hat{m}_t$$

Here, $\hat{m}_t$ is the bias-corrected first moment estimate, calculated exactly as in Adam:
$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t, \qquad \hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

Key points about Adamax:

- It scales parameter updates using the $L_\infty$ (max) norm of past gradients rather than the $L_2$-based scaling used by Adam.
- The scaling term $u_t$ requires no element-wise squaring or square roots.
- It can be more robust when gradients are large but sparse, owing to the max operation being less sensitive than summing squares.

Adamax provides a computationally slightly simpler variant of Adam that might offer better stability in certain situations.
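To make the update concrete, here is a minimal NumPy sketch of a single Adamax step. The function name `adamax_step`, the default hyperparameter values, and the small `eps` added to the denominator are illustrative choices, not part of the derivation above:

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adamax update for a parameter vector theta (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment, exactly as in Adam
    u = np.maximum(beta2 * u, np.abs(grad))   # infinity-norm scaling term u_t
    m_hat = m / (1 - beta1**t)                # bias-corrected first moment
    theta = theta - lr * m_hat / (u + eps)    # divide by u_t instead of sqrt(v_t)
    return theta, m, u

# Toy usage: run a few Adamax steps on f(theta) = ||theta||^2, gradient 2 * theta.
theta = np.array([1.0, -2.0])
m, u = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 201):
    theta, m, u = adamax_step(theta, 2 * theta, m, u, t)
```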
Nesterov Accelerated Gradient (NAG) is known to often improve the convergence speed of standard momentum methods by calculating the gradient after applying a preliminary momentum step (looking ahead). The Nadam (Nesterov-accelerated Adaptive Moment Estimation) optimizer, proposed by Dozat, aims to integrate this Nesterov momentum principle into the Adam framework.
Recall that Adam updates parameters using bias-corrected estimates of the first moment ($m_t$) and second moment ($v_t$) of the gradients. Nadam modifies how the first moment estimate influences the update step to incorporate the "lookahead" aspect of NAG.
The core idea in NAG is to compute the gradient at $\theta_t + \mu m_{t-1}$ (where $\mu$ is the momentum coefficient) instead of at $\theta_t$. Nadam applies a similar correction within Adam's update. Instead of using only the previous momentum term $\beta_1 \hat{m}_{t-1}$ to anticipate the next position, Nadam effectively incorporates the current gradient information more directly into the momentum part of the update.
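As a point of reference, here is a minimal NumPy sketch of the classical NAG lookahead step. The function `nag_step`, the toy quadratic objective, and the hyperparameter values are illustrative assumptions, not from the original text:

```python
import numpy as np

def nag_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    """One classical NAG step: the gradient is evaluated at the lookahead
    point theta + mu * v, not at theta itself (illustrative sketch)."""
    grad = grad_fn(theta + mu * v)   # "look ahead" before computing the gradient
    v = mu * v - lr * grad           # fold that gradient into the velocity
    return theta + v, v              # apply the velocity to the parameters

# Toy usage: a few steps on f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(50):
    theta, v = nag_step(theta, v, lambda x: 2 * x)
```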
Let's look at the update steps for Nadam:

1. Compute the gradient of the objective with respect to the current parameters: $g_t = \nabla_\theta J(\theta_{t-1})$.
2. Update the first moment estimate: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$.
3. Update the second moment estimate: $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$.
4. Compute the bias-corrected first moment: $\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}$.
5. Compute the bias-corrected second moment: $\hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$.
6. Apply the Nesterov-style update: $\theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\left(\beta_1 \hat{m}_t + \dfrac{(1 - \beta_1)\, g_t}{1 - \beta_1^t}\right)$.
Compare the Nadam update (Step 6) to the standard Adam update:
Adam update:

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
           = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\left(\frac{\beta_1 m_{t-1}}{1 - \beta_1^t} + \frac{(1 - \beta_1)\, g_t}{1 - \beta_1^t}\right)$$

The key difference lies in how the momentum term is applied. Nadam uses $\beta_1 \hat{m}_t$, which already includes the influence of the current gradient $g_t$ through $m_t$, combined with an additional term involving the current gradient, $\frac{(1 - \beta_1)\, g_t}{1 - \beta_1^t}$. This structure effectively applies the momentum step using the updated momentum $m_t$ and incorporates the gradient correction in a way that mimics NAG's lookahead behavior.
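The Step 6 formula translates almost line-for-line into code. Below is a minimal NumPy sketch of one Nadam step; the function name `nadam_step` and the default hyperparameter values are illustrative choices:

```python
import numpy as np

def nadam_step(theta, grad, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Nadam update for a parameter vector theta (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment, as in Adam
    v = beta2 * v + (1 - beta2) * grad**2     # second moment, as in Adam
    m_hat = m / (1 - beta1**t)                # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # bias-corrected second moment
    # Nesterov-style combination: updated momentum plus a current-gradient term
    nesterov_m = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1**t)
    theta = theta - lr * nesterov_m / (np.sqrt(v_hat) + eps)
    return theta, m, v
```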
Key points about Nadam:

- It combines Adam's adaptive per-parameter scaling with the lookahead behavior of Nesterov momentum.
- Only the way the first moment enters the update changes; the second moment $v_t$ is handled exactly as in Adam.
- It can converge faster than plain Adam in some settings, at the cost of a slightly more involved update calculation.
Nadam is frequently a strong choice when faster convergence is desired and the overhead of the slightly more complex update calculation is acceptable.
Both Adamax and Nadam are readily available in popular machine learning libraries, such as TensorFlow (`tf.keras.optimizers.Adamax`, `tf.keras.optimizers.Nadam`) and PyTorch (`torch.optim.Adamax`, `torch.optim.NAdam`).
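For instance, a minimal PyTorch usage sketch might look like the following; the tiny linear model, random batch, and learning-rate value are placeholders for illustration:

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in model for illustration

# Adamax: Adam variant that scales by the infinity norm of past gradients
adamax_opt = torch.optim.Adamax(model.parameters(), lr=2e-3)

# NAdam: Adam with Nesterov-style momentum
nadam_opt = torch.optim.NAdam(model.parameters(), lr=2e-3)

# Standard training-step pattern (identical for either optimizer)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
adamax_opt.zero_grad()
loss.backward()
adamax_opt.step()
```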
By understanding the subtle but significant modifications introduced by Adamax and Nadam, you gain more tools for navigating the optimization landscape, potentially achieving more stable or faster training for your machine learning models. They represent valuable refinements built upon the foundational Adam algorithm.