The Adam optimizer provides a combination of momentum and adaptive scaling based on the second moment of gradients. Researchers have proposed several variants aiming for further improvements or addressing specific behaviors. Two notable variants are Adamax and Nadam. These methods modify Adam's core components, specifically how it scales the learning rate or incorporates momentum, offering alternative approaches for certain optimization scenarios.
Adam scales the learning rate for each parameter based on a decaying average of the squared past gradients ($v_t$), which effectively uses the $L_2$ norm of recent gradients. Adamax, introduced in the same paper as Adam by Kingma and Ba, uses the $L_\infty$ norm (infinity norm, or maximum norm) for scaling instead.
The motivation stems from generalizing Adam's update rule. Adam's per-parameter update involves dividing the first moment estimate by $\sqrt{\hat{v}_t}$. The term $\sqrt{v_t}$ is roughly proportional to the $L_2$ norm of the current and past gradients. What happens if we generalize this to an $L_p$ norm? The update component would involve the $p$-th power of the gradients:

$$v_t = \beta_2^p \, v_{t-1} + (1 - \beta_2^p) \, |g_t|^p$$
The parameter update would then involve division by $v_t^{1/p}$. Adam corresponds to $p = 2$. Adamax considers the case where $p \to \infty$. As $p \to \infty$, the $L_p$ norm converges to the $L_\infty$ norm (the maximum absolute value). This leads to a remarkably simple update for the scaling term, often denoted $u_t$ to distinguish it from Adam's $v_t$:

$$u_t = \max(\beta_2 \cdot u_{t-1}, \, |g_t|)$$
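To make this limit concrete, here is a quick numerical check. The gradient values below are arbitrary, chosen only for illustration; the point is that the $L_p$ norm of a vector approaches its maximum absolute entry as $p$ grows:

```python
import numpy as np

# Illustrative check: the L_p norm of a gradient vector approaches the
# maximum absolute value (the L_infinity norm) as p grows.
g = np.array([0.5, -3.0, 1.2, 0.1])

for p in [2, 4, 10, 50, 200]:
    lp_norm = np.sum(np.abs(g) ** p) ** (1.0 / p)
    print(f"p={p:>3}: L_p norm = {lp_norm:.4f}")

print(f"L_inf norm (max |g_i|) = {np.max(np.abs(g)):.4f}")
```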
Notice that this update doesn't require element-wise squaring or square roots; it relies only on the max operation. Because $u_t$ relies on a max operation, it can be more stable than Adam's $v_t$ when dealing with very large but sparse gradients, as the max is less affected by infrequent large values than a sum of squares is.
The full Adamax parameter update rule becomes:

$$\theta_{t+1} = \theta_t - \frac{\eta}{u_t} \, \hat{m}_t$$
Here, $\hat{m}_t$ is the bias-corrected first moment estimate, calculated exactly as in Adam:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad \hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
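A minimal NumPy sketch of a single Adamax step may make the roles of $m_t$, $\hat{m}_t$, and $u_t$ concrete. The function name, toy values, and the small `eps` added for numerical safety are illustrative choices, not part of the update rule above:

```python
import numpy as np

def adamax_step(theta, g, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adamax update following the formulas above (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * g          # first moment, as in Adam
    u = np.maximum(beta2 * u, np.abs(g))     # infinity-norm scaling term u_t
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    theta = theta - lr * m_hat / (u + eps)   # no squaring or square roots needed
    return theta, m, u

# Toy usage: a single parameter vector and one gradient.
theta = np.zeros(3)
m, u = np.zeros(3), np.zeros(3)
g = np.array([0.1, -0.5, 2.0])
theta, m, u = adamax_step(theta, g, m, u, t=1)
print(theta)
```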
Important Points about Adamax:
- Scaling is based on the $L_\infty$ norm of the gradients via the simple recursion $u_t = \max(\beta_2 u_{t-1}, |g_t|)$.
- Its potential stability advantage with very large but sparse gradients comes from the max operation being less sensitive than summing squares.

Adamax provides a computationally slightly simpler variant of Adam that might offer better stability in certain situations.
Nesterov Accelerated Gradient (NAG) is known to often improve the convergence speed of standard momentum methods by calculating the gradient after applying a preliminary momentum step (looking ahead). The Nadam (Nesterov-accelerated Adaptive Moment Estimation) optimizer, proposed by Dozat, aims to integrate this Nesterov momentum principle into the Adam framework.
Recall that Adam updates parameters using bias-corrected estimates of the first moment ($\hat{m}_t$) and second moment ($\hat{v}_t$) of the gradients. Nadam modifies how the first moment estimate influences the update step to incorporate the "lookahead" aspect of NAG.
The core idea in NAG is to compute the gradient at the lookahead point $\theta_t - \gamma m_{t-1}$ (where $\gamma$ is the momentum coefficient and $m_{t-1}$ is the accumulated momentum term) instead of at $\theta_t$. Nadam applies a similar correction within Adam's update. Instead of using only the previous momentum term to anticipate the next position, Nadam effectively incorporates the current gradient information more directly into the momentum part of the update.
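For reference, here is a small sketch of the NAG lookahead in isolation, before it is folded into Adam. The function name, `grad_fn`, and the hyperparameter values are placeholders for this example:

```python
import numpy as np

# Illustrative sketch of the NAG "lookahead": the gradient is evaluated at the
# point the momentum step is about to reach, not at the current parameters.
def nag_step(theta, m, grad_fn, lr=0.01, gamma=0.9):
    lookahead = theta - gamma * m        # peek ahead along the momentum direction
    g = grad_fn(lookahead)               # gradient at the lookahead point
    m = gamma * m + lr * g               # update the momentum (velocity) term
    return theta - m, m                  # apply the combined step

# Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta, m = np.array([2.0, -1.0]), np.zeros(2)
for _ in range(5):
    theta, m = nag_step(theta, m, grad_fn=lambda x: x)
print(theta)
```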
Let's look at the update steps for Nadam:

1. Compute the gradient: $g_t = \nabla_\theta J(\theta_t)$
2. Update the first moment: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
3. Update the second moment: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
4. Bias-correct the first moment: $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$
5. Bias-correct the second moment: $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
6. Apply the Nadam parameter update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_t + \frac{(1 - \beta_1) \, g_t}{1 - \beta_1^t} \right)$$
Compare the Nadam update (Step 6) to the standard Adam update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$
The main difference lies in how the momentum term is applied. Where Adam uses $\hat{m}_t$ alone, Nadam uses $\beta_1 \hat{m}_t$, which includes the influence of the current gradient via $m_t$, combined with another term involving the current gradient, $\frac{(1 - \beta_1) g_t}{1 - \beta_1^t}$. This structure effectively applies the momentum step using the updated momentum and incorporates the gradient correction in a way that mimics NAG's lookahead behavior.
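Putting Step 6 into code can make the comparison easier to see. This is a minimal NumPy sketch assuming the formulation above; the function name and toy values are illustrative:

```python
import numpy as np

def nadam_step(theta, g, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Nadam update following Step 6 above (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * g               # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2          # second moment
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    # Nesterov-style numerator: momentum term plus a current-gradient correction
    nesterov_m = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
    theta = theta - lr * nesterov_m / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage with a single gradient vector.
theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
g = np.array([0.1, -0.5, 2.0])
theta, m, v = nadam_step(theta, g, m, v, t=1)
print(theta)
```

Replacing the `nesterov_m` numerator with `m_hat` recovers the standard Adam update, which isolates exactly where the two methods differ.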
Important Points about Nadam:
- It combines Adam's per-parameter adaptive learning rates with the lookahead behavior of Nesterov momentum.
- It uses the same hyperparameters as Adam ($\eta$, $\beta_1$, $\beta_2$, $\epsilon$), so it can usually serve as a drop-in replacement.
- Incorporating the current gradient into the momentum term can yield faster convergence than Adam on some problems.

Nadam is frequently a strong choice when faster convergence is desired and the overhead of the slightly more complex update calculation is acceptable.
Both Adamax and Nadam are readily available in popular machine learning libraries such as TensorFlow (tf.keras.optimizers.Adamax, tf.keras.optimizers.Nadam) and PyTorch (torch.optim.Adamax, torch.optim.NAdam).
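As a brief illustration of how either optimizer is swapped in, here is a minimal PyTorch sketch; the model, learning rate, and random data are placeholders rather than recommendations:

```python
import torch
import torch.nn as nn

# Placeholder model just to have parameters to optimize.
model = nn.Linear(10, 1)

# Either variant plugs into the usual training loop.
adamax = torch.optim.Adamax(model.parameters(), lr=2e-3)
nadam = torch.optim.NAdam(model.parameters(), lr=2e-3)

# Typical training-step pattern (shown here with the NAdam optimizer).
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
nadam.zero_grad()
loss.backward()
nadam.step()
```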
By understanding the subtle but significant modifications introduced by Adamax and Nadam, you gain more tools for navigating optimization, potentially achieving more stable or faster training for your machine learning models. They represent valuable refinements built upon the foundational Adam algorithm.