While the Adam optimizer has proven effective across a wide range of deep learning tasks, researchers have continued to explore variations aiming for improved stability or convergence speed. Two notable examples are Adamax and Nadam. Think of these not as radical departures, but as refinements building upon the successful foundation of Adam.
Adamax, introduced in the same paper as Adam, offers a different perspective on scaling the learning rate. Recall that Adam scales each parameter's learning rate by the inverse square root of an exponentially decaying average of past squared gradients ($v_t$, which amounts to an $L_2$-norm-based scaling). Adamax replaces this with an exponentially weighted infinity norm ($L_\infty$) of past gradients.
Instead of accumulating squared gradients in $v_t$, Adamax tracks $u_t$, which is defined recursively:

$$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$$

Here, $u_t$ essentially keeps track of the maximum absolute value of the gradient seen so far (with exponential decay governed by $\beta_2$). The parameter update then uses $u_t$ for scaling, replacing the $\sqrt{\hat{v}_t} + \epsilon$ term from Adam:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{u_t} \hat{m}_t$$

(Where $\hat{m}_t$ is the bias-corrected first moment estimate, calculated as in Adam.)
Why use the infinity norm? The update step based on $u_t$ can be seen as more stable than the $L_2$-norm scaling in Adam, especially in situations with very sparse gradients or infrequent large gradients. Because $u_t$ depends on a decayed maximum rather than a sum of squares, it doesn't grow as quickly if a few unusually large gradients occur, potentially preventing the learning rate from becoming excessively small too early. While often performing similarly to Adam, Adamax can sometimes be a useful alternative if you observe instability or very slow progress with Adam.
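To make the scaling concrete, here is a minimal sketch of a single Adamax update for one parameter array, written in NumPy. The function name and default values are illustrative, not part of any library API; a small eps is added only to avoid division by zero when all gradients so far are zero, whereas the paper's formula divides by $u_t$ directly.

```python
import numpy as np

def adamax_step(theta, g, m, u, t, lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative Adamax update for a single parameter array."""
    # Exponentially decaying first moment, exactly as in Adam
    m = beta1 * m + (1 - beta1) * g
    # Infinity-norm scale: decayed previous value vs. current absolute gradient
    u = np.maximum(beta2 * u, np.abs(g))
    # Only the first moment needs bias correction; u does not
    m_hat = m / (1 - beta1**t)
    # Scale by u instead of sqrt(v_hat) + eps
    theta = theta - lr * m_hat / (u + eps)
    return theta, m, u
```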
Nadam, short for Nesterov-accelerated Adaptive Moment Estimation, combines Adam with the Nesterov Accelerated Gradient (NAG) concept. You might recall from Chapter 5 that NAG improves standard momentum by calculating the gradient after applying the current velocity, effectively "looking ahead" to anticipate the parameter's future position.
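As a quick refresher, one common way to write the NAG update (the notation may differ slightly from the formulation in Chapter 5) is:

$$v_t = \mu v_{t-1} - \alpha \, \nabla_\theta J\big(\theta_{t-1} + \mu v_{t-1}\big), \qquad \theta_t = \theta_{t-1} + v_t$$

The key point is that the gradient is evaluated at the "looked-ahead" position $\theta_{t-1} + \mu v_{t-1}$ rather than at the current parameters $\theta_{t-1}$.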
Nadam integrates this look-ahead capability directly into the Adam update rule. It modifies the way the first moment estimate ($m_t$) is used. Instead of applying the momentum step based on the previous estimate ($m_{t-1}$), Nadam incorporates the current momentum estimate ($m_t$) when calculating the update, achieving an effect similar to NAG's look-ahead.
The derivation involves substituting the bias-corrected estimate $\hat{m}_t$ with a Nesterov-enhanced version. While the exact formulas are slightly more involved, the core idea is to account for the effect of the momentum step itself when computing the parameter update.
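The sketch below shows a simplified Nadam-style update for a single parameter array. It is illustrative only: it uses a constant momentum coefficient and omits the decaying momentum schedule of the original formulation, so it is not a drop-in equivalent of torch.optim.NAdam, and the function name and defaults are placeholders.

```python
import numpy as np

def nadam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Nadam-style update for a single parameter array."""
    # Standard Adam moment updates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2

    # Bias-corrected estimates
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    # Nesterov flavor: blend the current bias-corrected momentum with the
    # bias-corrected current gradient, "looking ahead" one momentum step
    m_nesterov = beta1 * m_hat + (1 - beta1) * g / (1 - beta1**t)

    theta = theta - lr * m_nesterov / (np.sqrt(v_hat) + eps)
    return theta, m, v
```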
This incorporation of Nesterov momentum often allows Nadam to converge slightly faster than Adam, particularly on tasks where NAG provides a significant benefit over standard momentum.
Adam remains the go-to adaptive optimizer for many practitioners due to its robustness and good general performance. However, Adamax and Nadam provide valuable alternatives in your optimization toolkit.
Both Adamax and Nadam are readily available in popular deep learning libraries like PyTorch and TensorFlow. Implementing them is typically as simple as changing the optimizer class name.
```python
import torch
import torch.optim as optim

# A small placeholder model so the example is self-contained;
# substitute your own network here
model = torch.nn.Linear(10, 1)
model_parameters = list(model.parameters())  # a list, so it can be reused below

learning_rate = 1e-3
beta1 = 0.9
beta2 = 0.999  # standard default for Adam, Adamax, and NAdam

# Using Adam (for comparison)
# optimizer_adam = optim.Adam(model_parameters, lr=learning_rate, betas=(beta1, beta2))

# Using Adamax
optimizer_adamax = optim.Adamax(model_parameters, lr=learning_rate, betas=(beta1, beta2))

# Using NAdam (note the capitalization of PyTorch's class name)
optimizer_nadam = optim.NAdam(model_parameters, lr=learning_rate, betas=(beta1, beta2))

# Choose one optimizer for your training loop
# optimizer = optimizer_nadam
```
As with all optimization choices, the best performer often depends on the specific dataset, model architecture, and hyperparameter settings. Empirical evaluation through careful experimentation is the most reliable way to determine if Adamax or Nadam offers an advantage over Adam for your particular deep learning challenge.
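As a minimal illustration of that kind of comparison, the sketch below trains the same small model on synthetic regression data with each optimizer and prints the final loss. The model, data, step count, and learning rate are placeholders chosen for brevity, not recommendations.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def final_loss(optimizer_cls, steps=200, lr=1e-3, seed=0):
    """Train a tiny model on synthetic data and return the last loss value."""
    torch.manual_seed(seed)  # same data and initialization for every optimizer
    X = torch.randn(256, 10)
    y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = optimizer_cls(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()

    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
    return loss.item()

for cls in (optim.Adam, optim.Adamax, optim.NAdam):
    print(f"{cls.__name__}: final MSE = {final_loss(cls):.4f}")
```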