While standard Stochastic Gradient Descent (SGD) and its momentum variant form the foundation of optimization, training large, complex models like LLMs often benefits from adaptive learning rate methods. These algorithms adjust the learning rate for each parameter individually, potentially leading to faster convergence, especially in settings with sparse gradients or varying gradient magnitudes across parameters, which are common in deep neural networks.

### Adam: Adaptive Moment Estimation

One of the most popular adaptive optimizers is Adam (Adaptive Moment Estimation). Adam computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients. It essentially combines the ideas of Momentum (using the first moment estimate, an exponentially decaying average of past gradients) and RMSprop (using the second moment estimate, an exponentially decaying average of past squared gradients).

Let $\mathbf{g}_t$ be the gradient of the objective function with respect to the parameters $\theta$ at timestep $t$. Adam maintains two moving averages:

**First Moment Estimate (Momentum):**
$$ \mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \mathbf{g}_t $$
This is an estimate of the mean of the gradients. $\beta_1$ is the exponential decay rate, typically close to 1 (e.g., 0.9).

**Second Moment Estimate (Uncentered Variance):**
$$ \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2 $$
This is an estimate of the uncentered variance of the gradients (element-wise square). $\beta_2$ is the exponential decay rate, also typically close to 1 (e.g., 0.999).

Since $\mathbf{m}_t$ and $\mathbf{v}_t$ are initialized as vectors of zeros, they are biased towards zero, especially during the initial timesteps. Adam corrects for this bias:
$$ \hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t} $$
$$ \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t} $$
where $t$ is the current timestep index (starting from 1).

Finally, the parameter update rule is:
$$ \theta_t = \theta_{t-1} - \eta \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} $$
Here, $\eta$ is the base learning rate, and $\epsilon$ is a small constant (e.g., $10^{-8}$) added for numerical stability, primarily to prevent division by zero. The term $\frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}$ acts as an effective, parameter-specific learning rate: parameters with larger past gradients (larger $\hat{\mathbf{v}}_t$) receive smaller updates, while parameters with smaller past gradients receive larger updates.
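To make the update rule concrete, here is a minimal, illustrative sketch of a single Adam step for one parameter tensor. The helper name `adam_step` and the toy loop are purely illustrative assumptions; in practice `torch.optim.Adam` handles this bookkeeping for every parameter.

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter tensor (t is the 1-based step index)."""
    # Exponentially decaying first and second moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction, which matters most while m and v are still near their zero init
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter-specific step: larger historical gradients -> smaller effective step
    new_param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return new_param, m, v

# Toy usage on a small tensor with random stand-in gradients
theta = torch.zeros(4)
m = torch.zeros_like(theta)
v = torch.zeros_like(theta)
for t in range(1, 6):
    grad = torch.randn(4)  # in real training this would come from loss.backward()
    theta, m, v = adam_step(theta, grad, m, v, t)
```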
In PyTorch, using Adam is straightforward:

```python
import torch
import torch.optim as optim

# Assume model is a defined torch.nn.Module
# learning_rate, beta1, beta2, epsilon are hyperparameters
optimizer = optim.Adam(
    model.parameters(),
    lr=learning_rate,
    betas=(beta1, beta2),
    eps=epsilon
)

# Inside the training loop:
# loss = compute_loss(...)
# optimizer.zero_grad()
# loss.backward()
# optimizer.step()
```

### AdamW: Decoupled Weight Decay

While Adam works well in many situations, its handling of L2 regularization (weight decay) can be suboptimal. Standard L2 regularization adds a term $\frac{\lambda}{2} \|\theta\|^2$ to the loss function, resulting in a gradient term $\lambda \theta$ being added to $\mathbf{g}_t$. In Adam, this weight decay term $\lambda \theta$ becomes part of the adaptive learning rate calculation through $\mathbf{m}_t$ and $\mathbf{v}_t$. This means the effective weight decay applied to a parameter depends on the historical magnitude of its gradients (via $\sqrt{\hat{\mathbf{v}}_t}$): parameters with large gradients experience smaller effective weight decay than intended, while parameters with small gradients experience larger effective weight decay.

AdamW proposes a simple fix: decouple the weight decay from the gradient update. Instead of adding $\lambda \theta$ to the gradient $\mathbf{g}_t$, AdamW performs the standard Adam update using only the gradient from the primary loss function and then applies the weight decay directly to the parameters after the Adam step.

The AdamW update rule looks like this:

1. Calculate $\mathbf{m}_t$, $\mathbf{v}_t$, $\hat{\mathbf{m}}_t$, and $\hat{\mathbf{v}}_t$ as in Adam, using only the gradients $\mathbf{g}_t$ from the loss function (without the $\lambda \theta$ term).
2. Perform the adaptive update: $$ \mathbf{u}_t = \eta \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} $$
3. Apply the weight decay and update the parameters: $$ \theta_t = \theta_{t-1} - \mathbf{u}_t - \eta \lambda \theta_{t-1} $$

Notice the final term, $-\eta \lambda \theta_{t-1}$: the weight decay is applied directly to the previous weight value $\theta_{t-1}$ and is scaled only by the global learning rate $\eta$, not the adaptive rate. This makes the weight decay behave more like it does in standard SGD with momentum, leading to better generalization performance in many cases, particularly for deep models like Transformers. (A short code sketch of this decoupled step appears at the end of this section.)

Using AdamW in PyTorch is similar to Adam, just requiring the `weight_decay` parameter:

```python
import torch
import torch.optim as optim

# Assume model is a defined torch.nn.Module
# learning_rate, beta1, beta2, epsilon, weight_decay_lambda are hyperparameters
optimizer = optim.AdamW(
    model.parameters(),
    lr=learning_rate,
    betas=(beta1, beta2),
    eps=epsilon,
    weight_decay=weight_decay_lambda  # Note the weight_decay parameter
)

# Inside the training loop:
# loss = compute_loss(...)
# optimizer.zero_grad()
# loss.backward()
# optimizer.step()
```

Due to its improved handling of weight decay and strong empirical performance, AdamW has become a very common choice for training large language models. The choice between Adam and AdamW, along with the setting of their hyperparameters ($\eta, \beta_1, \beta_2, \epsilon, \lambda$), often depends on the specific model architecture, dataset, and training setup, requiring careful tuning, as discussed later in this chapter.
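As noted above, the decoupled decay of step 3 can be written out in a few lines. The following is a minimal, illustrative sketch (the helper name `adamw_step` is hypothetical; `optim.AdamW` implements this behavior internally), shown mainly to contrast it with coupled L2 regularization, which would instead add `weight_decay * param` to the gradient before the moment estimates are computed.

```python
import torch

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW-style update: weight decay never enters the moment estimates."""
    # Moments use only the loss gradient (no lambda * theta folded into grad)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step computed from the loss gradient ...
    update = lr * m_hat / (v_hat.sqrt() + eps)
    # ... plus decoupled weight decay, scaled only by the global learning rate
    new_param = param - update - lr * weight_decay * param
    return new_param, m, v
```

Because the decay term bypasses $\sqrt{\hat{\mathbf{v}}_t}$, every parameter is shrunk by the same relative amount per step, regardless of its gradient history.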