Selecting an appropriate learning rate $\eta$ is fundamental for successful model training, but using a fixed learning rate throughout the entire training process is often suboptimal, especially for large, complex models like Transformers. Early in training, the model's parameters are far from optimal, and large gradients might cause instability or divergence if the learning rate is too high. Conversely, later in training, as the model approaches convergence, a smaller learning rate is typically needed for fine-tuning and finding a good minimum in the loss landscape. Learning rate scheduling addresses this by dynamically adjusting the learning rate $\eta_t$ at each training step $t$.
For large language models, a common and effective strategy involves a combination of a warmup phase followed by a decay phase.
During the initial stages of training, especially when using adaptive optimizers like Adam or AdamW, the variance estimates can be unreliable due to the limited number of samples observed. Furthermore, the initial gradients might be large and noisy as the randomly initialized model makes significant errors. A large learning rate applied immediately can lead to numerical instability or cause the model to diverge.
To mitigate this, a warmup period is often employed. During this phase, the learning rate is gradually increased from a very small value (often 0) to its target peak value $\eta_{\text{peak}}$ over a predefined number of steps, warmup_steps. The most common approach is linear warmup:

$$\eta_t = \eta_{\text{peak}} \cdot \frac{t}{\text{warmup\_steps}} \quad \text{for } 0 \le t < \text{warmup\_steps}$$

This gradual increase allows the adaptive moment estimates in optimizers like AdamW to stabilize and prevents large, potentially destabilizing updates early in training. The number of warmup_steps is a hyperparameter, often set to a few thousand steps or a small percentage (e.g., 1-10%) of the total training steps, depending on the dataset size and batch size.
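For instance, warmup_steps is often derived from the total step budget. A minimal sketch (the numbers are purely illustrative and happen to match the configuration used in the code later in this section):

# Illustrative only: set warmup as a fraction of the total training steps.
num_training_steps = 100000  # total optimizer steps planned
warmup_fraction = 0.1        # 10%, at the upper end of the 1-10% range above
num_warmup_steps = max(1, int(warmup_fraction * num_training_steps))  # -> 10000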
Once the warmup phase is complete and the learning rate reaches $\eta_{\text{peak}}$, it's typically beneficial to gradually decrease it over the remaining training steps. This allows the model to settle into a good minimum in the loss landscape. Several decay strategies are common:

Linear decay: The learning rate decreases linearly from $\eta_{\text{peak}}$ to a minimum value $\eta_{\min}$ (often 0) over the steps from warmup_steps to the total number of training steps $T$.

Cosine decay: This is a very popular strategy for training large models. The learning rate follows a cosine curve from $\eta_{\text{peak}}$ down to $\eta_{\min}$: it decreases slowly at first, then faster, and then slows down again as it approaches $\eta_{\min}$. This smooth decay profile is often found to work well in practice:
$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\text{peak}} - \eta_{\min}\right)\left(1 + \cos\left(\pi \, \frac{t - \text{warmup\_steps}}{T - \text{warmup\_steps}}\right)\right) \quad \text{for } \text{warmup\_steps} \le t \le T$$

Other schedules, such as inverse square root decay ($\eta_t \propto 1/\sqrt{t}$) or polynomial decay, are also used, but linear and cosine decays (especially cosine) are very common choices for pre-training large Transformers.
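To make the differences concrete, here is a small, library-agnostic sketch of these decay rules as plain Python functions (the function and variable names are illustrative, not taken from any framework, and each assumes t >= warmup_steps):

import math

def linear_decay_lr(t, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    # Interpolate linearly from peak_lr down to min_lr after warmup.
    progress = min(1.0, (t - warmup_steps) / max(1, total_steps - warmup_steps))
    return peak_lr + (min_lr - peak_lr) * progress

def cosine_decay_lr(t, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    # Follow half a cosine period from peak_lr down to min_lr after warmup.
    progress = min(1.0, (t - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def inverse_sqrt_decay_lr(t, warmup_steps, peak_lr):
    # Proportional to 1/sqrt(t), anchored so the value equals peak_lr at the end of warmup.
    return peak_lr * math.sqrt(max(1, warmup_steps) / max(1, t))

All three start at peak_lr when t equals warmup_steps; they differ only in how quickly they approach their final value.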
PyTorch provides flexible learning rate scheduling tools within torch.optim.lr_scheduler. You can implement custom schedules using LambdaLR or use built-in schedulers. Libraries like Hugging Face's transformers also offer convenient helper functions.

Here's how you might define a scheduler function for linear warmup followed by cosine decay, suitable for use with LambdaLR:
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
# Assume optimizer is already defined:
# model = YourTransformerModel(...)
# optimizer = AdamW(
#     model.parameters(),
#     lr=peak_lr,
#     betas=(0.9, 0.98),
#     eps=1e-6,
#     weight_decay=0.1,
# )

# Configuration
num_training_steps = 100000  # Example total steps
num_warmup_steps = 10000     # Example warmup steps
peak_lr = 3e-4               # Target peak learning rate (optimizer initial lr)
min_lr = 3e-5                # Target minimum learning rate
def lr_lambda(current_step: int):
    # Linear warmup phase: the factor rises linearly from 0 to 1.
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    # Cosine decay phase: progress runs from 0 to 1 over the decay steps.
    progress = (float(current_step - num_warmup_steps) /
                float(max(1, num_training_steps - num_warmup_steps)))
    # Raw cosine factor, decreasing from 1.0 to 0.0.
    cosine_decay = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Rescale so the factor bottoms out at min_lr / peak_lr instead of 0.
    decay_factor = cosine_decay * (1 - min_lr / peak_lr) + (min_lr / peak_lr)
    return decay_factor
# Create the scheduler
lr_scheduler = LambdaLR(optimizer, lr_lambda)
# Training loop excerpt:
# for step, batch in enumerate(dataloader):
#     ... perform forward pass, backward pass ...
#     optimizer.step()
#     lr_scheduler.step()  # Update learning rate
#     optimizer.zero_grad()
#     if step >= num_training_steps:
#         break
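Before launching a long run, it is worth spot-checking a few points of the schedule. A quick sketch using the lr_lambda function above (assuming the optimizer's initial lr equals peak_lr, as configured here):

# Spot-check the schedule at a few representative steps.
for step in [0, 5000, 10000, 55000, 100000]:
    print(step, peak_lr * lr_lambda(step))
# Expected: 0.0 at step 0, 1.5e-4 halfway through warmup, 3e-4 at the end
# of warmup, about 1.65e-4 halfway through the decay, and 3e-5 at the end.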
Alternatively, using the transformers library simplifies this:
from torch.optim import AdamW  # transformers' own AdamW is deprecated
from transformers import get_scheduler
# Configuration (as before)
num_training_steps = 100000
num_warmup_steps = 10000
peak_lr = 3e-4
# model = YourTransformerModel(...)
# optimizer = AdamW(
#     model.parameters(),
#     lr=peak_lr,
#     betas=(0.9, 0.98),
#     eps=1e-6,
#     weight_decay=0.1,
# )
# Get the scheduler
lr_scheduler = get_scheduler(
    name="cosine",  # can also be "linear", "polynomial", etc.
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)
# Note: the cosine schedule in transformers typically decays to 0 by default.
# Customizing min_lr might require a custom LambdaLR approach or modifying
# the source.
# Training loop is the same: call lr_scheduler.step()
# after optimizer.step()
The following chart illustrates a typical learning rate schedule with linear warmup and cosine decay:
Learning rate profile showing a linear increase over 10,000 warmup steps to a peak of 3e-4, followed by a cosine decay towards 0 over the remaining 90,000 steps.
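A chart like this can be generated directly from the lr_lambda function defined earlier. A minimal sketch, assuming matplotlib is installed (the plotting code itself is just an illustration):

import matplotlib.pyplot as plt

steps = list(range(0, num_training_steps + 1, 100))
lrs = [peak_lr * lr_lambda(s) for s in steps]

plt.plot(steps, lrs)
plt.xlabel("Training step")
plt.ylabel("Learning rate")
plt.title("Linear warmup + cosine decay")
plt.show()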
Choosing the right schedule, warmup_steps, peak_lr, and min_lr often involves some experimentation, but the combination of warmup and a subsequent decay (particularly cosine decay) is a robust starting point for training large language models. It provides stability early on and allows for effective convergence later in training.