Selecting an appropriate optimizer like Adam or AdamW is only part of the story for successfully training large Transformer models. The learning rate, a fundamental hyperparameter controlling the step size during optimization, often requires careful adjustment throughout the training process. Simply using a fixed learning rate is typically suboptimal and can lead to instability or slow convergence, especially given the complex loss surfaces and deep architectures involved. This is where learning rate scheduling becomes significant.
Learning rate scheduling defines a strategy for dynamically changing the learning rate during training. For Transformers, a common and highly effective approach involves a combination of a "warmup" phase followed by a "decay" phase.
During the initial stages of training, the model's parameters are randomly initialized, and the gradients can be large and erratic. Using a high learning rate from the start can cause significant updates that destabilize the training process, potentially leading the optimizer far away from good regions of the parameter space. This is particularly relevant for Transformers, where interactions between Layer Normalization, residual connections, and attention mechanisms can be sensitive to large parameter shifts early on.
The warmup phase addresses this by starting with a very small learning rate (often zero) and gradually increasing it over a predetermined number of initial training steps (the `warmup_steps`). A linear increase is common:

$$lr_{step} = lr_{peak} \times \frac{step}{warmup\_steps} \qquad \text{for } step < warmup\_steps$$

Here, $lr_{peak}$ is the maximum learning rate the schedule will reach after the warmup. This gradual increase allows the model to settle into a more stable state before larger parameter updates are applied, preventing early divergence and promoting smoother convergence.
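As a concrete illustration, here is a minimal Python sketch of this linear warmup rule (the function and argument names are illustrative, not part of any particular library):

```python
def linear_warmup_lr(step, peak_lr, warmup_steps):
    """Scale the learning rate linearly from 0 up to peak_lr over warmup_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr  # after warmup, a decay rule (discussed below) takes over
```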
Once the warmup phase is complete and the learning rate has reached its peak value ($lr_{peak}$), it's beneficial to gradually decrease it. Decaying the rate lets the optimizer take progressively smaller steps as training converges, which reduces oscillation around minima, allows finer adjustments to the parameters, and typically improves final performance.
Several decay strategies are commonly employed after the warmup:
Inverse Square Root Decay: This strategy was used in the original "Attention Is All You Need" paper. After the warmup, the learning rate decreases proportionally to the inverse square root of the step number:
$$lr_{step} = lr_{peak} \times \sqrt{\frac{warmup\_steps}{step}} \qquad \text{for } step \geq warmup\_steps$$

Alternatively, and perhaps more commonly, the paper's formula is implemented directly (using the model dimension $d_{model}$ as a scaling factor):

$$lr_{step} = d_{model}^{-0.5} \times \min\left(step^{-0.5},\ step \times warmup\_steps^{-1.5}\right)$$

This formula combines both linear warmup and inverse square root decay into a single expression. Note that $lr_{peak}$ is implicitly defined by $d_{model}^{-0.5}$ and `warmup_steps`.
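In code, this combined schedule can be sketched as a small Python function (the function name is illustrative):

```python
def transformer_lr(step, d_model, warmup_steps):
    """Learning rate from the combined warmup + inverse square root formula.

    Assumes step counting starts at 1; step 0 is clamped to avoid division by zero.
    """
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```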
Cosine Decay (Cosine Annealing): The learning rate follows a cosine curve, decreasing from $lr_{peak}$ down to a minimum value (often zero) over the remaining training steps. This provides a smooth, gradual decrease.
$$lr_{step} = lr_{min} + 0.5 \times (lr_{peak} - lr_{min}) \times \left(1 + \cos\left(\frac{step - warmup\_steps}{total\_steps - warmup\_steps} \times \pi\right)\right)$$

Where $lr_{min}$ is the target minimum learning rate (e.g., 0) and `total_steps` is the total number of training steps planned.
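A minimal Python sketch of cosine decay combined with linear warmup might look like this (names and defaults are illustrative):

```python
import math

def cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(progress * math.pi))
```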
Linear Decay: The learning rate decreases linearly from $lr_{peak}$ to a minimum value (often zero) over the remaining steps.

$$lr_{step} = lr_{min} + (lr_{peak} - lr_{min}) \times \frac{total\_steps - step}{total\_steps - warmup\_steps}$$

Exponential Decay: The learning rate is multiplied by a decay factor less than 1 at regular intervals or every step.
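For completeness, here are minimal Python sketches of these two rules (function names and the example decay factor are illustrative):

```python
def linear_decay_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear decay from peak_lr to min_lr over the post-warmup steps."""
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * max(min(remaining, 1.0), 0.0)

def exponential_decay_lr(step, peak_lr, warmup_steps, decay_rate=0.999):
    """Multiply the peak rate by decay_rate for every step taken after the warmup."""
    return peak_lr * decay_rate ** max(step - warmup_steps, 0)
```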
The choice of decay strategy can impact final model performance, and empirical evaluation is often necessary. Cosine decay and inverse square root decay are particularly popular choices for training Transformers.
The combination of warmup and decay creates a characteristic learning rate profile over time. The following chart illustrates a typical schedule with linear warmup followed by inverse square root decay.
Learning rate schedule with 10 warmup steps reaching a peak rate of 0.0005, followed by an inverse square root decay.
Most deep learning frameworks provide built-in support for learning rate schedulers that can be easily integrated with optimizers.
PyTorch: The `torch.optim.lr_scheduler` module offers various schedulers like `LambdaLR` (for custom functions like inverse square root), `CosineAnnealingLR`, `LinearLR`, and `SequentialLR` (for chaining warmup and decay). You typically call the scheduler's `step()` method after each optimizer `step()`.
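For instance, a linear warmup plus inverse square root schedule can be expressed with `LambdaLR`; the sketch below assumes arbitrary example hyperparameters and a placeholder model:

```python
import torch
from torch import nn, optim

# Placeholder model and optimizer; the base lr of 1.0 is scaled by the lambda below.
model = nn.Linear(512, 512)
optimizer = optim.AdamW(model.parameters(), lr=1.0)

d_model = 512
warmup_steps = 4000

def transformer_schedule(step):
    """Return the multiplicative factor applied to the base learning rate."""
    step = max(step, 1)  # avoid division by zero on the first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=transformer_schedule)

# Inside the training loop, after optimizer.step():
#     scheduler.step()
```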
TensorFlow/Keras: The `tf.keras.optimizers.schedules` module provides classes like `PolynomialDecay` (which can implement linear decay) and `CosineDecay`, and allows creating custom schedules by subclassing `LearningRateSchedule` (sketched below). These schedules are passed directly to the optimizer during initialization.

The specific parameters of the schedule ($lr_{peak}$, `warmup_steps`, decay type, `total_steps`) are important hyperparameters that often need tuning based on the specific model size, dataset, and batch size being used. Selecting an appropriate schedule and tuning its parameters are necessary steps for achieving optimal performance and stable training with Transformer models.
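Returning to the TensorFlow/Keras route mentioned above, a custom schedule implementing the same warmup + inverse square root formula can be sketched by subclassing `LearningRateSchedule` (the class name and warmup value are illustrative):

```python
import tensorflow as tf

class TransformerSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup followed by inverse square root decay."""

    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps ** -1.5
        )

# The schedule is passed directly to the optimizer at construction time.
optimizer = tf.keras.optimizers.Adam(learning_rate=TransformerSchedule(d_model=512))
```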