Once you have prepared your data batches and defined a suitable loss function, the next step is optimizing the model's parameters to minimize that loss. Training deep neural networks, especially large ones like Transformers, requires effective optimization algorithms and strategies. Simply applying standard stochastic gradient descent (SGD) with a fixed learning rate often leads to slow convergence or suboptimal results. Transformers, in particular, benefit from more sophisticated optimization techniques.
The most common optimizer used for training Transformer models is Adam (Adaptive Moment Estimation). Adam combines the advantages of two other popular extensions of SGD: RMSProp (which adapts learning rates based on the magnitude of recent gradients) and Momentum (which helps accelerate gradient vectors in consistent directions, leading to faster convergence).
Here's the core idea behind Adam: it maintains exponentially decaying moving averages of past gradients (the first moment, playing the role of momentum) and of past squared gradients (the second moment, as in RMSProp).
The update rule conceptually involves calculating these biased first and second moment estimates, correcting for their bias (especially important early in training, while the averages are still close to their zero initialization), and then using the corrected estimates to update the model parameters. The update for a parameter $\theta$ at timestep $t$ looks roughly like:
$$\theta_{t+1} = \theta_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $\eta$ is the base learning rate, $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates, and $\epsilon$ is a small constant added for numerical stability (typically $10^{-8}$ or $10^{-9}$).
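To make the moment estimates and bias correction concrete, here is a minimal NumPy sketch of a single Adam update. The function name and arguments are illustrative, not any particular library's API.

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameter array theta, given its gradient grad.

    m, v are the running (biased) first and second moment estimates;
    t is the current step number, starting at 1 (needed for bias correction).
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Bias correction: compensates for m and v being initialized at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Parameter update using the corrected estimates.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```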
Adam is generally preferred for Transformers because it performs well across a wide range of problems, is computationally efficient, has low memory requirements, and is relatively robust to the choice of hyperparameters (though tuning is still beneficial). Common choices for the exponential decay rates of the moment estimates are $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The original Transformer paper used $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$.
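In practice you would rely on a library implementation rather than writing the update yourself. As a sketch, assuming PyTorch and a stand-in model, the original paper's settings map directly onto torch.optim.Adam:

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for a real Transformer module

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,             # placeholder; a schedule usually controls this (see below)
    betas=(0.9, 0.98),   # beta1, beta2 as in the original Transformer paper
    eps=1e-9,            # epsilon as in the original Transformer paper
)
```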
While Adam adapts the learning rate per parameter, the overall global learning rate ($\eta$ in the conceptual formula above) is also critically important. Transformers are known to be sensitive to the learning rate, and using a fixed learning rate throughout training is often ineffective. Instead, a learning rate schedule is typically employed.
The most widely adopted schedule for Transformers involves a combination of a linear "warmup" phase followed by a decay phase.
Warmup: Training starts with a very small learning rate (or even zero). The learning rate is then increased linearly for a specific number of initial training steps, known as warmup_steps. The purpose of this warmup is to prevent instability early in training. When the model parameters are randomly initialized, gradients can be very large and erratic. A large learning rate at the beginning could cause the optimization process to diverge. Gradually increasing the learning rate allows the model to stabilize before larger updates are applied.
Decay: After the warmup phase reaches a peak learning rate, the learning rate is gradually decreased for the remainder of training. This allows for finer adjustments as the model converges towards a minimum. The original Transformer paper used an inverse square root decay function.
The formula often used for this schedule, combining warmup and decay, is:

$$\text{lr} = d_{\text{model}}^{-0.5} \cdot \min\left(\text{step\_num}^{-0.5},\ \text{step\_num} \cdot \text{warmup\_steps}^{-1.5}\right)$$

Here, $d_{\text{model}}$ is the dimensionality of the model's embeddings (e.g., 512), step_num is the current training step number, and warmup_steps is the duration of the warmup phase (e.g., 4000 steps). This formula effectively implements the linear warmup followed by the inverse square root decay.
A typical learning rate schedule for Transformers, showing a linear warmup phase (here, 4000 steps) followed by an inverse square root decay. The peak learning rate depends on the model dimension and warmup steps.
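As a sketch of how this schedule might be implemented, the function below computes the learning rate from the formula above and attaches it to a PyTorch optimizer via LambdaLR. Setting the optimizer's base learning rate to 1.0 is an assumption that lets LambdaLR use the function's return value directly; the model shown is only a stand-in.

```python
import torch

def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = torch.nn.Linear(512, 512)  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=transformer_lr)

# Inside the training loop, advance the schedule once per optimizer step:
#   optimizer.step()
#   scheduler.step()
```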
Other schedules, like cosine decay with warmup or linear decay after warmup, are also used in practice. The choice often depends on the specific task and dataset. Libraries like Hugging Face's transformers provide implementations for various common learning rate schedulers.
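For example, a cosine schedule with warmup can be created in a few lines; the optimizer settings, warmup length, and total step count below are illustrative values, not recommendations.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(512, 512)  # stand-in for a real Transformer model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # 5e-4 acts as the peak learning rate

# Linear warmup for 4000 steps, then cosine decay over the remaining steps.
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=4000,
    num_training_steps=100_000,  # total number of training steps (illustrative)
)
```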
Finding the optimal optimization strategy often requires tuning hyperparameters. For the Adam optimizer, you might adjust $\beta_1$, $\beta_2$, and $\epsilon$, although the defaults (or values used in influential papers, such as $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$) are often a good starting point.
For the learning rate schedule, the warmup_steps and the peak learning rate (or a scaling factor applied to the schedule) are the most important parameters to tune. A common range for warmup_steps is a few thousand steps (e.g., 1000 to 10000), often representing a small percentage of the total training steps. The peak learning rate typically needs careful tuning; values often range from $10^{-5}$ to $10^{-3}$, depending on the model size, batch size, and dataset.
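One simple way to organize this tuning is a small grid search over the peak learning rate and warmup_steps. The sketch below uses a hypothetical train_and_evaluate helper as a placeholder for your actual training and validation code.

```python
import itertools
import random

def train_and_evaluate(peak_lr, warmup_steps):
    """Hypothetical stand-in: run a short training job with this configuration
    and return its validation loss. Replace with real training code."""
    return random.random()  # placeholder result

# Illustrative search ranges; suitable values depend on model size,
# batch size, and dataset.
peak_lrs = [1e-4, 3e-4, 1e-3]
warmup_options = [1000, 4000, 10000]

results = []
for peak_lr, warmup_steps in itertools.product(peak_lrs, warmup_options):
    val_loss = train_and_evaluate(peak_lr=peak_lr, warmup_steps=warmup_steps)
    results.append((val_loss, peak_lr, warmup_steps))

best_loss, best_lr, best_warmup = min(results)
print(f"best: peak_lr={best_lr}, warmup_steps={best_warmup}, val_loss={best_loss:.4f}")
```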
Experimentation is usually required to find the best combination of optimizer settings and learning rate schedule for your specific Transformer model and task. Monitoring training and validation loss curves is essential during this process.
In summary, using the Adam optimizer combined with a learning rate schedule featuring a warmup and decay phase is the standard and highly effective approach for training Transformer models. While default parameters provide a reasonable starting point, tuning these hyperparameters can significantly impact training stability and final model performance.