Training large neural networks, especially deep Transformers, can sometimes lead to numerical instability. One common issue is the "exploding gradient" problem, where the magnitude of the gradients becomes excessively large during backpropagation. This can cause the model parameters to undergo huge updates, potentially leading to divergence (where the loss shoots up to infinity or NaN - Not a Number) or oscillations that prevent convergence. This instability can be particularly prevalent in deep networks or when using certain activation functions or initialization schemes, even with techniques like Layer Normalization discussed earlier. It can also be exacerbated during mixed-precision training (covered in Chapter 20) due to the limited numerical range of lower-precision formats.
Gradient clipping is a straightforward yet effective technique used to mitigate this problem by constraining the magnitude of the gradients before they are used by the optimizer to update the model weights. The core idea is not to change the direction of the gradient update but rather to limit its size if it exceeds a predefined threshold.
The most widely used method for LLMs is clipping by the L2 norm (Euclidean norm) of the gradients. This approach considers the entire set of gradients for all model parameters (or sometimes gradients per parameter group) as a single vector, calculates its L2 norm, and rescales the vector if its norm exceeds a specified threshold, c.
Mathematically, let g represent the vector of all gradients concatenated together. The L2 norm is calculated as:
$$\|g\| = \sqrt{\sum_i g_i^2}$$

The clipping operation is then applied as:

$$g \leftarrow \frac{c}{\|g\|}\, g \quad \text{if } \|g\| > c$$

If the norm ∥g∥ is less than or equal to the threshold c, the gradients remain unchanged. If the norm is greater than c, the gradient vector g is scaled down by a factor of c/∥g∥, ensuring its new norm is exactly c. This preserves the direction of the gradient update while limiting its magnitude.
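To make the rescaling concrete, here is a minimal sketch that applies the rule by hand to a single toy gradient tensor (illustrative values only; in practice the PyTorch utility shown below handles this across all model parameters at once):

import torch

# Toy gradient vector and clipping threshold c (illustrative values)
g = torch.tensor([3.0, 4.0])   # L2 norm = 5.0
c = 1.0

norm = g.norm(p=2)             # ||g|| = sqrt(3^2 + 4^2) = 5.0
if norm > c:
    g = g * (c / norm)         # rescale so the new norm equals c; direction unchanged

print(g)                       # tensor([0.6000, 0.8000])
print(g.norm(p=2))             # ~1.0, i.e. the threshold c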
In PyTorch, this is commonly implemented using the torch.nn.utils.clip_grad_norm_ function. It's applied after the backward pass (which computes the gradients) and before the optimizer step (which updates the weights based on the gradients).
import torch
from torch.nn.utils import clip_grad_norm_
# Assume model, loss, optimizer are defined
# ... inside training loop ...
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward() # Calculate gradients
# Define the maximum gradient norm threshold
max_grad_norm = 1.0
# Clip the gradients
total_norm = clip_grad_norm_(
    model.parameters(),
    max_norm=max_grad_norm,
    norm_type=2.0
)
# Optional: Log the gradient norm before clipping if needed
# print(f"Gradient norm before clipping: {total_norm}")
optimizer.step() # Update weights using (potentially clipped) gradients
# ... rest of training loop ...
In this snippet, clip_grad_norm_ calculates the total L2 norm of all gradients for the parameters passed to it (model.parameters()). If this norm exceeds max_norm (our threshold c), it modifies the gradients in-place by rescaling them. The function returns the original total norm before clipping, which can be useful for monitoring training dynamics. Setting norm_type=2.0 explicitly specifies the L2 norm.
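Since the returned value is the norm before clipping, it is straightforward to track how often clipping actually triggers. The following is a minimal sketch of such monitoring (grad_norm_history and clip_fraction are illustrative names, not part of any library):

# Before the training loop: keep a history of pre-clipping gradient norms
grad_norm_history = []

# ... inside the training loop, after loss.backward() ...
total_norm = clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
grad_norm_history.append(float(total_norm))

# Fraction of steps on which clipping actually rescaled the gradients
clip_fraction = sum(n > max_grad_norm for n in grad_norm_history) / len(grad_norm_history)

A persistently high clip_fraction suggests the learning rate or the threshold deserves a second look, a point revisited below.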
An alternative, though less common for training large Transformers, is clipping by value. This method clips each individual gradient component gi independently if it falls outside a specified range [−c,c].
$$g_i \leftarrow \max(\min(g_i, c), -c)$$

This means any gradient component larger than c is set to c, and any component smaller than −c is set to −c. Unlike norm clipping, this method can change the direction of the overall gradient vector because different components might be clipped differently or not at all.
In PyTorch, this can be done using torch.nn.utils.clip_grad_value_:
import torch
from torch.nn.utils import clip_grad_value_
# ... inside training loop, after loss.backward() ...
# Define the maximum absolute value for each gradient component
clip_value = 0.5
# Clip gradients by value
clip_grad_value_(model.parameters(), clip_value=clip_value)
optimizer.step()
# ... rest of training loop ...
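The directional difference between the two methods can be verified on a toy gradient. The sketch below (illustrative values only) clips the same two-component vector by value and by norm and compares the resulting directions:

import torch

g = torch.tensor([4.0, 0.5])                 # toy gradient; L2 norm ~ 4.03
c = 1.0

# Clipping by value: each component is squeezed into [-c, c] independently
g_value = torch.clamp(g, min=-c, max=c)      # tensor([1.0000, 0.5000])

# Clipping by norm: the whole vector is rescaled, preserving its direction
norm = g.norm(p=2)
g_norm = g * (c / norm) if norm > c else g   # roughly tensor([0.9923, 0.1240])

print(g / g.norm())                          # original unit direction
print(g_value / g_value.norm())              # direction has changed
print(g_norm / g_norm.norm())                # same unit direction as the original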
While simpler, clipping by value is often considered less theoretically grounded for deep learning optimization compared to norm clipping, as it doesn't treat the gradient as a unified direction vector. For most LLM training scenarios, clip_grad_norm_ is the preferred method.
The clipping threshold (max_norm or clip_value) is a hyperparameter that usually requires empirical tuning. A common starting point for max_norm in LLM training is 1.0.

Monitoring the gradient norms during training (for example, by logging the value returned by clip_grad_norm_, which is the total norm before clipping occurs) can help inform this choice. If the norm frequently hits the threshold, you might consider whether the learning rate is too high or whether the threshold could be slightly increased. If the norm rarely approaches the threshold, clipping might not be having much effect.

The following diagram illustrates the effect of clipping by norm. Gradient vectors outside the circle (representing the norm threshold c) are scaled down radially towards the circle's boundary, preserving their original direction.
Gradient vectors g2 and g3 originally have norms greater than the threshold c. Clipping by norm scales them down (dashed blue arrows) to lie on the boundary defined by c, while g1, already within the threshold, remains unchanged.
By preventing excessively large updates, gradient clipping contributes significantly to the stability required for successfully training large language models, allowing optimizers like AdamW and well-designed learning rate schedules to effectively navigate the complex loss surface.