As discussed previously, the primary challenge when using the FP16 format is its limited dynamic range compared to FP32. While FP16 offers significant memory and potential speed benefits, gradients calculated during backpropagation can easily fall outside its representable range. Small gradients might become zero (underflow), losing important update information, while large gradients might become infinity or Not-a-Number (NaN) (overflow), crashing the training process. Loss scaling is a critical technique designed specifically to mitigate these issues, particularly gradient underflow.
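To make this concrete, here is a small, self-contained demonstration of FP16's limits; the specific values are illustrative:

import torch

# FP16's smallest positive (subnormal) value is about 6e-8 and its largest
# finite value is 65504, so magnitudes outside that range are lost:
tiny = torch.tensor(1e-8, dtype=torch.float16)
huge = torch.tensor(70000.0, dtype=torch.float16)
print(tiny)  # tensor(0., dtype=torch.float16) -- underflow to zero
print(huge)  # tensor(inf, dtype=torch.float16) -- overflow to infinity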
The core idea is straightforward: if small gradients are prone to underflow in FP16, we can artificially inflate them before they are converted to FP16 during backpropagation. We achieve this by multiplying the computed loss value by a scaling factor, S, before initiating the backward pass.
Consider the standard backpropagation process where gradients $g$ are computed with respect to the loss $L$:

$$g = \frac{\partial L}{\partial w}$$

With loss scaling, we compute gradients $g_{\text{scaled}}$ with respect to a scaled loss $L_{\text{scaled}} = S \times L$:

$$g_{\text{scaled}} = \frac{\partial (S \times L)}{\partial w} = S \times \frac{\partial L}{\partial w} = S \times g$$

This scaling operation effectively pushes the gradient values higher, making it less likely that they will underflow when represented in FP16.
Of course, these scaled gradients $g_{\text{scaled}}$ cannot be used directly by the optimizer, as they do not represent the true gradient of the original loss. Therefore, after backpropagation computes the gradients (potentially in FP16) but before the optimizer updates the model weights (which typically uses FP32 gradients), we must "unscale" the gradients by dividing them by the same factor $S$:

$$g = \frac{g_{\text{scaled}}}{S}$$

This recovers the original gradient magnitude, now hopefully without the information loss caused by FP16 underflow. This entire process happens within the training loop for each step.
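Written out by hand, the scale/unscale cycle looks roughly like the following sketch, assuming model, criterion, inputs, targets, and optimizer are already defined (the framework utilities shown later automate this, including overflow checks):

S = 1024.0  # assumed constant scaling factor

outputs = model(inputs)
loss = criterion(outputs, targets)

# Backward on the scaled loss: gradients come out as S * g
(S * loss).backward()

# Unscale each gradient back to g before the optimizer consumes it
for p in model.parameters():
    if p.grad is not None:
        p.grad.div_(S)

optimizer.step()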
There are two primary approaches to choosing the scaling factor S: static and dynamic loss scaling.
Static loss scaling is the simpler of the two: you choose a fixed, constant scaling factor S at the beginning of training and use it throughout.
# PyTorch example for static loss scaling
import torch

# Assume S is a pre-chosen constant scaling factor
S = 128.0

# A very large growth_interval effectively freezes the scale at S
scaler = torch.cuda.amp.GradScaler(init_scale=S, growth_interval=100000000)

optimizer.zero_grad()

# Forward pass with autocasting
with torch.cuda.amp.autocast():
    outputs = model(inputs)
    loss = criterion(outputs, targets)

# Manual alternative to GradScaler:
#   scaled_loss = loss * S
#   scaled_loss.backward()

# Use GradScaler to scale the loss and call backward()
scaler.scale(loss).backward()

# scaler.step() implicitly unscales the gradients before stepping;
# if non-finite gradients were found, optimizer.step() is skipped
scaler.step(optimizer)

# Update the scale for the next iteration (a no-op for static scaling)
scaler.update()
The main difficulty with static scaling is selecting an appropriate value for S. If S is too small, the smallest gradients may still underflow to zero; if it is too large, the largest gradients will overflow to Inf and destabilize training. Finding a good static S therefore often requires manual tuning and experimentation, which can be time-consuming, and the chosen value may need adjustment if gradient magnitudes change significantly during the course of training.
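One rough heuristic, not something the framework prescribes, is to inspect gradient magnitudes from a short FP32 run and pick S so the smallest gradients clear FP16's subnormal floor (about 6e-8) while the largest stay below its maximum (65504). A sketch, assuming model has just completed an FP32 backward pass:

# Inspect gradient magnitudes after an FP32 backward pass
grads = [p.grad.abs() for p in model.parameters() if p.grad is not None]
g_min = min(g[g > 0].min().item() for g in grads if (g > 0).any())
g_max = max(g.max().item() for g in grads)
print(f"min nonzero |g| = {g_min:.3e}, max |g| = {g_max:.3e}")
# Choose S so that S * g_min > ~6e-8 while S * g_max < ~65504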
Dynamic loss scaling addresses the shortcomings of the static approach by automatically adjusting the scaling factor S during training. The typical algorithm works as follows:

1. Start with a relatively large initial scaling factor S.
2. After each backward() pass, check the computed gradients (before unscaling) for overflow (presence of Inf or NaN values).
3. If an overflow is detected, discard that step's update (skip the optimizer step) and reduce S, typically by halving it.
4. If no overflow occurs for a certain number (growth_interval) of consecutive steps, increase S, typically by doubling it.
This dynamic adjustment helps maintain the largest possible scaling factor that doesn't cause overflows, maximizing the protection against underflow without requiring manual tuning.
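The adjustment rule itself is simple enough to sketch in a few lines. The helper below, update_scale, is a hypothetical illustration, not the actual GradScaler implementation; its defaults mirror GradScaler's documented defaults (growth_interval=2000, growth_factor=2.0, backoff_factor=0.5):

import torch

def update_scale(params, scale, steps_ok,
                 growth_interval=2000, growth_factor=2.0, backoff_factor=0.5):
    """Toy version of the dynamic adjustment rule (illustrative only)."""
    finite = all(torch.isfinite(p.grad).all()
                 for p in params if p.grad is not None)
    if not finite:
        # Overflow: this step should be skipped and the scale reduced
        return scale * backoff_factor, 0, False
    steps_ok += 1
    if steps_ok >= growth_interval:
        # Long stable stretch: try a larger scale for more underflow protection
        return scale * growth_factor, 0, True
    return scale, steps_ok, True  # (new scale, streak counter, apply step?)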
Modern deep learning frameworks provide utilities to handle this automatically. In PyTorch, torch.cuda.amp.GradScaler implements dynamic loss scaling.
# PyTorch example using GradScaler for dynamic loss scaling
import torch

# growth_interval determines how often it tries to increase the scale
scaler = torch.cuda.amp.GradScaler(init_scale=65536.0, growth_interval=2000)

for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        optimizer.zero_grad()

        # Forward pass with autocasting
        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            loss = criterion(outputs, targets)

        # Scales the loss and calls backward() on the scaled loss
        # to produce scaled gradients
        scaler.scale(loss).backward()

        # scaler.step() unscales the gradients held by the optimizer's
        # parameters; if they contain Infs or NaNs, optimizer.step() is skipped
        scaler.step(optimizer)

        # Updates the scale for the next iteration: reduces it if Inf/NaN
        # gradients were found, increases it if growth_interval has passed
        scaler.update()

    # ... rest of training loop ...
Using GradScaler abstracts away the complexities of checking for overflows and adjusting S. You simply wrap the forward pass, loss scaling, backward pass, and optimizer step as shown.
Gradient clipping, another technique used to stabilize training (discussed in Chapter 17), is often used alongside mixed precision. It's important to perform gradient clipping correctly in conjunction with loss scaling, because clipping the still-scaled gradients against a fixed threshold would be meaningless. The standard practice is as follows (a sketch follows the list):

1. Compute scaled gradients with scaler.scale(loss).backward().
2. Unscale the gradients with scaler.unscale_(optimizer). This modifies the gradients associated with the optimizer's parameters in-place, back to their original scale, but now in FP32.
3. Clip the unscaled gradients, for example with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm).
4. Call scaler.step(optimizer). Note that scaler.step will not unscale again if scaler.unscale_ was already called.
5. Call scaler.update().
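Put together, a minimal sketch of this ordering, reusing the model, optimizer, criterion, and data names from the examples above and an assumed max_norm:

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss = criterion(model(inputs), targets)

scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # gradients back to their true scale, in FP32
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # clip true grads
scaler.step(optimizer)  # detects the prior unscale_; skips step on Inf/NaN
scaler.update()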
Recall that the bfloat16 (BF16) format has the same dynamic range as FP32, although with reduced precision. Because its range is much wider than FP16's, gradient overflow and underflow are significantly less common when using BF16. Consequently, loss scaling is often unnecessary when training with BF16 mixed precision. This simplifies the training setup compared to FP16, assuming your hardware provides efficient support for BF16 operations (like NVIDIA Ampere architecture GPUs and Google TPUs). However, monitoring gradients is still advisable, as extreme cases might still benefit from or require stabilization techniques.
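For comparison, a BF16 training step typically needs no scaler at all; a minimal sketch, reusing the names from the examples above:

optimizer.zero_grad()
# autocast to bfloat16: same exponent range as FP32, so no loss scaling
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
loss.backward()   # unscaled backward; over/underflow is rare in BF16
optimizer.step()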
In summary, loss scaling, particularly dynamic loss scaling as implemented in framework utilities like GradScaler, is an essential technique for stable and effective mixed-precision training with the FP16 format. It counteracts FP16's limited numerical range by temporarily inflating gradient magnitudes during backpropagation, allowing for significant memory savings and potential speedups without sacrificing training stability.