While FP16 offers significant memory savings and potential speedups, its limited dynamic range (approximately 6.1 × 10⁻⁵ to 65,504 for normal values) often necessitates careful handling through techniques like loss scaling. An alternative 16-bit format, BFloat16 (Brain Floating Point, or BF16), was developed specifically to address this range limitation, trading off some precision for a much wider dynamic range comparable to FP32.
The main difference between FP16 and BF16 lies in how they allocate their 16 bits between the exponent and the mantissa (or significand).
Notice that BF16 uses the same number of exponent bits (8) as the standard 32-bit FP32 format. This gives BF16 the same dynamic range as FP32 (approximately 1.2 × 10⁻³⁸ to 3.4 × 10³⁸), drastically reducing the risk of gradients or activations overflowing or underflowing during training. However, this comes at the cost of precision, as BF16 has only 7 mantissa bits compared to FP16's 10 and FP32's 23.
We can visualize the bit allocation difference:
FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits
BF16: 1 sign bit, 8 exponent bits, 7 mantissa bits
Comparison of bit allocation in FP16 and BF16 formats. BF16 prioritizes range (exponent bits) over precision (mantissa bits).
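These differences are easy to confirm programmatically. The following is a minimal check using PyTorch's torch.finfo, which reports the numeric limits of each dtype:

import torch

# Print range and precision characteristics of each floating-point format.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(
        f"{str(dtype):>15}: max={info.max:.3e}, "
        f"smallest normal={info.tiny:.3e}, eps={info.eps:.3e}"
    )

The output shows FP16 topping out at 65,504 while BF16 and FP32 share a maximum near 3.4 × 10³⁸; conversely, BF16's machine epsilon (about 7.8 × 10⁻³) is much coarser than FP16's (about 9.8 × 10⁻⁴), reflecting its fewer mantissa bits.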
The primary advantage of BF16 is improved training stability for large models compared to FP16. Because its dynamic range matches FP32, the likelihood of encountering numerical overflow or underflow in intermediate calculations (like gradients or activations) is significantly lower. This means that dynamic loss scaling, while still potentially useful, is often not strictly necessary, or can be configured less aggressively than is typically required for stable FP16 training.
The trade-off is reduced precision due to fewer mantissa bits. While deep neural networks have often been observed to be resilient to some loss of precision, this difference could potentially impact convergence speed or final model accuracy for certain sensitive tasks or architectures compared to FP32 or FP16. However, for many large language model training scenarios, the stability benefits of BF16's wider range outweigh the precision concerns.
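A small experiment makes both effects concrete: casting a value just above FP16's maximum overflows to infinity in FP16, while BF16 keeps it finite but rounds it, since only 7 mantissa bits are available. A minimal sketch:

import torch

x = torch.tensor(70000.0)  # just above FP16's maximum of 65,504

print(x.to(torch.float16))   # inf: overflows FP16's range
print(x.to(torch.bfloat16))  # finite, but rounded to a nearby representable value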
Another practical consideration is hardware support. BF16 was initially introduced on Google TPUs. NVIDIA added support starting with its Ampere architecture (A100 GPUs) and subsequent generations (e.g., Hopper H100). Older GPUs may not support BF16 operations efficiently, making FP16 the only viable 16-bit option on that hardware.
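A quick way to see what your GPU offers is to query its compute capability alongside PyTorch's built-in BF16 check; Ampere-class and newer devices report compute capability 8.0 or higher. A minimal sketch:

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Compute capability: {major}.{minor}")
    print(f"BF16 supported: {torch.cuda.is_bf16_supported()}")
else:
    print("No CUDA device available; BF16 GPU support cannot be checked.")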
Similar to FP16, modern deep learning frameworks provide convenient wrappers for using BF16 within automatic mixed precision (AMP) contexts. In PyTorch, enabling BF16 is straightforward using torch.autocast.
import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler

# Assume model, optimizer, data_loader are defined
model = nn.Linear(1024, 1024).cuda()  # Example model layer on GPU
optimizer = torch.optim.AdamW(model.parameters())

# Dummy data
data_loader = [
    (torch.randn(64, 1024).cuda(), torch.randn(64, 1024).cuda())
    for _ in range(10)
]

# Use BF16 if available, otherwise fall back
# (or use FP16 if preferred/available)
# Note: BF16 requires CUDA >= 11.0 and an Ampere GPU or newer (or a TPU)
use_bf16 = (
    torch.cuda.is_available()
    and torch.cuda.is_bf16_supported()
)

# GradScaler is often optional for BF16 due to its wider range,
# but can still be used for consistency or if encountering instabilities.
# It becomes a no-op when disabled, which is usually fine for BF16.
# Set enabled=True if loss scaling is desired/needed.
scaler = GradScaler(enabled=False)

print(f"Using BF16: {use_bf16}")

for data, target in data_loader:
    optimizer.zero_grad()

    # Automatic mixed precision context manager with dtype=torch.bfloat16
    with torch.autocast(
        device_type='cuda',
        dtype=torch.bfloat16,
        enabled=use_bf16
    ):
        output = model(data)
        loss = nn.functional.mse_loss(output, target)

    # Scale loss and backward pass
    # scaler.scale(loss) is a pass-through if the scaler is disabled
    scaler.scale(loss).backward()

    # Optimizer step (unscales gradients if the scaler is enabled)
    scaler.step(optimizer)

    # Update the scale factor for the next iteration (no-op when disabled)
    scaler.update()

    print(f"Loss: {loss.item():.4f}")
In this example:
- Hardware support is checked with torch.cuda.is_bf16_supported() before enabling BF16.
- A GradScaler is created but with enabled=False. While loss scaling can be used with BF16, it's often unnecessary due to the format's wide range. Disabling it simplifies the code and removes the overhead associated with scaling checks. You might enable it if you still observe stability issues, though this is less common than with FP16.
- torch.autocast is used with dtype=torch.bfloat16. This instructs PyTorch to automatically cast operations within the block to BF16 where safe and beneficial (typically matrix multiplications and convolutions), while keeping other operations (like reductions and losses) in FP32 for numerical stability (a short check after these lists demonstrates this).
- The scaler.scale(loss).backward() and scaler.step(optimizer) calls behave correctly whether the scaler is enabled or disabled. If disabled, they essentially become pass-through operations for the loss and optimizer step.

The choice between BF16 and FP16 depends on several factors:

- Hardware support: BF16 needs an Ampere-class (or newer) NVIDIA GPU or a TPU; on older GPUs, FP16 is the only efficient 16-bit option.
- Numerical stability: BF16's FP32-like range largely avoids overflow and underflow, reducing the need for careful loss scaling.
- Precision requirements: FP16's extra mantissa bits may matter for tasks or architectures that are sensitive to reduced precision.
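The short check below (a sketch, assuming a BF16-capable GPU and PyTorch's standard autocast op policy) makes the casting behavior visible: a matrix multiplication inside the autocast region yields a bfloat16 tensor, while a loss function such as mse_loss is computed in float32.

import torch
import torch.nn as nn

if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    a = torch.randn(16, 16, device='cuda')
    b = torch.randn(16, 16, device='cuda')
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        product = a @ b                            # matmul runs in BF16 under autocast
        loss = nn.functional.mse_loss(product, b)  # loss op is kept in FP32
    print(product.dtype)  # expected: torch.bfloat16
    print(loss.dtype)     # expected: torch.float32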
In practice, if your hardware supports BF16, it is often the preferred choice for training large language models due to its ease of use and inherent stability, simplifying the mixed-precision training setup. It provides nearly the same memory and speed benefits as FP16 but without the significant headache of managing a narrow dynamic range.