While the computational speed and memory savings offered by 16-bit floating-point (FP16) are attractive, switching from the standard 32-bit precision (FP32) is not without its difficulties. The primary challenge stems from the significantly narrower dynamic range of the FP16 format. Understanding this limitation is fundamental to successfully implementing mixed-precision training.
FP32 numbers use 1 sign bit, 8 exponent bits, and 23 significand (mantissa) bits. This allows them to represent a vast range of values, roughly from $1.18 \times 10^{-38}$ to $3.4 \times 10^{38}$. In contrast, the IEEE 754 standard for FP16 uses 1 sign bit, 5 exponent bits, and 10 significand bits. This allocation drastically reduces the range of representable numbers. The smallest positive normalized FP16 number is $2^{-14} \approx 6.1 \times 10^{-5}$, and the largest is $(2 - 2^{-10}) \times 2^{15} = 65504$. Numbers smaller than $2^{-14}$ can still be represented as subnormal (denormal) values, providing gradual underflow down to approximately $6.0 \times 10^{-8}$, but these often incur performance penalties on hardware and still cover a far smaller range than FP32.
Comparison of approximate maximum and minimum positive normalized values for FP32, FP16, and BF16 (BFloat16) formats on a logarithmic scale. Note the significantly smaller range of FP16.
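You can check these limits directly in PyTorch with torch.finfo, which reports the numerical properties of each floating-point dtype. The short sketch below prints the largest finite value and the smallest positive normalized value for the three formats (printed values are rounded; exact formatting may vary by PyTorch version):

import torch

# Inspect the representable range of each floating-point format
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # info.max: largest finite value; info.tiny: smallest positive normalized value
    print(f"{str(dtype):15s} max={info.max:.3e}  smallest normal={info.tiny:.3e}")

# FP16 tops out near 6.55e+04 with a smallest normal of about 6.10e-05,
# while FP32 and BF16 both span up to roughly 3.4e+38.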
This limited range presents two major problems during deep learning training:
Underflow to Zero: Gradients, especially in deep networks or for parameters updated infrequently, can become very small. If a gradient's magnitude falls below the smallest positive normalized FP16 value ($2^{-14} \approx 6.1 \times 10^{-5}$), it can only be stored as a coarse subnormal value, and below roughly $6.0 \times 10^{-8}$ it is rounded to exactly zero. When gradients become zero, the corresponding weights are no longer updated. This effectively halts learning for parts of the model, potentially preventing convergence or leading to suboptimal results. For example, a true FP32 gradient of $5 \times 10^{-5}$ survives in FP16 only as a low-precision subnormal, while a gradient of, say, $10^{-8}$ becomes zero, losing the update signal entirely.
Overflow to Infinity: Conversely, activations or large gradients, particularly if intermediate calculations within optimizers or loss functions produce large values, can exceed the maximum representable FP16 value (65504). When this occurs, the value overflows to infinity (Inf). Operations involving Inf often result in Not-a-Number (NaN) values (e.g., Inf - Inf, 0 * Inf). Once NaNs appear in the model's weights, activations, or the loss calculation, they tend to propagate rapidly, corrupting the entire training process and causing it to collapse.
Consider a simple example in PyTorch:
import torch

# Example of potential underflow
small_value_fp32 = torch.tensor(5e-5, dtype=torch.float32)
small_value_fp16 = small_value_fp32.half()  # Convert to FP16
print(f"FP32 value: {small_value_fp32}")
print(f"FP16 value: {small_value_fp16}")
# Survives only as a low-precision subnormal (~5.0e-05)

tiny_value_fp16 = torch.tensor(1e-8, dtype=torch.float32).half()
print(f"FP16 value: {tiny_value_fp16}")  # Below the subnormal range: becomes 0.0

# Example of potential overflow
large_value_fp32 = torch.tensor(70000.0, dtype=torch.float32)
large_value_fp16 = large_value_fp32.half()  # Convert to FP16
print(f"FP32 value: {large_value_fp32}")
print(f"FP16 value: {large_value_fp16}")  # Exceeds 65504: becomes inf
These range issues are particularly relevant for LLMs due to their depth and complexity. Activations deep within the network or gradients calculated through long backpropagation paths can easily fall outside the narrow FP16 window. Furthermore, techniques like gradient accumulation, often used in LLM training, can increase the magnitude of summed gradients, risking overflow if not handled carefully.
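To make the accumulation risk concrete, the sketch below uses hypothetical numbers (a per-micro-batch gradient of 2000 summed over 64 steps, chosen purely for illustration) and compares an FP16 accumulator with an FP32 one:

import torch

grad_value = 2000.0       # hypothetical per-micro-batch gradient magnitude
num_micro_batches = 64    # true sum = 128000, well above the FP16 maximum of 65504

acc_fp16 = torch.zeros((), dtype=torch.float16)
acc_fp32 = torch.zeros((), dtype=torch.float32)
for _ in range(num_micro_batches):
    acc_fp16 += torch.tensor(grad_value, dtype=torch.float16)
    acc_fp32 += torch.tensor(grad_value, dtype=torch.float32)

print(f"FP16 accumulator: {acc_fp16}")  # overflows to inf partway through
print(f"FP32 accumulator: {acc_fp32}")  # 128000.0

This is one reason mixed-precision recipes typically keep gradient accumulation buffers and master weights in FP32.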
While these challenges might seem daunting, they do not negate the benefits of FP16. The key is to employ stabilization techniques, most notably loss scaling, which we discuss next. It is also worth noting that the BFloat16 format, covered later in this chapter, sacrifices precision compared to FP16 but retains the same wide dynamic range as FP32, largely circumventing these specific underflow and overflow problems, albeit with potential impacts on convergence due to its lower precision. Successfully navigating FP16 training requires managing this delicate balance between speed, memory, and numerical stability.
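To see the difference in dynamic range concretely, the same values that underflowed or overflowed in FP16 above stay finite and non-zero in BFloat16. This minimal check converts each value to both formats (printed BF16 values are approximate because the format carries far fewer significand bits):

import torch

for value in (1e-8, 5e-5, 70000.0):
    fp16 = torch.tensor(value, dtype=torch.float32).half()
    bf16 = torch.tensor(value, dtype=torch.float32).bfloat16()
    print(f"value={value}: FP16 -> {fp16.item()}, BF16 -> {bf16.item()}")

# FP16 maps 1e-8 to 0.0 and 70000.0 to inf, while BF16 keeps both finite and
# non-zero, though each is stored with noticeably less precision than FP32.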