As we discussed in the chapter introduction, training large language models often hits computational walls. Standard 32-bit floating-point numbers (FP32), while accurate, consume significant memory and compute resources. Mixed-precision training leverages lower-precision formats, like 16-bit floating-point (FP16) and bfloat16 (BF16), to mitigate these issues. Understanding the characteristics of these formats is the first step towards applying them effectively.
Floating-point numbers are the computer's way of approximating real numbers. Each format uses a fixed number of bits, typically divided into three parts: a sign bit (determining if the number is positive or negative), an exponent (determining the magnitude or range of the number), and a mantissa or significand (determining the precision or the significant digits). The trade-offs between these formats boil down to how they allocate bits between the exponent and mantissa.
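To make these fields concrete, the short snippet below reinterprets an FP32 value as its raw 32 bits and splits out the three parts (a minimal illustration using Python's standard struct module; the value -6.25 is an arbitrary example):
import struct
# Reinterpret -6.25 as its raw IEEE 754 single-precision bit pattern
(pattern,) = struct.unpack(">I", struct.pack(">f", -6.25))
bits = f"{pattern:032b}"
sign, exponent, mantissa = bits[0], bits[1:9], bits[9:]
print(f"Sign:     {sign}")      # '1' -> negative
print(f"Exponent: {exponent}")  # '10000001' = 129, i.e. 2 after subtracting the bias of 127
print(f"Mantissa: {mantissa}")  # '1001000...' encodes the fractional part of 1.5625
# -6.25 = -1.5625 * 2^2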
The IEEE 754 standard single-precision floating-point format, commonly known as FP32, uses 32 bits: 1 sign bit, 8 exponent bits, and 23 mantissa bits. This is often the default format in deep learning frameworks and scientific computing.
While reliable, FP32 demands considerable resources. Each parameter requires 4 bytes of storage, and FP32 arithmetic consumes more compute and memory bandwidth than lower-precision alternatives. For models with billions of parameters, this quickly becomes a bottleneck.
import torch
# Create an FP32 tensor (default in PyTorch)
fp32_tensor = torch.tensor([1.0, 2.0, 3.0])
print(f"Data type: {fp32_tensor.dtype}")
print(f"Memory per element (bytes): {fp32_tensor.element_size()}")
# Output:
# Data type: torch.float32
# Memory per element (bytes): 4
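To get a sense of scale, a quick back-of-the-envelope calculation for a hypothetical 7-billion-parameter model (the count is illustrative) shows how the per-element cost adds up, counting only the weights themselves; gradients, optimizer states, and activations add considerably more:
num_params = 7_000_000_000   # hypothetical parameter count
fp32_bytes = num_params * 4  # 4 bytes per FP32 value
fp16_bytes = num_params * 2  # 2 bytes per FP16 or BF16 value
print(f"FP32 weights: {fp32_bytes / 1e9:.0f} GB")       # 28 GB
print(f"FP16/BF16 weights: {fp16_bytes / 1e9:.0f} GB")  # 14 GB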
The IEEE 754 half-precision format, or FP16, cuts the bit count in half compared to FP32, using only 16 bits: 1 sign bit, 5 exponent bits, and 10 mantissa bits.
Halving the storage per value saves memory and allows faster arithmetic on supporting hardware, but the 5-bit exponent gives FP16 a much narrower dynamic range than FP32. Training is therefore prone to overflow (large values becoming Inf) or underflow (small gradients becoming zero, halting learning). The limited range necessitates careful handling, often requiring techniques like gradient scaling (which we will cover later in this chapter) to keep values within the representable FP16 range.
import torch
# Create an FP16 tensor
fp16_tensor = torch.tensor([1.0, 2.0, 3.0]).half() # or .to(torch.float16)
print(f"Data type: {fp16_tensor.dtype}")
print(f"Memory per element (bytes): {fp16_tensor.element_size()}")
# Demonstrate range issue (underflow): 1e-8 is below FP16's
# smallest representable magnitude (~6e-8), so it flushes to zero
small_val_fp32 = torch.tensor(1e-8, dtype=torch.float32)
small_val_fp16 = small_val_fp32.half()
print(f"Small value in FP32: {small_val_fp32}")
print(f"Small value in FP16: {small_val_fp16}")  # becomes 0.0
# Output:
# Data type: torch.float16
# Memory per element (bytes): 2
# Small value in FP32: 9.99999993922529e-09
# Small value in FP16: 0.0
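Overflow is the mirror image of the underflow shown above: FP16's largest finite value is 65504, so anything beyond it becomes Inf. A minimal sketch:
import torch
# Demonstrate range issue (overflow): FP16 cannot represent values above 65504
large_val_fp32 = torch.tensor(70000.0, dtype=torch.float32)
large_val_fp16 = large_val_fp32.half()
print(f"Large value in FP32: {large_val_fp32}")
print(f"Large value in FP16: {large_val_fp16}")  # becomes inf
# Output:
# Large value in FP32: 70000.0
# Large value in FP16: inf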
BFloat16, or BF16, is another 16-bit format, developed by Google Brain. It offers a different trade-off compared to FP16: it keeps FP32's 8 exponent bits, giving it roughly the same dynamic range as FP32, but retains only 7 mantissa bits, so it carries less precision than FP16.
The lower precision of BF16 is generally acceptable for deep learning training, where robustness to widely varying value magnitudes tends to matter more than fine-grained precision. It has become a popular choice for mixed-precision training, particularly for large models.
import torch
# Check if BF16 is available (requires specific hardware/PyTorch version);
# the torch.cuda.is_available() guard lets this also run on CPU-only machines
bf16_available = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
print(f"BF16 Available: {bf16_available}")

if bf16_available:
    # Create a BF16 tensor
    bf16_tensor = torch.tensor([1.0, 2.0, 3.0]).bfloat16()  # or .to(torch.bfloat16)
    print(f"Data type: {bf16_tensor.dtype}")
    print(f"Memory per element (bytes): {bf16_tensor.element_size()}")

    # Demonstrate range (handles small values better than FP16)
    small_val_fp32 = torch.tensor(1e-8, dtype=torch.float32)
    small_val_bf16 = small_val_fp32.bfloat16()
    print(f"Small value in BF16: {small_val_bf16}")  # magnitude kept, precision reduced
    # Example Output (if BF16 is supported):
    # BF16 Available: True
    # Data type: torch.bfloat16
    # Memory per element (bytes): 2
    # Small value in BF16: ~1.0012e-08 (magnitude preserved, unlike FP16)
else:
    print("BF16 not supported on this hardware/PyTorch build.")
Here's a quick summary of the key characteristics:
| Feature | FP32 (Single) | FP16 (Half) | BF16 (BFloat16) |
|---|---|---|---|
| Total Bits | 32 | 16 | 16 |
| Sign Bits | 1 | 1 | 1 |
| Exponent Bits | 8 | 5 | 8 |
| Mantissa Bits | 23 | 10 | 7 |
| Dynamic Range | Wide | Narrow | Wide (like FP32) |
| Precision | High | Medium | Low |
| Memory / Value | 4 bytes | 2 bytes | 2 bytes |
| Stability Risk | Low | High (range) | Low (precision) |
| Hardware Support | Universal | Widespread | Newer accelerators |
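You can check the range and precision rows of this table directly with torch.finfo, which reports the numeric limits of each dtype (a quick sketch; the exact figures come from your PyTorch build):
import torch
# Inspect the numeric limits behind the table above
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} max={info.max:.2e}  smallest normal={info.tiny:.2e}  eps={info.eps:.2e}")
# FP16's maximum (~6.6e+04) is far smaller than FP32's or BF16's (~3.4e+38),
# while BF16's eps (~7.8e-03) is much coarser than FP16's (~9.8e-04).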
Understanding these fundamental differences is essential for making informed decisions when implementing mixed-precision training. While FP16 and BF16 both offer memory and potential speed benefits, their different range and precision characteristics lead to different stability considerations and performance trade-offs, which we will explore in the subsequent sections.