As we discussed in the chapter introduction, training large language models often hits computational walls. Standard 32-bit floating-point numbers (FP32), while accurate, consume significant memory and compute resources. Mixed-precision training leverages lower-precision formats, like 16-bit floating-point (FP16) and bfloat16 (BF16), to mitigate these issues. Understanding the characteristics of these formats is the first step towards applying them effectively.
Floating-point numbers are the computer's way of approximating real numbers. Each format uses a fixed number of bits, typically divided into three parts: a sign bit (determining if the number is positive or negative), an exponent (determining the magnitude or range of the number), and a mantissa or significand (determining the precision or the significant digits). The trade-offs between these formats boil down to how they allocate bits between the exponent and mantissa.
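To make these fields concrete, the short snippet below reinterprets an FP32 value as its raw 32 bits and splits out the three parts (a minimal illustration using Python's standard struct module; the value -6.25 is an arbitrary example):
import struct
# Reinterpret -6.25 as its raw IEEE 754 single-precision bit pattern
(pattern,) = struct.unpack(">I", struct.pack(">f", -6.25))
bits = f"{pattern:032b}"
sign, exponent, mantissa = bits[0], bits[1:9], bits[9:]
print(f"Sign:     {sign}")      # '1' -> negative
print(f"Exponent: {exponent}")  # '10000001' = 129, i.e. 2 after subtracting the bias of 127
print(f"Mantissa: {mantissa}")  # '1001000...' encodes the fractional part of 1.5625
# -6.25 = -1.5625 * 2^2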
The IEEE 754 standard single-precision floating-point format, commonly known as FP32, uses 32 bits: 1 sign bit, 8 exponent bits, and 23 mantissa bits. This is often the default format in deep learning frameworks and scientific computing.
While reliable, FP32 demands considerable resources. Each parameter requires 4 bytes of storage, and FP32 arithmetic consumes more compute and memory bandwidth than lower-precision alternatives. For models with billions of parameters, this quickly becomes a bottleneck.
import torch
# Create an FP32 tensor (default in PyTorch)
fp32_tensor = torch.tensor([1.0, 2.0, 3.0])
print(f"Data type: {fp32_tensor.dtype}")
print(f"Memory per element (bytes): {fp32_tensor.element_size()}")
# Output:
# Data type: torch.float32
# Memory per element (bytes): 4
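To get a sense of scale, a quick back-of-the-envelope calculation for a hypothetical 7-billion-parameter model (the count is illustrative) shows how the per-element cost adds up, counting only the weights themselves; gradients, optimizer states, and activations add considerably more:
num_params = 7_000_000_000   # hypothetical parameter count
fp32_bytes = num_params * 4  # 4 bytes per FP32 value
fp16_bytes = num_params * 2  # 2 bytes per FP16 or BF16 value
print(f"FP32 weights: {fp32_bytes / 1e9:.0f} GB")       # 28 GB
print(f"FP16/BF16 weights: {fp16_bytes / 1e9:.0f} GB")  # 14 GB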
The IEEE 754 half-precision format, or FP16, cuts the bit count in half compared to FP32, using only 16 bits: 1 sign bit, 5 exponent bits, and 10 mantissa bits.
Halving the storage per value saves memory and allows faster arithmetic on supporting hardware, but the 5-bit exponent gives FP16 a much narrower dynamic range than FP32. Training is therefore prone to overflow (large values becoming Inf) or underflow (small gradients becoming zero, halting learning). The limited range necessitates careful handling, often requiring techniques like gradient scaling (which we will cover later in this chapter) to keep values within the representable FP16 range.
import torch
# Create an FP16 tensor
fp16_tensor = torch.tensor([1.0, 2.0, 3.0]).half() # or .to(torch.float16)
print(f"Data type: {fp16_tensor.dtype}")
print(f"Memory per element (bytes): {fp16_tensor.element_size()}")
# Demonstrate range issue (underflow): 1e-8 is below FP16's
# smallest representable magnitude (~6e-8), so it flushes to zero
small_val_fp32 = torch.tensor(1e-8, dtype=torch.float32)
small_val_fp16 = small_val_fp32.half()
print(f"Small value in FP32: {small_val_fp32}")
print(f"Small value in FP16: {small_val_fp16}")  # becomes 0.0
# Output:
# Data type: torch.float16
# Memory per element (bytes): 2
# Small value in FP32: 9.99999993922529e-09
# Small value in FP16: 0.0
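Overflow is the mirror image of the underflow shown above: FP16's largest finite value is 65504, so anything beyond it becomes Inf. A minimal sketch:
import torch
# Demonstrate range issue (overflow): FP16 cannot represent values above 65504
large_val_fp32 = torch.tensor(70000.0, dtype=torch.float32)
large_val_fp16 = large_val_fp32.half()
print(f"Large value in FP32: {large_val_fp32}")
print(f"Large value in FP16: {large_val_fp16}")  # becomes inf
# Output:
# Large value in FP32: 70000.0
# Large value in FP16: inf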
BFloat16, or BF16, is another 16-bit format, developed by Google Brain. It offers a different trade-off compared to FP16: it keeps FP32's 8 exponent bits, giving it roughly the same dynamic range as FP32, but retains only 7 mantissa bits, so it carries less precision than FP16.
The lower precision of BF16 is generally acceptable for deep learning training, where robustness to widely varying value magnitudes tends to matter more than fine-grained precision. It has become a popular choice for mixed-precision training, particularly for large models.
import torch
# Check if BF16 is available (requires specific hardware/PyTorch version);
# the torch.cuda.is_available() guard lets this also run on CPU-only machines
bf16_available = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
print(f"BF16 Available: {bf16_available}")

if bf16_available:
    # Create a BF16 tensor
    bf16_tensor = torch.tensor([1.0, 2.0, 3.0]).bfloat16()  # or .to(torch.bfloat16)
    print(f"Data type: {bf16_tensor.dtype}")
    print(f"Memory per element (bytes): {bf16_tensor.element_size()}")

    # Demonstrate range (handles small values better than FP16)
    small_val_fp32 = torch.tensor(1e-8, dtype=torch.float32)
    small_val_bf16 = small_val_fp32.bfloat16()
    print(f"Small value in BF16: {small_val_bf16}")  # magnitude kept, precision reduced
    # Example Output (if BF16 is supported):
    # BF16 Available: True
    # Data type: torch.bfloat16
    # Memory per element (bytes): 2
    # Small value in BF16: ~1.0012e-08 (magnitude preserved, unlike FP16)
else:
    print("BF16 not supported on this hardware/PyTorch build.")
Here's a quick summary of the key characteristics:
| Feature | FP32 (Single) | FP16 (Half) | BF16 (BFloat16) |
|---|---|---|---|
| Total Bits | 32 | 16 | 16 |
| Sign Bits | 1 | 1 | 1 |
| Exponent Bits | 8 | 5 | 8 |
| Mantissa Bits | 23 | 10 | 7 |
| Dynamic Range | Wide | Narrow | Wide (like FP32) |
| Precision | High | Medium | Low |
| Memory / Value | 4 bytes | 2 bytes | 2 bytes |
| Stability Risk | Low | High (range) | Low (precision) |
| Hardware Support | Universal | Widespread | Newer accelerators |
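You can check the range and precision rows of this table directly with torch.finfo, which reports the numeric limits of each dtype (a quick sketch; the exact figures come from your PyTorch build):
import torch
# Inspect the numeric limits behind the table above
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} max={info.max:.2e}  smallest normal={info.tiny:.2e}  eps={info.eps:.2e}")
# FP16's maximum (~6.6e+04) is far smaller than FP32's or BF16's (~3.4e+38),
# while BF16's eps (~7.8e-03) is much coarser than FP16's (~9.8e-04).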
Understanding these fundamental differences is essential for making informed decisions when implementing mixed-precision training. While FP16 and BF16 both offer memory and potential speed benefits, their different range and precision characteristics lead to different stability considerations and performance trade-offs, which we will explore in the subsequent sections.