Mixed-precision training is a powerful optimization technique focused on enhancing the efficiency of computation on a single GPU. Most deep learning models are trained using standard 32-bit floating-point numbers (FP32), also known as single-precision. This format offers a wide dynamic range and high precision, making it a safe default. However, for many deep learning operations, this level of precision is not strictly necessary. This is where mixed-precision training becomes a valuable tool.
Mixed-precision training combines the use of 16-bit floating-point (FP16), or half-precision, with traditional FP32 numbers. The primary motivation is a significant boost in performance and a reduction in memory usage.
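The memory saving is straightforward: an FP16 value occupies two bytes instead of four, so activations and gradients stored in half precision take half the space. A minimal sketch (the tensor shape is illustrative):

import torch

fp32 = torch.randn(1000, 1000, dtype=torch.float32)
fp16 = fp32.half()

print(fp32.element_size() * fp32.nelement())  # 4,000,000 bytes
print(fp16.element_size() * fp16.nelement())  # 2,000,000 bytes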
Performance comparison between standard FP32 cores and specialized Tensor Cores for FP16 operations on an NVIDIA A100 GPU. The performance gain is substantial.
Simply switching an entire model to FP16 can lead to problems. The FP16 format has a much smaller representable range compared to FP32. During backpropagation, gradient values can become very small. In FP16, these small values may round down to zero, a phenomenon known as underflow. When gradients become zero, weight updates for those parts of the model stop, and the network ceases to learn effectively.
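You can observe this underflow directly (a minimal sketch in PyTorch; the specific values are illustrative). It also hints at why multiplying by a scale factor helps, which the next section explains.

import torch

# A gradient-sized value that FP32 represents comfortably
small = torch.tensor(1e-8, dtype=torch.float32)
print(small.half())           # tensor(0., dtype=torch.float16) -- underflows to zero
print((small * 1024).half())  # a scaled copy survives the cast to FP16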
The solution is not to abandon FP16 but to use it selectively alongside a technique called dynamic loss scaling. This approach forms the core of modern mixed-precision training.
Instead of manually deciding which operations to run in which precision, modern frameworks automate a three-step process:
1. Autocast: during the forward pass and loss computation, operations that benefit from FP16 run in half precision, while numerically sensitive operations stay in FP32; the model's weights remain stored in FP32.
2. Loss scaling: the loss is multiplied by a scaling factor before backpropagation so that small gradient values remain representable in FP16; the gradients are unscaled again before the weight update.
3. Dynamic adjustment: if the scaled gradients overflow (inf or NaN), the factor is reduced and that update is skipped. If training is stable for a period, the factor is increased to utilize the full FP16 range.

The data flow in a single step of mixed-precision training with dynamic loss scaling.
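To make the scaling logic concrete, here is a minimal sketch of what a dynamic loss scaler does conceptually. It is a simplification, not the frameworks' actual implementation; the constants are illustrative, and model, optimizer, loss_fn, inputs, and targets follow the same conventions as the training loops shown later.

import torch

# Illustrative constants; real scalers tune these automatically
scale = 2.0 ** 16
growth_interval = 2000
stable_steps = 0

# --- inside the training loop ---
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
(loss * scale).backward()                      # gradients are now scaled by `scale`

grads = [p.grad for p in model.parameters() if p.grad is not None]
if all(torch.isfinite(g).all() for g in grads):
    for g in grads:
        g /= scale                             # unscale before the weight update
    optimizer.step()
    stable_steps += 1
    if stable_steps % growth_interval == 0:
        scale *= 2.0                           # stable for a while: try a larger scale
else:
    scale /= 2.0                               # overflow (inf/NaN): shrink scale, skip update
    stable_steps = 0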
Fortunately, you rarely need to implement this logic from scratch. Major deep learning frameworks provide high-level APIs that handle all the details for you.
PyTorch provides automatic mixed-precision functionality through its torch.cuda.amp module. Implementation requires adding just a few lines of code to a standard training loop.
You need two components: autocast and GradScaler.
torch.cuda.amp.autocast: A context manager that automatically selects the appropriate precision (FP16 or FP32) for each operation within its scope.
torch.cuda.amp.GradScaler: Manages the dynamic loss scaling process to prevent gradient underflow.
Here is a comparison of a standard PyTorch training loop and one modified for automatic mixed-precision (AMP).
Standard PyTorch Training Loop:
# Standard training loop
optimizer.zero_grad()
outputs = model(inputs)
loss = loss_fn(outputs, targets)
loss.backward()
optimizer.step()
Training Loop with PyTorch AMP:
# Recommended: Create the scaler once, outside the training loop
scaler = torch.cuda.amp.GradScaler()
# --- Inside the training loop ---
optimizer.zero_grad()
# Casts operations to FP16 where appropriate
with torch.cuda.amp.autocast():
outputs = model(inputs)
loss = loss_fn(outputs, targets)
# Scales loss. Calls backward() on scaled loss to create scaled gradients.
scaler.scale(loss).backward()
# scaler.step() first unscales the gradients of the optimizer's params.
# If no infs/NaNs are found, optimizer.step() is then called.
# Otherwise, optimizer.step() is skipped.
scaler.step(optimizer)
# Updates the scale for next iteration.
scaler.update()
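One common adjustment to this loop: if you clip gradients, unscale them first so the clipping threshold applies to the true gradient magnitudes. scaler.unscale_() and torch.nn.utils.clip_grad_norm_() are the standard utilities for this; the max_norm value below is illustrative.

scaler.scale(loss).backward()
# Unscale in place so clipping sees the true gradient magnitudes
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# scaler.step() detects that the gradients are already unscaled and will not unscale them twice
scaler.step(optimizer)
scaler.update()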
TensorFlow's Keras API makes enabling mixed precision even simpler. You set a global policy, and Keras automatically handles the rest, including loss scaling within the model.fit() method.
To enable mixed precision, you add one line of code at the beginning of your script.
import tensorflow as tf
# Set the global policy to use mixed precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')
# ... define your model as usual
inputs = tf.keras.Input(shape=(...))
# Layers will automatically use mixed precision
outputs = tf.keras.layers.Dense(10, activation='softmax')(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
optimizer = tf.keras.optimizers.Adam()
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
# model.fit will automatically handle loss scaling
# model.fit(x_train, y_train, ...)
When you set this policy, Keras automatically wraps your optimizer to handle loss scaling and ensures that layers like Dense and Conv2D use FP16 computation while maintaining FP32 weights. This change is transparent and requires minimal code modification.
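You can confirm this split by inspecting a layer's dtype attributes; the commented values are roughly what Keras reports under the mixed_float16 policy.

dense = model.layers[-1]      # the Dense layer from the model above
print(dense.dtype_policy)     # <Policy "mixed_float16">
print(dense.compute_dtype)    # float16: computations run in half precision
print(dense.variable_dtype)   # float32: weights are stored in full precision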
By adopting mixed-precision training, you can often achieve speedups of 1.5x to 3x on supported hardware with just a few lines of code, making it one of the most effective optimizations for GPU-bound workloads.