This hands-on lab guides you through modifying a standard PyTorch training script to use Automatic Mixed-Precision (AMP) for performance optimization. The goal is to measure the resulting improvements in training speed and memory consumption, demonstrating firsthand how a few small code changes can yield significant performance gains.
To follow along, you will need an environment with:
- A CUDA-capable NVIDIA GPU (the largest AMP speedups come from GPUs with Tensor Cores).
- PyTorch with CUDA support and torchvision installed (pip install torch torchvision).

First, let's establish a baseline. The following script sets up a simple Convolutional Neural Network (CNN) and trains it on the CIFAR-10 dataset using standard 32-bit floating-point precision (FP32). This will be our point of comparison.
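If you want to verify the environment before starting, a quick check such as the one below (not part of the lab scripts) prints the installed PyTorch version and confirms that a CUDA device is visible:

import torch

# Quick sanity check for the lab environment (not part of the lab scripts)
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))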
Save the following code as fp32_baseline.py.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import time

# 1. Data Loading and Preparation
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=256,
                                          shuffle=True, num_workers=4)

# 2. A simple CNN Model
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 8 * 8, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# 3. Standard FP32 Training Loop
print("Starting FP32 baseline training...")
start_time = time.time()

for epoch in range(5):  # loop over the dataset 5 times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 100 == 99:  # print every 100 mini-batches
            print(f'[Epoch: {epoch + 1}, Batch: {i + 1:5d}] loss: {running_loss / 100:.3f}')
            running_loss = 0.0

end_time = time.time()
total_time = end_time - start_time
print(f'Finished FP32 Training in {total_time:.2f} seconds')
Run this script from your terminal:
python fp32_baseline.py
Take note of the total training time. You can also monitor GPU memory usage during the run using the nvidia-smi command in another terminal window. This will be our baseline for comparison.
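If you prefer to record memory from inside the script rather than watching nvidia-smi, you could optionally add a couple of lines around the training loop. The sketch below uses PyTorch's own allocator statistics; note that it counts only memory allocated by PyTorch, so it will read lower than the nvidia-smi figure:

# Optional addition to the script: measure peak GPU memory from inside PyTorch
torch.cuda.reset_peak_memory_stats(device)   # place this just before the training loop

# ... training loop runs here ...

peak_mb = torch.cuda.max_memory_allocated(device) / (1024 ** 2)   # place this after the loop
print(f'Peak GPU memory allocated by PyTorch: {peak_mb:.0f} MB')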
To enable mixed-precision training, we'll use PyTorch's built-in torch.cuda.amp module. This requires two main changes to our training loop:
torch.cuda.amp.autocast: This is a context manager that you wrap around your model's forward pass. It instructs PyTorch to automatically select the optimal data type (FP16 or FP32) for each operation. Operations that benefit from FP16 and are numerically stable, like convolutions and linear layers on Tensor Cores, will run in half-precision. Other operations that require higher precision, like reductions, will remain in FP32.
torch.cuda.amp.GradScaler: Training with FP16 can cause a problem called gradient underflow. Because the FP16 data type has a limited range, very small gradient values can become zero, halting the learning process. The GradScaler solves this by scaling the loss value upwards before the backward pass. This makes all the resulting gradients larger, preventing them from becoming zero. Before the optimizer updates the weights, the scaler then un-scales the gradients back to their correct values.
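Before modifying the training script, a tiny standalone example (not part of the lab scripts, and assuming a CUDA device is available) makes both mechanisms visible: inside autocast, a matrix multiply of FP32 inputs produces an FP16 result, and GradScaler multiplies the loss by a large scale factor before the backward pass:

import torch

device = torch.device("cuda")
a = torch.randn(8, 8, device=device)   # FP32 input
b = torch.randn(8, 8, device=device)   # FP32 input

with torch.cuda.amp.autocast():
    c = a @ b                          # matmul is eligible for half precision
    print(c.dtype)                     # torch.float16

scaler = torch.cuda.amp.GradScaler()
loss = torch.tensor(2.0, device=device)
print(scaler.scale(loss))              # 2.0 multiplied by the scale factor (65536 by default)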
Now, let's modify the script to incorporate AMP. The changes are surprisingly minor, which is a testament to the design of the PyTorch API.
Save this new version as amp_optimized.py. The changes are highlighted in the comments.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import time

# 1. Data Loading and Preparation (No changes here)
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=256,
                                          shuffle=True, num_workers=4)

# 2. Model definition (No changes here)
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 8 * 8, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# AMP CHANGE 1: Initialize a GradScaler
scaler = torch.cuda.amp.GradScaler()

# 3. Modified Training Loop with AMP
print("Starting AMP optimized training...")
start_time = time.time()

for epoch in range(5):  # loop over the dataset 5 times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # AMP CHANGE 2: Wrap the forward pass with autocast
        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            loss = criterion(outputs, labels)

        # AMP CHANGE 3: Use the scaler for the backward pass and optimizer step
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        running_loss += loss.item()
        if i % 100 == 99:  # print every 100 mini-batches
            print(f'[Epoch: {epoch + 1}, Batch: {i + 1:5d}] loss: {running_loss / 100:.3f}')
            running_loss = 0.0

end_time = time.time()
total_time = end_time - start_time
print(f'Finished AMP Training in {total_time:.2f} seconds')
Run the optimized script:
python amp_optimized.py
Once the script finishes, compare the total training time with the baseline fp32_baseline.py run. You should observe a noticeable reduction in training time. If you monitored GPU memory usage with nvidia-smi, you will also see a significant drop.
Your results will vary based on your specific GPU, but they might look something like this:
| Metric | FP32 (Baseline) | AMP (FP16/FP32) | Improvement |
|---|---|---|---|
| Training Time (5 epochs) | ~75 sec | ~48 sec | ~36% faster |
| Peak GPU Memory | ~1.8 GB | ~1.1 GB | ~39% less |
The performance gain comes from two sources. First, FP16 matrix operations run at substantially higher throughput on GPUs with Tensor Cores. Second, half-precision tensors occupy half the memory of FP32 tensors, which shrinks activations and intermediate results and reduces the time spent on memory transfers.
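You can check both factors directly on your machine. The snippet below (a standalone check, not part of the lab scripts) shows that an FP16 element takes 2 bytes versus 4 bytes for FP32, and reports the GPU's compute capability; NVIDIA GPUs with compute capability 7.0 (Volta) or newer include Tensor Cores.

import torch

# Per-element storage: FP16 uses half the bytes of FP32
print(torch.tensor(0, dtype=torch.float32).element_size())   # 4
print(torch.tensor(0, dtype=torch.float16).element_size())   # 2

# Tensor Cores are present on compute capability 7.0 (Volta) and newer
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f'Compute capability: {major}.{minor}')
    print('Tensor Cores available:', major >= 7)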
Performance comparison showing a significant reduction in total training time when using Automatic Mixed-Precision.
This lab demonstrates that AMP is a powerful and easy-to-implement technique. With just three lines of code, you can often achieve substantial speedups and memory savings, making it one of the first optimizations to consider when your training process becomes a bottleneck.
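One closing note: recent PyTorch releases also expose the same functionality under the device-agnostic torch.amp namespace, and newer versions warn that the torch.cuda.amp spellings are deprecated. If you see such warnings, the equivalent calls look like the sketch below (assuming PyTorch 2.4 or later for the torch.amp.GradScaler("cuda") form; a small stand-in model replaces SimpleCNN to keep the example self-contained):

import torch
import torch.nn as nn

# Equivalent AMP usage via the device-agnostic torch.amp namespace
# (assumes PyTorch 2.4+ and a CUDA device; nn.Linear stands in for SimpleCNN)
device = torch.device("cuda")
model = nn.Linear(10, 2).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.amp.GradScaler("cuda")          # instead of torch.cuda.amp.GradScaler()

inputs = torch.randn(4, 10, device=device)
labels = torch.randint(0, 2, (4,), device=device)

optimizer.zero_grad()
with torch.amp.autocast("cuda"):               # instead of torch.cuda.amp.autocast()
    outputs = model(inputs)
    loss = criterion(outputs, labels)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()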