Having explored various advanced optimization algorithms, learning rate schedules, regularization techniques, and normalization strategies, it's time to integrate these components into a practical, sophisticated training loop. Training deep and complex CNNs effectively often requires more than just a basic model.fit() call. This practical section focuses on assembling a custom training loop that incorporates several of the advanced techniques discussed in this chapter. We will build it in Python using PyTorch.
Let's assume you have your model (model), your dataset loaded via data loaders (train_loader, val_loader), and a base loss function (like CrossEntropyLoss). Our goal is to augment the standard training procedure with:

1. An advanced optimizer (AdamW) with decoupled weight decay.
2. A learning rate schedule (OneCycleLR) stepped once per batch.
3. Label smoothing in the loss function.
4. Mixed precision training with autocast and a gradient scaler.
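Of these, label smoothing is the least visible inside the loop itself, so a tiny self-contained check of what it does to the loss may help. This is only an illustrative sketch: the class count, logits, and target below are made up, and the snippet is independent of the training code.

# Illustrative check of label smoothing (standalone sketch, made-up numbers)
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, eps = 5, 0.1
logits = torch.randn(1, num_classes)      # fake model outputs for one sample
target = torch.tensor([2])                # true class index

# Smoothed target: (1 - eps) on the true class, plus eps spread uniformly over all classes
smooth = torch.full((1, num_classes), eps / num_classes)
smooth[0, target] += 1.0 - eps

manual_loss = -(smooth * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
builtin_loss = nn.CrossEntropyLoss(label_smoothing=eps)(logits, target)
print(manual_loss.item(), builtin_loss.item())   # the two values should agree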
First, we initialize the necessary components. We'll move the model to the appropriate device (e.g., GPU).
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast
from torch.optim.lr_scheduler import OneCycleLR
# Assume 'model' is your defined CNN architecture
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# 1. Advanced Optimizer: AdamW
# AdamW decouples weight_decay from the gradient update, applying it directly to the weights (unlike standard Adam)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
# 2. Learning Rate Scheduler: OneCycleLR
# Requires total_steps = epochs * len(train_loader)
# max_lr should typically be determined using an LR range test, but we set a placeholder
epochs = 10
total_steps = epochs * len(train_loader)
scheduler = OneCycleLR(optimizer, max_lr=1e-2, total_steps=total_steps)
# 3. Loss Function with Label Smoothing
# Label smoothing helps prevent overfitting by making the model less confident
# A value of 0.1 is common.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# 4. Mixed Precision Training: GradScaler
# scaler helps manage gradient scaling to prevent underflow with float16
scaler = GradScaler(enabled=torch.cuda.is_available())  # disable scaling when running on CPU
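The max_lr=1e-2 above is only a placeholder. A better value usually comes from an LR range test: briefly train while increasing the learning rate exponentially and note where the loss starts to diverge. The following is a minimal sketch under the assumptions of this section (model, train_loader, criterion, device defined above); the helper name lr_range_test, the bounds, and the iteration count are illustrative choices rather than a fixed recipe.

# Minimal LR range test sketch (illustrative; works on a copy of the model)
import copy
import math

def lr_range_test(model, train_loader, criterion, device,
                  min_lr=1e-6, max_lr_bound=1.0, num_iters=100):
    """Sweep the learning rate exponentially and record the loss at each step."""
    probe = copy.deepcopy(model).to(device)            # work on a copy so the real weights are untouched
    probe.train()
    probe_opt = optim.AdamW(probe.parameters(), lr=min_lr)
    gamma = (max_lr_bound / min_lr) ** (1.0 / num_iters)  # multiplicative LR increase per step
    lrs, losses = [], []
    data_iter = iter(train_loader)
    for _ in range(num_iters):
        try:
            inputs, targets = next(data_iter)
        except StopIteration:
            data_iter = iter(train_loader)
            inputs, targets = next(data_iter)
        inputs, targets = inputs.to(device), targets.to(device)
        probe_opt.zero_grad()
        loss = criterion(probe(inputs), targets)
        loss.backward()
        probe_opt.step()
        lrs.append(probe_opt.param_groups[0]["lr"])
        losses.append(loss.item())
        if not math.isfinite(loss.item()):             # stop once the loss diverges
            break
        for group in probe_opt.param_groups:           # raise the learning rate for the next step
            group["lr"] *= gamma
    return lrs, losses

# Inspect the loss-vs-LR curve and pick max_lr somewhat below the point where the loss blows up.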
The OneCycleLR schedule varies the learning rate significantly throughout training: it starts low, climbs to a maximum (max_lr), and then decays. Visualizing this helps in understanding its behavior.
# Example visualization data (replace with actual scheduler steps)
steps = list(range(total_steps))
lrs = []

# Simulate LR changes with a dummy optimizer so the real optimizer state is untouched
temp_optimizer = optim.AdamW([torch.zeros(1)], lr=1e-3)  # Dummy parameter
temp_scheduler = OneCycleLR(temp_optimizer, max_lr=1e-2, total_steps=total_steps)
for _ in steps:
    lrs.append(temp_scheduler.get_last_lr()[0])
    temp_optimizer.step()   # The optimizer must step before the scheduler
    temp_scheduler.step()
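To render this profile locally, a short matplotlib sketch (assuming matplotlib is installed) can plot the simulated values collected above:

# Plot the simulated learning-rate profile (requires matplotlib)
import matplotlib.pyplot as plt

plt.plot(steps, lrs, label="Learning Rate")
plt.xlabel("Training Step")
plt.ylabel("Learning Rate")
plt.title("OneCycleLR Schedule Example")
plt.legend()
plt.show()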
{"data": [{"x": steps, "y": lrs, "type": "scatter", "mode": "lines", "name": "Learning Rate", "line": {"color": "#4263eb"}}], "layout": {"title": "OneCycleLR Schedule Example", "xaxis": {"title": "Training Step"}, "yaxis": {"title": "Learning Rate"}, "height": 350, "margin": {"l": 50, "r": 20, "t": 50, "b": 40}}}
Learning rate profile generated by the OneCycleLR policy over the total training steps. Note the warm-up, peak, and cool-down phases.
Now, let's integrate these into a function that performs one training step (processing one batch). The key additions are the use of autocast for the forward pass and scaler for the backward pass and optimizer step.
def train_step(model, batch, optimizer, criterion, scaler, scheduler, device):
    """Performs one training step with advanced features."""
    model.train()  # Set model to training mode
    inputs, targets = batch
    inputs, targets = inputs.to(device), targets.to(device)

    optimizer.zero_grad()

    # Use autocast for the forward pass (mixed precision)
    with autocast(enabled=scaler.is_enabled()):
        outputs = model(inputs)
        # Loss calculation uses smoothed targets implicitly via criterion init
        loss = criterion(outputs, targets)

    # Scale the loss and perform backward pass
    scaler.scale(loss).backward()

    # Optional: Gradient Clipping (discussed earlier). Under mixed precision,
    # call scaler.unscale_(optimizer) before clipping; see the sketch below.
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Scaler steps the optimizer (the step is skipped if gradients are non-finite)
    scaler.step(optimizer)
    # Update the scaler for the next iteration
    scaler.update()

    # Step the learning rate scheduler (happens per batch for OneCycleLR)
    scheduler.step()

    # Return loss and current LR for monitoring
    return loss.item(), scheduler.get_last_lr()[0]
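If you enable the gradient clipping commented out above, the gradients must be unscaled first so the norm is measured at its true (unscaled) magnitude. A minimal sketch of that variant of the backward/step sequence, assuming the same objects as in train_step (the max_norm of 1.0 is just an example value):

# Backward/step sequence with gradient clipping under mixed precision
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # bring gradients back to their true scale before measuring norms
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)      # the step is skipped automatically if gradients are non-finite
scaler.update()
scheduler.step()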
We can now assemble the complete training loop over epochs, incorporating the train_step function and adding validation and monitoring.
# --- Monitoring Setup (Example: Using lists, integrate with TensorBoard/WandB in practice) ---
train_losses = []
learning_rates = []
val_accuracies = []
# -----------------------------------------------------------------------------------------

print("Starting Advanced Training...")
for epoch in range(epochs):
    epoch_loss = 0.0
    model.train()  # Ensure model is in training mode

    for batch_idx, batch in enumerate(train_loader):
        loss, current_lr = train_step(model, batch, optimizer, criterion, scaler, scheduler, device)
        epoch_loss += loss

        # --- Monitoring ---
        if batch_idx % 100 == 0:  # Log every 100 batches
            print(f"Epoch {epoch+1}/{epochs}, Batch {batch_idx}/{len(train_loader)}, Loss: {loss:.4f}, LR: {current_lr:.6f}")
            learning_rates.append(current_lr)
        # -----------------

    avg_epoch_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_epoch_loss)
    print(f"Epoch {epoch+1} Average Training Loss: {avg_epoch_loss:.4f}")

    # --- Validation Phase ---
    model.eval()  # Set model to evaluation mode
    correct = 0
    total = 0
    with torch.no_grad():  # Disable gradient calculation for validation
        for batch in val_loader:
            inputs, targets = batch
            inputs, targets = inputs.to(device), targets.to(device)
            # autocast can be used during validation too for consistency, but it is usually not required
            with autocast(enabled=scaler.is_enabled()):
                outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            total += targets.size(0)
            correct += (predicted == targets).sum().item()

    accuracy = 100 * correct / total
    val_accuracies.append(accuracy)
    print(f"Epoch {epoch+1} Validation Accuracy: {accuracy:.2f}%")
    # ----------------------

# --- Post-Training ---
# Save model, plot metrics, etc.
print("Training Finished.")

# Example: Plot loss curve (requires matplotlib)
# import matplotlib.pyplot as plt
# plt.plot(range(1, epochs + 1), train_losses, label='Training Loss')
# plt.xlabel('Epoch')
# plt.ylabel('Loss')
# plt.legend()
# plt.show()
# ---------------------
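For the post-training step, one common option is saving a resumable checkpoint containing not just the weights but also the optimizer, scheduler, and scaler state. A sketch assuming the objects defined above (the filename is arbitrary):

# Save a resumable checkpoint: weights plus optimizer, scheduler, and scaler state
checkpoint = {
    "epoch": epochs,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "scheduler_state": scheduler.state_dict(),
    "scaler_state": scaler.state_dict(),
    "train_losses": train_losses,
    "val_accuracies": val_accuracies,
}
torch.save(checkpoint, "advanced_training_checkpoint.pt")

# To resume later, load and restore each component, for example:
# checkpoint = torch.load("advanced_training_checkpoint.pt", map_location=device)
# model.load_state_dict(checkpoint["model_state"])
# optimizer.load_state_dict(checkpoint["optimizer_state"])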
Implementing these advanced techniques can sometimes introduce new challenges:

- NaN values in loss or gradients can occur if the gradient scaler's parameters are not suitable or if numerical instability arises in certain operations under FP16. Ensure your network layers are compatible with mixed precision. Check the scaler.get_scale() value; if it becomes very small or inf/NaN, adjust the init_scale or growth_interval of GradScaler (a minimal diagnostic sketch follows this list).
- max_lr for OneCycleLR is a sensitive hyperparameter. Using an LR range test beforehand is highly recommended. The interplay between the scheduler, optimizer (especially weight_decay), and batch size needs careful tuning.
- The label_smoothing factor (e.g., 0.1) is another hyperparameter to potentially tune.
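As a minimal diagnostic for the first issue above (a sketch only; the logging format is illustrative), you can check each batch loss and the scaler's current scale inside the training loop:

import math

# Inside the batch loop: flag non-finite losses and report the current AMP scale.
# `loss` is the float returned by train_step for the current batch.
if not math.isfinite(loss):
    print(f"Non-finite loss in epoch {epoch+1}, batch {batch_idx}; "
          f"current GradScaler scale: {scaler.get_scale()}")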
This practical exercise demonstrates how to integrate several powerful techniques from this chapter into a coherent training loop. While the setup involves more code than simpler approaches, the potential gains in training speed, stability, model robustness, and final performance on complex tasks make mastering these advanced loops a valuable skill for deep learning practitioners working on challenging computer vision problems.