Alright, let's put theory into practice. As we've discussed, training deep neural networks involves navigating complex, high-dimensional loss surfaces riddled with challenges like saddle points, plateaus, and sharp minima. The choice of optimizer and its hyperparameters, combined with techniques like normalization and appropriate initialization, significantly impacts whether training converges successfully and efficiently.
This section provides hands-on exercises to observe these phenomena and develop intuition for tuning optimizers in a deep learning context. We'll use a common framework like TensorFlow/Keras or PyTorch and a standard dataset like CIFAR-10 (or a subset) to make things concrete. We assume you have a working environment and familiarity with basic model definition and training loops in your chosen framework.
First, let's define a simple Convolutional Neural Network (CNN) for image classification. A typical small CNN might consist of a few convolutional layers with ReLU activations, followed by max-pooling, and finally one or two fully connected layers leading to the output classification layer (e.g., softmax for CIFAR-10).
# Example using TensorFlow/Keras (conceptual)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10  # CIFAR-10 has 10 classes

def build_model():
    """Return a freshly initialized small CNN for CIFAR-10."""
    return keras.Sequential(
        [
            keras.Input(shape=(32, 32, 3)),  # Input shape for CIFAR-10
            layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Flatten(),
            layers.Dropout(0.5),  # Regularization
            layers.Dense(num_classes, activation="softmax"),
        ]
    )

model = build_model()
# Load CIFAR-10 data (x_train, y_train), (x_test, y_test)
# Preprocess data (normalize pixel values, one-hot encode labels)
# ... data loading and preprocessing code ...
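If you want a concrete starting point, the sketch below uses the standard Keras CIFAR-10 loader, scales pixel values to [0, 1], and one-hot encodes the labels to match the categorical cross-entropy loss used later. Adapt it to your own data pipeline as needed.
# Possible data loading and preprocessing sketch (Keras)
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
# Scale pixel values to the [0, 1] range
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
# One-hot encode the integer labels for categorical cross-entropy
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)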
We also need a function to compile and train the model, allowing us to easily switch optimizers and hyperparameters.
# Example using TensorFlow/Keras (conceptual)
def train_model(model, optimizer, x_train, y_train, validation_data, epochs=20, batch_size=64):
    model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
    history = model.fit(x_train, y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        validation_data=validation_data,
                        verbose=0)  # Set verbose=1 to see progress per epoch
    return history
# Example using PyTorch (conceptual)
# import torch
# import torch.optim as optim
# import torch.nn as nn
#
# # Assume model, train_loader, val_loader, criterion (e.g., nn.CrossEntropyLoss) are defined
# def train_model_pytorch(model, optimizer, train_loader, val_loader, criterion, epochs=20):
# train_losses, val_losses = [], []
# train_accs, val_accs = [], []
# # ... typical PyTorch training loop ...
# # for epoch in range(epochs):
# # model.train()
# # ... training steps ...
# # model.eval()
# # with torch.no_grad():
# # ... validation steps ...
# # Store metrics (loss, accuracy) for plotting
# # return history_dict # Dictionary containing lists of metrics
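For readers working in PyTorch, here is one way the elided loop above might be completed. It is a minimal, unoptimized sketch: it assumes model, train_loader, val_loader, and criterion (e.g., nn.CrossEntropyLoss with integer class labels) are already defined, and it returns a dictionary of per-epoch metrics analogous to a Keras History.history.
# Possible PyTorch training loop sketch (not the only way to structure this)
import torch

def train_model_pytorch(model, optimizer, train_loader, val_loader, criterion, epochs=20):
    history = {"loss": [], "accuracy": [], "val_loss": [], "val_accuracy": []}
    device = next(model.parameters()).device  # Run on whatever device the model lives on
    for epoch in range(epochs):
        # Training phase
        model.train()
        running_loss, correct, total = 0.0, 0, 0
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * inputs.size(0)
            correct += (outputs.argmax(dim=1) == targets).sum().item()
            total += targets.size(0)
        history["loss"].append(running_loss / total)
        history["accuracy"].append(correct / total)

        # Validation phase
        model.eval()
        val_loss, val_correct, val_total = 0.0, 0, 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                val_loss += loss.item() * inputs.size(0)
                val_correct += (outputs.argmax(dim=1) == targets).sum().item()
                val_total += targets.size(0)
        history["val_loss"].append(val_loss / val_total)
        history["val_accuracy"].append(val_correct / val_total)
    return history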
Let's start with Stochastic Gradient Descent (SGD), potentially with momentum. As discussed in Chapter 1, SGD uses noisy gradient estimates from mini-batches. While robust, it can be slow to converge on complex landscapes, especially without momentum, or might get stuck in suboptimal areas.
# Keras example
sgd_optimizer_nomomentum = tf.keras.optimizers.SGD(learning_rate=0.01)
sgd_optimizer_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
# Train with SGD (e.g., with momentum)
# history_sgd = train_model(model, sgd_optimizer_momentum, ...)
# PyTorch example
# sgd_optimizer_nomomentum = optim.SGD(model.parameters(), lr=0.01)
# sgd_optimizer_momentum = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# history_sgd = train_model_pytorch(model, sgd_optimizer_momentum, ...)
Train the network using SGD (try both with and without momentum). Plot the training and validation loss and accuracy over epochs. You'll likely observe relatively slow convergence compared to adaptive methods, and the final accuracy might not be optimal within a limited number of epochs. The curves might also be quite noisy, reflecting the stochastic nature of the updates.
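Plotting these curves is easier with a small helper. The function below is a hypothetical utility (plot_histories is not part of either framework) that overlays a chosen metric from several runs; it assumes each run's metrics are available as a dictionary such as a Keras History.history or the dictionary returned by the PyTorch sketch above.
# Hypothetical helper for overlaying training curves from several runs
import matplotlib.pyplot as plt

def plot_histories(histories, metric="loss", log_scale=False):
    """`histories` maps a run label to a metrics dict such as History.history."""
    plt.figure(figsize=(8, 5))
    for label, hist in histories.items():
        plt.plot(hist[metric], label=f"{label} (train)")
        if "val_" + metric in hist:
            plt.plot(hist["val_" + metric], linestyle="--", label=f"{label} (val)")
    if log_scale:
        plt.yscale("log")  # A log scale makes small differences in loss easier to see
    plt.xlabel("Epoch")
    plt.ylabel(metric)
    plt.legend()
    plt.show()

# Example usage after the SGD runs (history names are illustrative):
# plot_histories({"SGD": history_sgd_plain.history,
#                 "SGD + momentum": history_sgd.history}, log_scale=True)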
Now, let's try the adaptive learning rate methods discussed in Chapter 3, such as RMSprop and Adam. These methods maintain per-parameter learning rates based on moving averages of past squared gradients (RMSprop) or both first and second moments (Adam). They are often the default choice for deep learning tasks due to their generally fast convergence.
# Keras example
rmsprop_optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001) # Default lr is often 0.001
adam_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001) # Default lr is often 0.001
# Train with RMSprop and Adam
# model_rmsprop = build_model() # Re-initialize model weights
# history_rmsprop = train_model(model_rmsprop, rmsprop_optimizer, ...)
# model_adam = build_model() # Re-initialize model weights
# history_adam = train_model(model_adam, adam_optimizer, ...)
# PyTorch example
# rmsprop_optimizer = optim.RMSprop(model.parameters(), lr=0.001)
# adam_optimizer = optim.Adam(model.parameters(), lr=0.001)
# history_rmsprop = train_model_pytorch(...) # Re-initialize model
# history_adam = train_model_pytorch(...) # Re-initialize model
Train the same architecture (with re-initialized weights each time) using RMSprop and Adam with their common default learning rates (often 10⁻³). Plot the training/validation loss and accuracy curves for SGD (with momentum), RMSprop, and Adam on the same graph for comparison.
Comparison of typical training loss curves for SGD with momentum, RMSprop, and Adam on a sample CNN task. Adaptive methods often converge faster initially. Note the logarithmic scale for the y-axis to better visualize differences.
You should generally observe that Adam and RMSprop converge significantly faster in the initial epochs compared to SGD. However, monitor the validation performance closely. Sometimes, adaptive methods might converge quickly to a sharp minimum that generalizes slightly worse than the minimum found by SGD (though this is problem-dependent).
Adaptive optimizers are not magic bullets; their performance still hinges on hyperparameters, primarily the learning rate (η). Let's experiment with Adam's learning rate.
Train the model using Adam with different learning rates, for example: η = 10⁻², η = 10⁻³ (the default), and η = 10⁻⁴.
# Keras example
adam_lr_high = tf.keras.optimizers.Adam(learning_rate=0.01)
adam_lr_low = tf.keras.optimizers.Adam(learning_rate=0.0001)
# history_adam_high = train_model(...) # Re-initialize model
# history_adam_low = train_model(...) # Re-initialize model
# PyTorch example
# adam_lr_high = optim.Adam(model.parameters(), lr=0.01)
# adam_lr_low = optim.Adam(model.parameters(), lr=0.0001)
# history_adam_high = train_model_pytorch(...) # Re-initialize model
# history_adam_low = train_model_pytorch(...) # Re-initialize model
Plot the training/validation loss curves for these different learning rates alongside the default.
Effect of varying the learning rate for the Adam optimizer. A rate that is too high (η=0.01) can lead to divergence or unstable training. A rate that is too low (η=0.0001) converges very slowly. The default (η=0.001) often provides a good starting point.
While Adam has other hyperparameters (β₁, β₂, ϵ), the learning rate is almost always the first and most important one to tune. You might also incorporate learning rate schedules (Chapter 3), such as reducing the learning rate by a factor every few epochs (step decay) or using cosine annealing. This is often beneficial, especially when combined with SGD or Adam, allowing larger steps initially and finer adjustments later.
# Keras example: Using a scheduler
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=10000,  # Adjust based on dataset size/batch size
    decay_rate=0.9)
adam_optimizer_scheduled = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
# history_adam_scheduled = train_model(model, adam_optimizer_scheduled, ...)
# PyTorch example: Using a scheduler
# optimizer = optim.Adam(model.parameters(), lr=0.001)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # Halve LR every 10 epochs
# # At the end of each epoch, after that epoch's optimizer.step() calls:
# # scheduler.step()
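Cosine annealing, mentioned above, is also available out of the box in both frameworks. The snippet below is a brief sketch; the decay_steps and T_max values are placeholders you would set to roughly the total number of training steps or epochs.
# Keras example: cosine annealing schedule
cosine_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.001,
    decay_steps=10000)  # Placeholder: roughly the total number of training steps
adam_optimizer_cosine = tf.keras.optimizers.Adam(learning_rate=cosine_schedule)
# PyTorch example: cosine annealing schedule
# optimizer = optim.Adam(model.parameters(), lr=0.001)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
# # Call scheduler.step() once per epoch, as with StepLR above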
Let's revisit two important topics from earlier in this chapter: initialization and normalization. How do they interact with our optimizer choice?
Initialization: Try training your network with Adam (η=0.001) but using a poor initialization strategy (e.g., initializing all weights to zero or very small random numbers without proper scaling like He or Xavier/Glorot). You'll likely observe that the network fails to learn effectively, regardless of the optimizer. Gradients might vanish or explode right from the start, or symmetry issues might prevent learning. Compare this to training with proper He initialization (standard for ReLU activations). This highlights that even a sophisticated optimizer cannot easily overcome fundamental issues caused by poor initialization.
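In Keras, one quick way to run this comparison is to set the kernel_initializer argument on each weight layer; the conceptual lines below show the two settings to swap in (only one convolutional layer is shown for brevity). In PyTorch, the torch.nn.init functions play the same role.
# Keras example: deliberately poor vs. proper initialization (conceptual)
# Poor: all weights start at zero, so units in a layer receive identical updates
layers.Conv2D(32, kernel_size=(3, 3), activation="relu", kernel_initializer="zeros")
# Proper: He initialization, the usual choice for ReLU activations
layers.Conv2D(32, kernel_size=(3, 3), activation="relu", kernel_initializer="he_normal")
# PyTorch equivalents: torch.nn.init.zeros_(layer.weight) / torch.nn.init.kaiming_normal_(layer.weight)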
Batch Normalization: Now, add Batch Normalization layers to your network architecture, placed after the convolutional (or dense) layers and before their activations.
# Keras example with Batch Norm
model_bn = keras.Sequential(
    [
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, kernel_size=(3, 3)),
        layers.BatchNormalization(),  # Add BN
        layers.Activation("relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        # ... other layers with BN ...
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)
# history_adam_bn = train_model(model_bn, adam_optimizer, ...)  # Use same Adam config
Train this modified network using the same Adam optimizer (η=0.001). Compare its training/validation curves to the original network without Batch Normalization.
You should see that Batch Normalization often speeds up convergence, allows higher learning rates to be used without divergence, and makes training less sensitive to the weight initialization.
Batch Normalization helps by stabilizing the distribution of activations, smoothing the loss landscape, and reducing internal covariate shift, making the optimization task easier.
If you encounter exploding gradients (loss suddenly becomes NaN or skyrockets), especially in recurrent networks or very deep architectures, gradient clipping can be a useful tool. Most frameworks provide an easy way to implement this.
# Keras example: Clipping optimizer gradients
# Clip by global norm
adam_optimizer_clipped = tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)
# Or clip by value
# adam_optimizer_clipped = tf.keras.optimizers.Adam(learning_rate=0.001, clipvalue=0.5)
# history_adam_clipped = train_model(model, adam_optimizer_clipped, ...)
# PyTorch example: Clipping gradients manually
# # Inside training loop, after loss.backward():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# # Or clip by value
# # torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
# optimizer.step()
Experimenting with clipping is usually only worthwhile when you observe instability; a clipnorm value around 1.0 is a common starting point.
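Relatedly, if you want a run to stop as soon as the loss becomes NaN rather than wasting further epochs, Keras ships a TerminateOnNaN callback; the commented line below shows how it could be passed to model.fit (our simple train_model helper would need a small extension to forward callbacks).
# Keras example: abort training when the loss becomes NaN
# model.fit(x_train, y_train, epochs=20,
#           callbacks=[tf.keras.callbacks.TerminateOnNaN()])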
This practical session demonstrates several key aspects of tuning optimizers for deep networks:
- Plain SGD is robust but converges slowly; momentum helps noticeably.
- Adaptive methods such as RMSprop and Adam usually converge faster with their defaults, though the learning rate still needs attention.
- The learning rate is the single most influential hyperparameter, and schedules frequently improve on a fixed value.
- Sensible initialization and Batch Normalization make the optimization problem itself easier, whatever optimizer you use.
- Gradient clipping is a straightforward safeguard against exploding gradients.
Further ideas to explore include trying other optimizers covered in Chapter 3, varying the batch size, pairing a learning rate schedule with SGD plus momentum, or repeating these experiments on a deeper architecture.
Effective optimization in deep learning is often an empirical process. Understanding the principles behind different optimizers and techniques, combined with practical experimentation and careful monitoring of results, is essential for successfully training complex models.