Training deep neural networks involves navigating complex, high-dimensional loss surfaces riddled with challenges like saddle points, plateaus, and sharp minima. The choice of optimizer and its hyperparameters, combined with techniques like normalization and appropriate initialization, significantly impacts whether training converges successfully and efficiently. This section provides hands-on exercises to observe these phenomena and develop intuition for tuning optimizers in a deep learning context. We'll use a common framework like TensorFlow/Keras or PyTorch and a standard dataset like CIFAR-10 (or a subset) to make things concrete. We assume you have a working environment and familiarity with basic model definition and training loops in your chosen framework.

Setting Up the Experiment

First, let's define a simple Convolutional Neural Network (CNN) for image classification. A typical small CNN might consist of a few convolutional layers with ReLU activations, followed by max-pooling, and finally one or two fully connected layers leading to the output classification layer (e.g., softmax for CIFAR-10).

```python
# Example using TensorFlow/Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Assume num_classes is defined (e.g., 10 for CIFAR-10)
model = keras.Sequential(
    [
        keras.Input(shape=(32, 32, 3)),  # Input shape for CIFAR-10
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),  # Regularization
        layers.Dense(num_classes, activation="softmax"),
    ]
)

# Load CIFAR-10 data (x_train, y_train), (x_test, y_test)
# Preprocess data (normalize pixel values, one-hot encode labels)
# ... data loading and preprocessing code ...
```

We also need a function to compile and train the model, allowing us to easily switch optimizers and hyperparameters.

```python
# Example using TensorFlow/Keras
def train_model(model, optimizer, x_train, y_train, validation_data,
                epochs=20, batch_size=64):
    model.compile(loss="categorical_crossentropy",
                  optimizer=optimizer,
                  metrics=["accuracy"])
    history = model.fit(x_train, y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        validation_data=validation_data,
                        verbose=0)  # Set verbose=1 to see progress per epoch
    return history
```

```python
# Example using PyTorch
# import torch
# import torch.optim as optim
# import torch.nn as nn
#
# # Assume model, train_loader, val_loader, criterion (e.g., nn.CrossEntropyLoss) are defined
# def train_model_pytorch(model, optimizer, train_loader, val_loader, criterion, epochs=20):
#     train_losses, val_losses = [], []
#     train_accs, val_accs = [], []
#     # ... typical PyTorch training loop ...
#     # for epoch in range(epochs):
#     #     model.train()
#     #     ... training steps ...
#     #     model.eval()
#     #     with torch.no_grad():
#     #         ... validation steps ...
#     #     # Store metrics (loss, accuracy) for plotting
#     # return history_dict  # Dictionary containing lists of metrics
```
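To make the data loading and preprocessing placeholder above concrete, here is one way to do it in Keras. Treat this as a minimal sketch: the variable names mirror the placeholders above, num_classes is assumed to be 10, and for simplicity the test split doubles as the validation set.

```python
# Minimal sketch: loading and preprocessing CIFAR-10 with Keras
num_classes = 10

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# Normalize pixel values to [0, 1]
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# One-hot encode labels for categorical_crossentropy
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

validation_data = (x_test, y_test)  # For simplicity, use the test split as validation data
```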
Baseline: Standard SGD

Let's start with Stochastic Gradient Descent (SGD), potentially with momentum. As discussed in Chapter 1, SGD uses noisy gradient estimates from mini-batches. While effective, it can be slow to converge on complex problems, especially without momentum, or might get stuck in suboptimal areas.

```python
# Keras example
sgd_optimizer_nomomentum = tf.keras.optimizers.SGD(learning_rate=0.01)
sgd_optimizer_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Train with SGD (e.g., with momentum)
# history_sgd = train_model(model, sgd_optimizer_momentum, ...)

# PyTorch example
# sgd_optimizer_nomomentum = optim.SGD(model.parameters(), lr=0.01)
# sgd_optimizer_momentum = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# history_sgd = train_model_pytorch(model, sgd_optimizer_momentum, ...)
```

Train the network using SGD (try both with and without momentum). Plot the training and validation loss and accuracy over epochs. You'll likely observe relatively slow convergence compared to adaptive methods, and the final accuracy might not be optimal within a limited number of epochs. The curves might also be quite noisy, reflecting the stochastic nature of the updates.

Comparing Adaptive Optimizers: Adam and RMSprop

Now, let's try the adaptive learning rate methods discussed in Chapter 3, such as RMSprop and Adam. These methods maintain per-parameter learning rates based on moving averages of past squared gradients (RMSprop) or both first and second moments (Adam). They are often the default choice for deep learning tasks due to their generally fast convergence.

```python
# Keras example
rmsprop_optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)  # Default lr is often 0.001
adam_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)        # Default lr is often 0.001

# Train with RMSprop and Adam
# model_rmsprop = build_model()  # Re-initialize model weights
# history_rmsprop = train_model(model_rmsprop, rmsprop_optimizer, ...)
# model_adam = build_model()     # Re-initialize model weights
# history_adam = train_model(model_adam, adam_optimizer, ...)

# PyTorch example
# rmsprop_optimizer = optim.RMSprop(model.parameters(), lr=0.001)
# adam_optimizer = optim.Adam(model.parameters(), lr=0.001)
# history_rmsprop = train_model_pytorch(...)  # Re-initialize model
# history_adam = train_model_pytorch(...)     # Re-initialize model
```

Train the same architecture (with re-initialized weights each time) using RMSprop and Adam with their common default learning rates (often $10^{-3}$). Plot the training/validation loss and accuracy curves for SGD (with momentum), RMSprop, and Adam on the same graph for comparison.

[Chart: Optimizer Comparison, training loss vs. epoch (log scale) for SGD (lr=0.01, m=0.9), RMSprop (lr=0.001), and Adam (lr=0.001)]

Comparison of typical training loss curves for SGD with momentum, RMSprop, and Adam on a sample CNN task. Adaptive methods often converge faster initially. Note the logarithmic scale for the y-axis to better visualize differences.

You should generally observe that Adam and RMSprop converge significantly faster in the initial epochs compared to SGD. However, monitor the validation performance closely. Sometimes, adaptive methods might converge quickly to a sharp minimum that generalizes slightly worse than the minimum found by SGD (though this is problem-dependent).
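To generate this comparison plot yourself, you can read the per-epoch metrics recorded in the History objects returned by train_model. Here is a minimal matplotlib sketch, assuming the history_sgd, history_rmsprop, and history_adam variables from the snippets above:

```python
# Minimal plotting sketch (assumes the history_* objects from the runs above)
import matplotlib.pyplot as plt

histories = {
    "SGD (momentum)": history_sgd,
    "RMSprop": history_rmsprop,
    "Adam": history_adam,
}

for name, history in histories.items():
    plt.plot(history.history["loss"], label=name)  # Keras stores per-epoch metrics here

plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.yscale("log")  # Log scale makes small differences easier to see
plt.legend()
plt.show()
```

Swapping "loss" for "val_loss", "accuracy", or "val_accuracy" produces the corresponding validation and accuracy plots.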
Tuning Optimizer Hyperparameters

Adaptive optimizers are not magic bullets; their performance still hinges on hyperparameters, primarily the learning rate ($\eta$). Let's experiment with Adam's learning rate.

Train the model using Adam with different learning rates, for example: $\eta = 10^{-2}$, $\eta = 10^{-3}$ (default), and $\eta = 10^{-4}$.

```python
# Keras example
adam_lr_high = tf.keras.optimizers.Adam(learning_rate=0.01)
adam_lr_low = tf.keras.optimizers.Adam(learning_rate=0.0001)
# history_adam_high = train_model(...)  # Re-initialize model
# history_adam_low = train_model(...)   # Re-initialize model

# PyTorch example
# adam_lr_high = optim.Adam(model.parameters(), lr=0.01)
# adam_lr_low = optim.Adam(model.parameters(), lr=0.0001)
# history_adam_high = train_model_pytorch(...)  # Re-initialize model
# history_adam_low = train_model_pytorch(...)   # Re-initialize model
```

Plot the training/validation loss curves for these different learning rates alongside the default.

[Chart: Effect of Adam Learning Rate, validation loss vs. epoch for $\eta = 0.01$, $\eta = 0.001$ (default), and $\eta = 0.0001$]

Effect of varying the learning rate for the Adam optimizer. A rate that is too high ($\eta=0.01$) can lead to divergence or unstable training. A rate that is too low ($\eta=0.0001$) converges very slowly. The default ($\eta=0.001$) often provides a good starting point.

- Too High ($\eta=10^{-2}$): The loss might oscillate wildly, decrease very slowly, or even increase. The optimizer steps are too large, constantly overshooting minima.
- Too Low ($\eta=10^{-4}$): Convergence will be very slow. You might reach a good solution eventually, but it will take many more epochs.
- Just Right ($\eta=10^{-3}$): Often a good starting point, balancing convergence speed and stability.

While Adam has other hyperparameters ($\beta_1, \beta_2, \epsilon$), the learning rate is almost always the first and most important one to tune. You might also incorporate learning rate schedules (Chapter 3), such as reducing the learning rate by a factor every few epochs (step decay) or using cosine annealing. This is often beneficial, especially when combined with SGD or Adam, allowing larger steps initially and finer adjustments later.

```python
# Keras example: Using a scheduler
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=10000,  # Adjust based on dataset size/batch size
    decay_rate=0.9)
adam_optimizer_scheduled = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
# history_adam_scheduled = train_model(model, adam_optimizer_scheduled, ...)
```
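If you want to try the cosine annealing option mentioned above, Keras also ships a built-in cosine schedule. This is a minimal sketch; the decay_steps value is an assumption and should be set to roughly the total number of optimizer steps you plan to train for.

```python
# Minimal sketch: cosine annealing schedule in Keras
cosine_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.001,
    decay_steps=20 * (50000 // 64))  # Assumption: ~20 epochs of CIFAR-10 at batch size 64
adam_optimizer_cosine = tf.keras.optimizers.Adam(learning_rate=cosine_schedule)
# history_adam_cosine = train_model(model, adam_optimizer_cosine, ...)
```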
```python
# PyTorch example: Using a scheduler
# optimizer = optim.Adam(model.parameters(), lr=0.001)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # Halve LR every 10 epochs
# # Inside training loop:
# # optimizer.step()
# # scheduler.step()  # Call after optimizer step
```

Interaction with Initialization and Normalization

Let's revisit two important topics from earlier in this chapter: initialization and normalization. How do they interact with our optimizer choice?

Initialization: Try training your network with Adam ($\eta=0.001$) but using a poor initialization strategy (e.g., initializing all weights to zero or very small random numbers without proper scaling like He or Xavier/Glorot). You'll likely observe that the network fails to learn effectively, regardless of the optimizer. Gradients might vanish or explode right from the start, or symmetry issues might prevent learning. Compare this to training with proper He initialization (standard for ReLU activations). This highlights that even a sophisticated optimizer cannot easily overcome fundamental issues caused by poor initialization. (A short sketch showing how to specify these initializers in Keras appears just before the summary.)

Batch Normalization: Now, add Batch Normalization layers after the convolutional layers (or dense layers, before activation) in your network architecture.

```python
# Keras example with Batch Norm
model_bn = keras.Sequential(
    [
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, kernel_size=(3, 3)),
        layers.BatchNormalization(),  # Add BN
        layers.Activation("relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        # ... other layers with BN ...
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)
# history_adam_bn = train_model(model_bn, adam_optimizer, ...)  # Use same Adam config
```

Train this modified network using the same Adam optimizer ($\eta=0.001$). Compare its training/validation curves to the original network without Batch Normalization.

You should see that Batch Normalization often:

- Accelerates convergence significantly.
- Makes the training process less sensitive to the choice of learning rate (you might even be able to use a higher learning rate successfully).
- Provides some regularization effect.

Batch Normalization helps by stabilizing the distribution of activations, smoothing the loss landscape, and reducing internal covariate shift, making the optimization task easier.

Handling Gradients: Clipping

If you encounter exploding gradients (loss suddenly becomes NaN or skyrockets), especially in recurrent networks or very deep architectures, gradient clipping can be a useful tool. Most frameworks provide an easy way to implement this.

```python
# Keras example: Clipping optimizer gradients
# Clip by global norm
adam_optimizer_clipped = tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)
# Or clip by value
# adam_optimizer_clipped = tf.keras.optimizers.Adam(learning_rate=0.001, clipvalue=0.5)
# history_adam_clipped = train_model(model, adam_optimizer_clipped, ...)

# PyTorch example: Clipping gradients manually
# # Inside training loop, after loss.backward():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# # Or clip by value
# # torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
# optimizer.step()
```

Experimenting with clipping is usually done if instability is observed. A clipnorm value around 1.0 is a common starting point.
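As noted in the initialization discussion above, here is one way to run that experiment in Keras: pass a kernel_initializer to each layer. This is a minimal sketch; only the initializer argument differs from the earlier model definition, and the string values "zeros" and "he_normal" are standard Keras initializer names.

```python
# Minimal sketch: controlling weight initialization for the experiment above
# Use "zeros" to see the failure mode, "he_normal" for proper He initialization
init = "zeros"  # or "he_normal"

model_init = keras.Sequential(
    [
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, (3, 3), activation="relu", kernel_initializer=init),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", kernel_initializer=init),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax", kernel_initializer=init),
    ]
)
# history_adam_init = train_model(model_init, adam_optimizer, ...)  # Same Adam (lr=0.001)
```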
Summary and Further Experiments

This practical session demonstrates several important aspects of tuning optimizers for deep networks:

- Adaptive methods like Adam and RMSprop often provide faster initial convergence than standard SGD, making them popular defaults.
- The learning rate is the most critical hyperparameter to tune for any optimizer. Defaults are starting points, not guaranteed optima.
- Techniques like proper weight initialization and Batch Normalization are not optimizer features themselves, but they significantly ease the optimization process, often enabling faster and more stable convergence.
- Learning rate schedules are commonly used to improve final performance by reducing the learning rate as training progresses.
- Gradient clipping is a remedy for exploding gradients, particularly relevant in specific architectures.

Further ideas to explore:

- Compare Adam with its variants like Nadam or AMSGrad (Chapter 3).
- Implement different learning rate schedules (cosine annealing, cyclical LR).
- Try tuning SGD with momentum and a well-chosen learning rate schedule; it can sometimes achieve better generalization than Adam on certain tasks, albeit often requiring more careful tuning.
- Use more systematic hyperparameter optimization techniques like random search or Bayesian optimization (Chapter 7) if computational resources allow (a minimal random-search sketch appears at the end of this section).
- Observe the effect of different batch sizes on optimizer performance and noise (Chapter 4).

Effective optimization in deep learning is often an empirical process. Understanding the principles behind different optimizers and techniques, combined with practical experimentation and careful monitoring of results, is essential for successfully training complex models.
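As a concrete starting point for the random-search idea in the list above, here is a minimal, hypothetical sketch that samples Adam learning rates on a log scale. It assumes a build_model() helper that returns a freshly initialized copy of the CNN used throughout this section, plus the train_model function and preprocessed data defined earlier.

```python
# Minimal sketch: random search over Adam's learning rate
# Assumes build_model() returns a freshly initialized CNN and train_model is defined as above
import numpy as np

best_lr, best_val_acc = None, 0.0
for trial in range(5):
    lr = 10 ** np.random.uniform(-4, -2)  # Sample log-uniformly between 1e-4 and 1e-2
    model_trial = build_model()
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    history = train_model(model_trial, optimizer, x_train, y_train,
                          validation_data=(x_test, y_test), epochs=10)
    val_acc = max(history.history["val_accuracy"])
    print(f"lr={lr:.5f} -> best val accuracy {val_acc:.3f}")
    if val_acc > best_val_acc:
        best_lr, best_val_acc = lr, val_acc

print(f"Best learning rate found: {best_lr:.5f}")
```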