While selecting appropriate initialization strategies and learning rate schedules sets a solid foundation for training, the performance of deep learning models often hinges on finding the right values for several critical hyperparameters. These are settings configured before the training process begins, unlike model parameters (weights and biases), which are learned during training. Tuning these hyperparameters is a fundamental part of the deep learning workflow, often requiring systematic exploration and experimentation.
In this section, we focus on three particularly influential hyperparameters: the learning rate (α), the regularization strength (λ), and the mini-batch size. Tuning these well can have a marked impact on your model's convergence speed, final performance, and ability to generalize to new data.
Recall that model parameters are the weights and biases within the network that the optimization algorithm adjusts during training to minimize the loss function. Hyperparameters, on the other hand, are external configurations that define the model's structure or the training process itself. Examples include:

- The learning rate (α) and the settings of any learning rate schedule
- The regularization strength (λ)
- The mini-batch size
- Architectural choices such as the number of layers and units per layer
- The choice of optimizer and the number of training epochs
Finding a good combination of hyperparameters is often more art than science, guided by experience, intuition, and iterative experimentation.
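To make the distinction concrete, the sketch below separates a hypothetical hyperparameter configuration from the parameters the optimizer updates. It uses PyTorch purely for illustration; the names and values in the config dictionary are placeholders, not recommendations.

```python
import torch
import torch.nn as nn

# Hyperparameters: choices fixed before training starts.
# (Values here are illustrative placeholders, not recommendations.)
config = {
    "learning_rate": 1e-3,   # alpha
    "weight_decay": 1e-4,    # lambda, the L2 regularization strength
    "batch_size": 64,
    "num_epochs": 20,
}

# Parameters: the weights and biases the optimizer adjusts during training.
model = nn.Linear(in_features=10, out_features=1)
optimizer = torch.optim.Adam(
    model.parameters(),                   # learned parameters
    lr=config["learning_rate"],           # hyperparameters stay fixed unless a
    weight_decay=config["weight_decay"],  # schedule changes them explicitly
)
```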
The learning rate is arguably the most important hyperparameter to tune. As discussed in previous chapters, it controls the step size taken during gradient descent.
Finding an effective learning rate often involves searching within a logarithmic range, such as 10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵. A common starting point for Adam is often around 10⁻³ or 10⁻⁴, while SGD with momentum might start around 10⁻². However, these are just heuristics, and the optimal value depends heavily on the dataset, model architecture, optimizer choice, and even the batch size.
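As a minimal sketch of that logarithmic search, the snippet below samples candidate learning rates by drawing the exponent uniformly between −5 and −1, so each order of magnitude is equally likely; the seed and the number of candidates are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Sample learning rates uniformly on a log scale between 1e-5 and 1e-1.
# Drawing the exponent uniformly gives every order of magnitude equal
# probability, unlike sampling the raw value uniformly.
exponents = rng.uniform(low=-5.0, high=-1.0, size=5)
candidate_lrs = 10.0 ** exponents

print(candidate_lrs)  # five candidates spread across 1e-5 ... 1e-1
```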
Learning rate schedules, covered previously, help by adjusting α during training, but the initial learning rate and the parameters of the schedule itself (e.g., decay rate, step size) still need careful selection. Monitoring the training loss curve is essential; a rapidly decreasing but stable loss suggests a good learning rate, while oscillations or divergence indicate it's likely too high.
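If you are working in PyTorch, a step decay schedule might be set up as sketched below; the initial learning rate, step_size, and gamma shown here are placeholders that would themselves need tuning.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# The initial learning rate and the schedule's own settings
# (step_size, gamma) are all hyperparameters to select.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... forward pass, loss.backward(), and optimizer.step() would go here ...
    scheduler.step()  # multiply the learning rate by gamma every step_size epochs
    current_lr = scheduler.get_last_lr()[0]  # inspect the decayed rate
```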
Illustration of training loss curves for different learning rates. A well-chosen rate shows steady decrease, while too small a rate converges slowly, and too large a rate causes instability or divergence.
Regularization techniques like L1 and L2 (Weight Decay), introduced in Chapter 2, add a penalty term to the loss function based on the magnitude of the model weights. The regularization strength, often denoted by λ (lambda), controls the weight of this penalty.
The total objective then becomes:

Total Loss = Original Loss (e.g., Cross-Entropy) + λ × Regularization Term

Similar to the learning rate, λ is often tuned on a logarithmic scale, exploring values like 0.1, 0.01, 0.001, 0.0001, and 0 (no regularization). The optimal value depends on the degree of overfitting observed without regularization. If the model overfits heavily (a large gap between training and validation loss/accuracy), a larger λ might be needed. If the model underfits, λ should be reduced or set to zero. Remember that other regularization techniques, such as Dropout and Batch Normalization, also influence the optimal λ.
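Assuming a PyTorch setup, the sketch below shows two common ways to apply this λ-weighted L2 penalty: through the optimizer's weight_decay argument, or by adding the penalty term to the loss explicitly. The model, synthetic data, and value of lam are illustrative only.

```python
import torch
import torch.nn as nn

model = nn.Linear(in_features=10, out_features=3)
criterion = nn.CrossEntropyLoss()
lam = 1e-4  # regularization strength (lambda), typically tuned on a log scale

# Option 1: many optimizers apply an L2-style penalty via weight_decay.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=lam)

# Option 2: add the penalty term to the loss explicitly.
inputs = torch.randn(8, 10)           # synthetic mini-batch
targets = torch.randint(0, 3, (8,))   # synthetic class labels
outputs = model(inputs)

l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
total_loss = criterion(outputs, targets) + lam * l2_penalty
total_loss.backward()  # gradients now include the regularization term
```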
The batch size determines how many training examples are processed before the model's weights are updated. It impacts both training dynamics and computational resource usage.
The choice of batch size is often constrained by GPU memory. Common practice involves starting with a standard size like 32, 64, or 128 and adjusting based on performance and memory constraints. It's also important to note the relationship between batch size and learning rate, which we explore in the next section. Powers of 2 are often chosen for batch sizes due to hardware memory alignment efficiencies, but this is not a strict requirement.
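In practice, the batch size is usually a single argument to the data loading pipeline. The sketch below assumes PyTorch's DataLoader; the synthetic dataset and the value 64 are placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data standing in for a real training set.
features = torch.randn(1024, 10)
labels = torch.randint(0, 3, (1024,))
dataset = TensorDataset(features, labels)

batch_size = 64  # common starting point; powers of 2 are conventional, not required
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

for batch_features, batch_labels in loader:
    # One forward/backward pass and weight update per mini-batch would go here.
    pass
```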
Finding the right combination of these hyperparameters is crucial for maximizing model performance. The next sections will discuss strategies like grid search and random search to navigate this complex tuning process more systematically.