Once you've chosen a loss function to quantify how wrong your model's predictions are, the next question is: how do you actually reduce that loss? This is where optimization algorithms come in. Their job is to systematically adjust the model's parameters (weights and biases) in response to the loss signal, iteratively improving the model's performance.
Think of the training process as trying to find the lowest point in a complex, high-dimensional landscape, where the "height" at any point represents the loss value for a given set of model weights. The optimization algorithm is your guide for navigating this landscape.
The most fundamental optimization technique is Gradient Descent (GD). In the previous section, "The Compilation Step," we mentioned that compiling involves specifying an optimizer. This optimizer relies on the gradients calculated during backpropagation (which we'll cover conceptually next).
The gradient of the loss function with respect to the model parameters tells us the direction of the steepest ascent in the loss landscape. To minimize the loss, we want to move in the opposite direction, the direction of steepest descent.
Gradient Descent works by calculating the gradient of the loss function for the entire training dataset and then taking a step downhill:
New Weights = Old Weights − Learning Rate × Gradient

The learning rate (α) is a small scalar value (e.g., 0.01 or 0.001) that controls the size of the steps we take. A learning rate that's too small leads to slow convergence, while one that's too large can cause the optimization process to overshoot the minimum or even diverge.
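As a minimal sketch, here is that update rule in NumPy. The weight and gradient values are illustrative placeholders; in practice the gradient comes from backpropagation:

import numpy as np

learning_rate = 0.01
weights = np.array([0.5, -0.3])    # current parameter values (illustrative)
gradient = np.array([0.2, -0.1])   # dLoss/dWeights, assumed given
weights = weights - learning_rate * gradient  # step in the downhill direction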
While conceptually simple, calculating the gradient over the entire dataset (Batch Gradient Descent) can be computationally very expensive, especially for large datasets. This leads us to more practical variants.
Instead of using the entire dataset for each weight update, Stochastic Gradient Descent (SGD) updates the weights based on the gradient computed from just one randomly chosen training sample at a time. This makes each update much faster but also much noisier, as the gradient from a single sample might not be representative of the overall loss landscape.
A common and highly effective compromise is Mini-Batch Gradient Descent. Here, the gradient is calculated and weights are updated based on a small, randomly selected subset of the training data, called a mini-batch. Typical batch sizes range from 32 to 256 samples. This approach balances the computational efficiency of SGD with the more stable convergence of Batch GD. In deep learning practice, "SGD" almost always refers to mini-batch gradient descent.
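In Keras you rarely implement this loop yourself: the batch_size argument to fit determines how many samples contribute to each weight update. A sketch, assuming a compiled model and prepared training arrays x_train and y_train:

# Each weight update uses a mini-batch of 64 samples
# model.fit(x_train, y_train, batch_size=64, epochs=10)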
Figure: Simplified 2D view showing how different gradient descent variants might navigate a loss surface toward the minimum at (0, 0). Batch GD takes a direct path, SGD is noisy, and Mini-Batch offers a balance.
Momentum: A popular enhancement to SGD is momentum. It helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction of the previous update vector to the current update vector, accumulating velocity in directions of persistent gradients.
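A rough NumPy sketch of a single momentum update, using the common velocity formulation (the weight and gradient values are illustrative placeholders; the velocity persists across steps):

import numpy as np

momentum = 0.9
learning_rate = 0.01
weights = np.array([0.5, -0.3])
gradient = np.array([0.2, -0.1])
velocity = np.zeros_like(weights)  # carried over between updates

# One update step, given the current gradient:
velocity = momentum * velocity - learning_rate * gradient
weights = weights + velocity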
In Keras, you can use SGD with momentum like this:
import keras
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
# model.compile(optimizer=optimizer, ...)
While SGD (with momentum) is a solid algorithm, tuning its learning rate can sometimes be challenging. Adaptive algorithms automatically adjust the learning rate during training, often requiring less manual tuning.
Adam (Adaptive Moment Estimation) is arguably the most popular optimization algorithm in deep learning today. It's often a good default choice and works well across a wide range of problems. Adam computes adaptive learning rates for each parameter. Conceptually, it combines two main ideas:

1. Momentum: it maintains an exponentially decaying average of past gradients (the first moment), which smooths the update direction.
2. Adaptive scaling: like RMSprop, it maintains an exponentially decaying average of past squared gradients (the second moment).

It uses these averages to scale the update for each parameter. Parameters receiving large or frequent gradients have their effective step size reduced, while parameters with small or infrequent gradients take relatively larger steps.
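A simplified NumPy sketch of a single Adam step makes this concrete. The beta1, beta2, and eps values are the standard published defaults; the weight and gradient values are illustrative placeholders:

import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
learning_rate = 0.001
weights = np.array([0.5, -0.3])
gradient = np.array([0.2, -0.1])
m = np.zeros_like(weights)  # first moment: decaying average of gradients
v = np.zeros_like(weights)  # second moment: decaying average of squared gradients
t = 1                       # step counter, starting at 1

m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient**2
m_hat = m / (1 - beta1**t)  # bias correction for the zero-initialized averages
v_hat = v / (1 - beta2**t)
weights = weights - learning_rate * m_hat / (np.sqrt(v_hat) + eps)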
Using Adam in Keras is straightforward:
import keras
optimizer = keras.optimizers.Adam(learning_rate=0.001)  # 0.001 is the Keras default
# model.compile(optimizer=optimizer, ...)
RMSprop is another adaptive learning rate algorithm; it divides the learning rate by an exponentially decaying average of squared gradients. It actually predates Adam, which builds directly on the same idea of second-moment scaling. It often performs well, particularly on recurrent neural networks.
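A rough NumPy sketch of one RMSprop update (rho and eps follow common defaults; the weight and gradient values are illustrative placeholders):

import numpy as np

rho, eps = 0.9, 1e-7
learning_rate = 0.001
weights = np.array([0.5, -0.3])
gradient = np.array([0.2, -0.1])
sq_avg = np.zeros_like(weights)  # decaying average of squared gradients

sq_avg = rho * sq_avg + (1 - rho) * gradient**2
weights = weights - learning_rate * gradient / (np.sqrt(sq_avg) + eps)

The Keras usage mirrors the other optimizers: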
import keras
optimizer = keras.optimizers.RMSprop(learning_rate=0.001)
# model.compile(optimizer=optimizer, ...)
So, which optimizer should you use?
Adam with its default settings is a sensible starting point, and SGD with momentum can sometimes match or exceed it when its learning rate is carefully tuned. Still, experimentation is often necessary: the performance of an optimizer can depend heavily on the specific problem, dataset, and model architecture. Keep in mind that techniques like learning rate scheduling (adjusting the learning rate during training, often managed via Callbacks discussed later) can significantly impact the performance of any optimizer.
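As a preview of one common scheduling technique, here is a sketch using Keras's built-in ReduceLROnPlateau callback, which lowers the learning rate when the validation loss stops improving (the fit arguments are placeholders):

import keras

# Halve the learning rate if val_loss hasn't improved for 3 epochs
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=20, callbacks=[reduce_lr])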
In the compile step, you simply pass an instance of your chosen optimizer class to the optimizer argument:
# Example using Adam
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Example using SGD with momentum
# model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
#               loss='categorical_crossentropy',
#               metrics=['accuracy'])
With the loss function measuring how wrong the model is and the optimizer knowing how to adjust the weights to reduce that error, we now need to understand the mechanism that calculates the necessary adjustments: backpropagation.