After selecting a loss function to quantify how far your model's predictions are from the true targets, the next essential component in configuring the training process via model.compile() is the optimizer. If the loss function tells us how wrong the model is, the optimizer dictates how the model should adjust itself to become less wrong.
Think of training a neural network as trying to find the lowest point in a complex, high-dimensional landscape, where the height at any point represents the loss value for a given set of model parameters (weights and biases). The optimizer is the strategy you use to navigate this landscape and descend towards a minimum loss.
Most optimization strategies in deep learning are variants of gradient descent. The fundamental principle is straightforward:

1. Compute the loss for the current parameter values on a batch of training data.
2. Compute the gradient of the loss with respect to each parameter, which points in the direction of steepest increase in loss.
3. Adjust each parameter by a small fraction of its gradient, moving in the opposite direction.

This process is repeated iteratively, ideally guiding the parameters towards values that minimize the loss. The "fraction" mentioned in step 3 is controlled by a crucial hyperparameter called the learning rate (often denoted as α or η):

parameter_new = parameter_old − learning_rate × gradient

The learning rate determines the size of the steps taken during the descent.
Choosing an appropriate learning rate is critical for successful training: a value that is too small leads to slow convergence, while a value that is too large can cause the loss to oscillate or even diverge.
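To make the update concrete, here is a minimal sketch of a single manual gradient descent step for a one-parameter model using tf.GradientTape. The model, values, and variable names are purely illustrative; in practice, the Keras optimizers below perform this step for you.
import tensorflow as tf
# Illustrative example: one manual gradient descent step for a single parameter
w = tf.Variable(2.0)                   # a trainable parameter
x = tf.constant(3.0)                   # one input value
y_true = tf.constant(7.0)              # its target
learning_rate = 0.01
with tf.GradientTape() as tape:
    y_pred = w * x                     # prediction of a tiny linear model
    loss = tf.square(y_true - y_pred)  # squared-error loss
gradient = tape.gradient(loss, w)      # d(loss)/d(w)
w.assign_sub(learning_rate * gradient) # parameter_new = parameter_old - learning_rate * gradient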
While basic gradient descent provides the foundation, several more sophisticated optimizers have been developed to improve convergence speed and stability. TensorFlow's Keras API provides easy access to many of them. Here are some of the most frequently used:
Stochastic Gradient Descent (SGD) is the classic optimizer. Instead of calculating the gradient using the entire dataset (which is computationally expensive), SGD estimates the gradient using a small random subset of the data called a mini-batch.
Keras's SGD optimizer also supports enhancements such as momentum and Nesterov momentum:
import tensorflow as tf
# Basic SGD
sgd_optimizer_basic = tf.keras.optimizers.SGD(learning_rate=0.01)
# SGD with momentum
sgd_optimizer_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
# SGD with Nesterov momentum
sgd_optimizer_nesterov = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
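For intuition, momentum accumulates a "velocity" from past gradients and moves the parameters along it, which damps oscillations and speeds up progress in directions where gradients consistently agree; Nesterov momentum evaluates the gradient at a look-ahead position for a slightly more accurate step. Below is a simplified, per-parameter sketch of the plain momentum update; the values are illustrative and the exact Keras formulation may differ in minor details.
# Simplified momentum update for a single parameter (illustrative values)
momentum = 0.9
learning_rate = 0.01
velocity = 0.0       # running accumulation of past updates
parameter = 2.0
gradient = 1.5       # pretend this came from backpropagation
velocity = momentum * velocity - learning_rate * gradient  # blend old velocity with the new gradient
parameter = parameter + velocity                           # move along the accumulated velocity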
Adam is often the default choice for many deep learning tasks due to its effectiveness and relative ease of use. It's an adaptive learning rate optimizer, meaning it computes individual learning rates for different parameters.
It does this by keeping track of exponentially decaying averages of past gradients (first moment, like momentum) and past squared gradients (second moment, capturing the variance).
Key hyperparameters include learning_rate, beta_1 (decay rate for the first moment), beta_2 (decay rate for the second moment), and epsilon (a small value to prevent division by zero).
import tensorflow as tf
# Adam optimizer with default parameters (learning_rate=0.001)
adam_optimizer_default = tf.keras.optimizers.Adam()
# Adam optimizer with a custom learning rate
adam_optimizer_custom = tf.keras.optimizers.Adam(learning_rate=0.0005)
# Adam optimizer with custom beta values
adam_optimizer_betas = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.99)
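To make the moment tracking concrete, the sketch below shows a single, simplified Adam update for one parameter; the values are illustrative and some implementation details of the Keras version are omitted.
import math
# Simplified single Adam update for one parameter (illustrative values)
learning_rate, beta_1, beta_2, epsilon = 0.001, 0.9, 0.999, 1e-7
m, v = 0.0, 0.0          # first and second moment estimates
parameter = 2.0
gradient = 1.5           # pretend this came from backpropagation
t = 1                    # current time step
m = beta_1 * m + (1 - beta_1) * gradient         # decaying average of gradients
v = beta_2 * v + (1 - beta_2) * gradient ** 2    # decaying average of squared gradients
m_hat = m / (1 - beta_1 ** t)                    # bias correction for early steps
v_hat = v / (1 - beta_2 ** t)
parameter -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)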
RMSprop is another adaptive learning rate method that also maintains a moving average of the squared gradients. It divides the learning rate by the square root of this average. This effectively adapts the learning rate per parameter, decreasing it for parameters with large gradients and increasing it for parameters with small gradients.
import tensorflow as tf
# RMSprop optimizer with default parameters (learning_rate=0.001)
rmsprop_optimizer_default = tf.keras.optimizers.RMSprop()
# RMSprop optimizer with explicitly set rho (discounting factor for the moving average) and momentum
rmsprop_optimizer_custom = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.1)
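The core RMSprop idea can be sketched the same way. This simplified single-parameter update (illustrative values, with momentum and other Keras implementation details omitted) shows the division by the root of the squared-gradient average described above.
import math
# Simplified single RMSprop update for one parameter (illustrative values)
learning_rate, rho, epsilon = 0.001, 0.9, 1e-7
avg_sq_grad = 0.0
parameter = 2.0
gradient = 1.5           # pretend this came from backpropagation
avg_sq_grad = rho * avg_sq_grad + (1 - rho) * gradient ** 2  # moving average of squared gradients
parameter -= learning_rate * gradient / (math.sqrt(avg_sq_grad) + epsilon)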
Keras offers other optimizers like Adagrad, Adadelta, Adamax, and Nadam. While less frequently used as initial choices compared to Adam or SGD, they have specific properties that might be beneficial for certain types of data or network architectures (e.g., Adagrad for sparse data).
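These optimizers are created in the same way as those above; the snippet below simply instantiates them with their default hyperparameters (the variable names are illustrative):
import tensorflow as tf
# Other built-in Keras optimizers, created with default hyperparameters
adagrad_optimizer = tf.keras.optimizers.Adagrad()    # adapts per-parameter rates; often useful for sparse data
adadelta_optimizer = tf.keras.optimizers.Adadelta()  # Adagrad variant that limits the continual rate decay
adamax_optimizer = tf.keras.optimizers.Adamax()      # Adam variant based on the infinity norm
nadam_optimizer = tf.keras.optimizers.Nadam()        # Adam combined with Nesterov momentum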
Selecting the best optimizer often involves some experimentation. Adam with its default settings is a reasonable starting point for most problems, and tuning the learning rate is usually the single most impactful adjustment. It can also help to decay the learning rate as training progresses; learning rate schedules from tf.keras.optimizers.schedules can implement this.
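As a brief illustration, a schedule object such as ExponentialDecay can be passed in place of a fixed learning rate (the decay values here are arbitrary examples):
import tensorflow as tf
# Decay the learning rate from 0.001 by a factor of 0.96 every 10,000 training steps
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=10000,
    decay_rate=0.96)
# The schedule is passed where a fixed learning rate would normally go
adam_with_schedule = tf.keras.optimizers.Adam(learning_rate=lr_schedule)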
You integrate your chosen optimizer when compiling the model with model.compile(). You can specify the optimizer using its string identifier (if using default parameters) or by creating an optimizer instance (if you need to customize parameters like the learning rate).
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Assume 'model' is a defined Keras model (e.g., Sequential or Functional)
# model = keras.Sequential([...])
# Using string identifier (uses default parameters)
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
# Using an optimizer instance with a custom learning rate
custom_adam = tf.keras.optimizers.Adam(learning_rate=0.0005)
model.compile(optimizer=custom_adam,
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
# Using SGD with momentum
custom_sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=custom_sgd,
loss='mean_squared_error', # Example for regression
metrics=['mae']) # Mean Absolute Error metric
By choosing an appropriate optimizer and configuring its parameters (especially the learning rate), you provide the mechanism for your model to effectively learn from the data and minimize the chosen loss function. Along with the loss and metrics, the optimizer completes the core configuration needed before initiating the training process with model.fit().