Training autoencoders, especially the more complex architectures discussed previously or those with intricate loss functions like the Evidence Lower Bound (ELBO) in VAEs, involves minimizing a potentially high-dimensional, non-convex loss function $L$. While basic Stochastic Gradient Descent (SGD) provides a foundation, it can converge slowly, get stuck in suboptimal local minima or saddle points, and be highly sensitive to the choice of learning rate. To train deep autoencoder models effectively and find meaningful representations, we often turn to more sophisticated optimization algorithms that adapt the learning process.
These advanced optimizers aim to navigate the loss landscape more intelligently, accelerating convergence and improving the quality of the final model. They typically achieve this by incorporating information beyond the current gradient, such as past gradients or their magnitudes.
Standard SGD updates parameters solely based on the gradient calculated for the current mini-batch. This can lead to noisy updates and oscillations, particularly in areas where the loss surface curves steeply in one direction but gently in another (common in deep networks).
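As a reference point, the plain SGD update can be sketched in a couple of lines. This is a minimal NumPy illustration, with `theta`, `grad`, and `lr` as stand-in names for the parameters, the mini-batch gradient, and the learning rate:

import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # Plain SGD: step against the current mini-batch gradient only
    return theta - lr * grad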
Momentum addresses this by adding a fraction of the previous update vector to the current one. Think of it like a ball rolling down a hill: it accumulates momentum, smoothing out the path and speeding up movement in consistent directions while dampening oscillations across directions with rapidly changing gradients.
The update incorporates a "velocity" term $v_t$, which is an exponentially decaying moving average of past gradients:

$$v_t = \beta v_{t-1} + (1 - \beta)\,\nabla_\theta L(\theta_t)$$
$$\theta_{t+1} = \theta_t - \alpha v_t$$

Here, $\theta$ represents the model parameters, $\alpha$ is the learning rate, $\nabla_\theta L(\theta_t)$ is the gradient of the loss function $L$ with respect to the parameters $\theta$ at step $t$, and $\beta$ is the momentum coefficient (typically set around 0.9). This allows the optimizer to "remember" past gradient directions and continue moving in those directions even if the current gradient is small or noisy.
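The same rule can be written out directly. The sketch below is a minimal NumPy illustration of the exponentially weighted form above, applied to a toy one-dimensional quadratic loss; the names `sgd_momentum_step`, `theta`, and `velocity` are illustrative, not taken from any library:

import numpy as np

def sgd_momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    # v_t = beta * v_{t-1} + (1 - beta) * gradient
    velocity = beta * velocity + (1.0 - beta) * grad
    # theta_{t+1} = theta_t - alpha * v_t
    theta = theta - lr * velocity
    return theta, velocity

# Toy usage: minimize L(theta) = 0.5 * theta^2, whose gradient is simply theta
theta, velocity = np.array([5.0]), np.zeros(1)
for _ in range(100):
    grad = theta
    theta, velocity = sgd_momentum_step(theta, velocity, grad)
print(theta)  # moves steadily toward 0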
A fixed learning rate $\alpha$ poses challenges. If it's too small, training is slow. If it's too large, the optimizer might overshoot minima or diverge. Adaptive learning rate algorithms adjust the learning rate during training, often on a per-parameter basis.
AdaGrad adapts the learning rate for each parameter individually, performing smaller updates for parameters associated with frequently occurring features and larger updates for parameters associated with infrequent features. It achieves this by dividing the learning rate by the square root of the sum of all past squared gradients for that parameter.
While effective in some scenarios, AdaGrad's main drawback is that the accumulated sum of squared gradients in the denominator grows continuously during training. This causes the learning rate to shrink monotonically, sometimes becoming infinitesimally small before convergence is truly reached, effectively stopping learning prematurely.
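A minimal sketch of the AdaGrad rule makes this shrinking-step behavior concrete. This is a NumPy illustration with names of our choosing; `accum` holds the running sum of squared gradients:

import numpy as np

def adagrad_step(theta, accum, grad, lr=0.1, eps=1e-8):
    # The accumulator only ever grows, so the effective step lr / sqrt(accum) shrinks
    accum = accum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

theta, accum = np.array([5.0]), np.zeros(1)
for _ in range(200):
    grad = theta  # gradient of the toy loss L(theta) = 0.5 * theta^2
    theta, accum = adagrad_step(theta, accum, grad)
print(0.1 / (np.sqrt(accum) + 1e-8))  # effective step size, now far smaller than lr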
RMSprop modifies AdaGrad to resolve its monotonically decreasing learning rate issue. Instead of accumulating all past squared gradients, RMSprop uses an exponentially decaying average of squared gradients. This means recent gradient information is weighted more heavily, preventing the denominator from growing indefinitely and allowing the learning rate to adapt more dynamically.
The update involves maintaining a moving average of squared gradients, $E[g^2]_t$:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)\,(\nabla_\theta L(\theta_t))^2$$
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}}\,\nabla_\theta L(\theta_t)$$

Here, $\gamma$ is the decay rate (similar to $\beta$ in momentum, often around 0.9), and $\epsilon$ is a small constant for numerical stability (e.g., $10^{-8}$).
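For comparison with the AdaGrad sketch above, the same step with an exponentially decaying average looks like this (again a minimal NumPy illustration; the function and variable names are ours, not a library API):

import numpy as np

def rmsprop_step(theta, sq_avg, grad, lr=0.01, gamma=0.9, eps=1e-8):
    # E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * grad^2
    sq_avg = gamma * sq_avg + (1.0 - gamma) * grad ** 2
    # The denominator stays bounded because old gradients are gradually forgotten
    theta = theta - lr * grad / np.sqrt(sq_avg + eps)
    return theta, sq_avg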
Adam (Adaptive Moment Estimation) is arguably the most popular, and often the default, optimization algorithm in deep learning today. It combines the ideas of momentum (via an estimate of the first moment of the gradients) and adaptive learning rates (via an estimate of the second moment, as in RMSprop).
Adam maintains two exponentially decaying moving averages:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,\nabla_\theta L(\theta_t)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,(\nabla_\theta L(\theta_t))^2$$

Since $m_t$ and $v_t$ are initialized as zeros, they are biased towards zero, especially during the initial steps. Adam performs bias correction to counteract this:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

The final parameter update uses these bias-corrected estimates:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

Common default values are $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. Adam often works well with minimal tuning across a wide range of problems, including training complex autoencoders.
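Putting the two moment estimates and the bias correction together, one Adam step can be sketched as follows. This is a minimal NumPy illustration of the equations above, not the implementation used by any particular framework; `t` is the 1-based step counter:

import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad          # first moment estimate m_t
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second moment estimate v_t
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v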
A refinement often seen in practice is AdamW. It modifies the standard Adam algorithm by decoupling the weight decay (L2 regularization) from the gradient update step. In standard Adam, weight decay becomes implicitly linked to the adaptive learning rates, which can sometimes lead to suboptimal regularization. AdamW applies the weight decay directly to the weights after the Adam update step, which often results in better generalization performance. Many deep learning libraries now offer AdamW as a distinct optimizer.
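Relative to the Adam sketch above, the difference amounts to one extra line: the decay term is applied directly to the weights, scaled by the learning rate but not by the adaptive denominator. This is again an illustrative sketch, with the decay applied after the Adam update; library implementations may order these operations slightly differently:

import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # standard Adam update
    theta = theta - lr * weight_decay * theta             # decoupled weight decay
    return theta, m, v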
Conceptual illustration of how different optimizers might navigate a loss landscape. SGD can oscillate, Momentum smooths the path, and adaptive methods like Adam often converge faster and more reliably.
For training autoencoders, Adam or AdamW are typically excellent starting points due to their robustness and generally good performance across various architectures and datasets. RMSprop can also be effective. While standard SGD with momentum can work, it often requires more careful tuning of the learning rate and momentum parameter.
Implementing these optimizers is straightforward in modern deep learning frameworks like TensorFlow and PyTorch:
# PyTorch Example
import torch.optim as optim
# model is your defined autoencoder architecture
# learning_rate is your chosen learning rate
# Using AdamW
optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
# Using RMSprop
# optimizer = optim.RMSprop(model.parameters(), lr=learning_rate, alpha=0.9)
# Using SGD with Momentum
# optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)
# --- Training loop ---
# for data in dataloader:
# inputs, _ = data
# optimizer.zero_grad()
# outputs, latent_vars = model(inputs) # Or however your model forward pass works
# loss = calculate_loss(outputs, inputs, ...) # Calculate appropriate loss
# loss.backward()
# optimizer.step()
# ---------------------
# TensorFlow/Keras Example
import tensorflow as tf
# model is your defined Keras autoencoder model
# learning_rate is your chosen learning rate
# Using AdamW (available via tensorflow_addons or built-in in newer TF versions)
# from tensorflow.keras.optimizers import AdamW # Check TF version
# optimizer = AdamW(learning_rate=learning_rate, weight_decay=0.01)
# Using Adam (more common)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
# Using RMSprop
# optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate, rho=0.9)
# Using SGD with Momentum
# optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.9)
# model.compile(optimizer=optimizer, loss=your_loss_function)
# model.fit(x_train, x_train, epochs=num_epochs, batch_size=batch_size, ...)
While default hyperparameters for optimizers like Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$) often work well, the learning rate ($\alpha$) remains a critical hyperparameter that usually requires tuning. The effectiveness of an optimizer is also intertwined with other aspects of training, such as batch size, regularization strength, and importantly, the learning rate schedule, which we will discuss next. Selecting the right optimizer and tuning it appropriately are significant steps toward successfully training performant autoencoder models.
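As a practical starting point for that tuning, a coarse sweep over a few candidate learning rates (with everything else fixed) is often enough to find a reasonable range. The sketch below assumes two hypothetical helpers, build_autoencoder() (returns a freshly initialized model) and train_one_epoch() (trains for one epoch and returns the mean reconstruction loss); substitute your own model construction and training loop:

import torch.optim as optim

results = {}
for lr in [1e-2, 1e-3, 1e-4]:
    model = build_autoencoder()          # hypothetical helper: fresh weights per run
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    results[lr] = train_one_epoch(model, optimizer)  # hypothetical helper: mean loss
print(results)  # prefer the learning rate whose loss drops fastest without diverging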