While the concept of gradient descent using the entire dataset (often called Batch Gradient Descent) provides a direct path towards minimizing the loss function, it comes with a significant computational cost. Calculating the gradient requires processing every single training example before making even one update to the network's weights and biases. For datasets with millions of examples, this becomes prohibitively slow and memory-intensive.
To address this, several variants of gradient descent have been developed. These methods aim to achieve faster convergence and handle large datasets more effectively.
Stochastic Gradient Descent (SGD) takes the opposite approach to Batch Gradient Descent. Instead of calculating the gradient based on the entire dataset, SGD updates the parameters using the gradient computed from just one randomly selected training example at each step.
The update rule for a weight $W$ looks similar to Batch GD, but the loss and its gradient $\nabla \text{Loss}$ are computed using only a single sample $(x^{(i)}, y^{(i)})$:

$$W_{\text{new}} = W_{\text{old}} - \eta \, \nabla \text{Loss}(W_{\text{old}};\, x^{(i)}, y^{(i)})$$
Advantages:

- Each update is very fast, since it processes only a single example.
- Frequent updates give rapid initial progress on large datasets.
- The noise in the updates can help the optimizer escape shallow local minima.
Disadvantages:

- The gradient from one sample is a noisy estimate, so the loss fluctuates heavily from step to step.
- The path zig-zags and may never settle exactly at the minimum.
- Processing one example at a time cannot exploit vectorized hardware efficiently.
Imagine navigating down a foggy mountain. Batch GD carefully surveys the entire area around it before taking a step in the best direction. SGD takes a quick look in one direction (based on one piece of information) and immediately takes a step, repeating this rapidly. It moves much faster initially but might zig-zag a lot.
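To make this concrete, here is a minimal NumPy sketch of SGD on a least-squares linear regression problem. The synthetic data, learning rate, and step count are illustrative assumptions, not values from the text; any differentiable per-sample loss would follow the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # 1000 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
eta = 0.01                                # learning rate

for step in range(5000):
    i = rng.integers(len(X))              # pick one random training example
    error = X[i] @ w - y[i]               # prediction error for that sample
    grad = 2 * error * X[i]               # gradient of (x_i . w - y_i)^2
    w -= eta * grad                       # update from a single sample

print(w)  # close to true_w, with residual noise from the one-sample updates
```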
{"layout": {"title": "Gradient Descent Path Comparison", "xaxis": {"title": "Parameter 1", "range": [-1, 5]}, "yaxis": {"title": "Parameter 2", "range": [-1, 5]}, "showlegend": true, "width": 600, "height": 400, "contours": {"coloring": "heatmap", "showlabels": true, "labelfont": {"size": 10, "color": "white"}}}, "data": [{"type": "contour", "z": [[(i-2)**2 + (j-2)**2 for i in range(6)] for j in range(6)], "x": [0, 1, 2, 3, 4, 5], "y": [0, 1, 2, 3, 4, 5], "colorscale": "Blues", "reversescale": true, "showscale": false}, {"type": "scatter", "x": [0.5, 1.0, 1.4, 1.7, 1.9, 2.0], "y": [0.5, 1.0, 1.4, 1.7, 1.9, 2.0], "mode": "lines+markers", "name": "Batch GD", "line": {"color": "#1c7ed6"}, "marker": {"size": 6}}, {"type": "scatter", "x": [0.5, 1.5, 1.0, 2.2, 1.8, 2.5, 1.9, 2.1, 2.0], "y": [0.5, 0.2, 1.2, 1.5, 2.5, 2.2, 1.8, 2.3, 2.1], "mode": "lines+markers", "name": "SGD", "line": {"color": "#fa5252"}, "marker": {"size": 6}}, {"type": "scatter", "x": [0.5, 1.2, 1.3, 1.8, 1.9, 2.0], "y": [0.5, 0.8, 1.5, 1.7, 2.2, 2.1], "mode": "lines+markers", "name": "Mini-batch GD", "line": {"color": "#40c057"}, "marker": {"size": 6}}]}
A simplified view of optimization paths on a loss surface. Batch GD takes a direct route. SGD's path is noisy due to single-sample updates. Mini-batch GD offers a balance.
Mini-batch Gradient Descent strikes a balance between the stability of Batch GD and the speed of SGD. It computes the gradient and updates the parameters based on a small, randomly selected subset (a "mini-batch") of the training data. Common batch sizes range from 32 to 512, but can vary depending on the application and hardware capabilities.
The update rule uses the average gradient over the $m$ samples in the mini-batch:

$$W_{\text{new}} = W_{\text{old}} - \frac{\eta}{m} \sum_{i=1}^{m} \nabla \text{Loss}(W_{\text{old}};\, x^{(i)}, y^{(i)})$$
Advantages:

- Gradient estimates are far less noisy than SGD's, while each update remains much cheaper than Batch GD's.
- Mini-batches map well onto vectorized operations on GPUs and other parallel hardware.
Disadvantages:

- Introduces an additional hyperparameter, the batch size.
- Updates are still noisier than Batch GD's, so convergence is not perfectly smooth.
Mini-batch GD is like navigating the mountain by taking readings from a small area around you (the mini-batch) before taking a step. It's faster than surveying everything (Batch GD) and less erratic than only looking in one spot (SGD).
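The sketch below adapts the earlier SGD example to mini-batches. Again, the data, batch size, and learning rate are illustrative assumptions; the only structural change is averaging the gradient over a batch instead of using one sample.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
eta, m = 0.05, 32                         # learning rate and mini-batch size

for epoch in range(50):
    perm = rng.permutation(len(X))        # reshuffle the data every epoch
    for start in range(0, len(X), m):
        batch = perm[start:start + m]
        error = X[batch] @ w - y[batch]
        grad = 2 * (X[batch].T @ error) / len(batch)  # average over the batch
        w -= eta * grad

print(w)  # close to true_w, with a smoother path than pure SGD
```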
While Mini-batch GD is effective, researchers have developed more sophisticated optimization algorithms that often lead to faster convergence and better results. These typically address two main challenges: navigating difficult loss landscapes (e.g., ravines, saddle points) and choosing an appropriate learning rate.
Imagine a ball rolling down a hill. It doesn't just stop instantly; it accumulates momentum, helping it roll over small bumps and accelerate down slopes. The Momentum update incorporates a similar idea into gradient descent. It adds a fraction β (the momentum coefficient, typically around 0.9) of the previous update vector to the current gradient step.
Let $v_t$ be the update vector (velocity) at step $t$:

$$v_t = \beta v_{t-1} + \eta \, \nabla \text{Loss}(W_{t-1})$$
$$W_t = W_{t-1} - v_t$$
(Note: the learning rate $\eta$ is sometimes factored differently, e.g., $v_t = \beta v_{t-1} + (1-\beta)\nabla \text{Loss}$ with the update $W_t = W_{t-1} - \alpha v_t$. The core idea remains the same.)
Momentum helps accelerate SGD in the relevant direction and dampens oscillations, especially useful in landscapes with narrow ravines where standard gradient descent might bounce back and forth.
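Here is a short sketch of the Momentum update on a toy elongated quadratic loss, $\text{Loss}(w) = 5w_1^2 + 0.5w_2^2$, chosen as an assumed stand-in for a "ravine"; the loss function, coefficients, and step counts are illustrative, not from the text.

```python
import numpy as np

# Toy "ravine": Loss(w) = 5*w1^2 + 0.5*w2^2, steep in w1, shallow in w2.
def grad(w):
    return np.array([10.0 * w[0], 1.0 * w[1]])

w = np.array([1.0, 1.0])
v = np.zeros(2)                           # velocity, initialized to zero
eta, beta = 0.05, 0.9                     # learning rate, momentum coefficient

for t in range(200):
    v = beta * v + eta * grad(w)          # v_t = beta*v_{t-1} + eta*grad
    w = w - v                             # W_t = W_{t-1} - v_t

print(w)  # approaches the minimum at the origin
```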
Instead of using a single, fixed learning rate η for all parameters, adaptive methods adjust the learning rate during training, often per parameter. They typically give smaller updates to parameters associated with frequently occurring features and larger updates to parameters associated with infrequent features.
RMSprop (Root Mean Square Propagation): This method maintains a moving average of the squared gradients for each parameter and divides the learning rate by the square root of this average. This effectively decreases the learning rate for parameters with consistently large gradients and increases it for parameters with small gradients. Let $E[g^2]_t$ be the moving average of squared gradients at step $t$:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\left(\nabla \text{Loss}(W_{t-1})\right)^2$$

The update rule becomes:

$$W_t = W_{t-1} - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \, \nabla \text{Loss}(W_{t-1})$$

Here, $\gamma$ is the decay rate (e.g., 0.9) and $\epsilon$ is a small constant (e.g., $10^{-8}$) to prevent division by zero.
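A minimal RMSprop sketch on the same assumed toy quadratic loss follows; note how each parameter receives its own effective step size.

```python
import numpy as np

# Same toy loss as above: Loss(w) = 5*w1^2 + 0.5*w2^2.
def grad(w):
    return np.array([10.0 * w[0], 1.0 * w[1]])

w = np.array([1.0, 1.0])
Eg2 = np.zeros(2)                         # moving average of squared gradients
eta, gamma, eps = 0.01, 0.9, 1e-8

for t in range(500):
    g = grad(w)
    Eg2 = gamma * Eg2 + (1 - gamma) * g**2       # E[g^2]_t
    w = w - eta * g / (np.sqrt(Eg2) + eps)       # per-parameter scaled step

print(w)  # both coordinates move toward the origin at similar rates
```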
Adam (Adaptive Moment Estimation): Adam is arguably one of the most popular and effective optimizers currently used. It combines the ideas of Momentum and RMSprop. It keeps track of both an exponentially decaying average of past gradients (first moment, $m_t$) and an exponentially decaying average of past squared gradients (second moment, $v_t$):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla \text{Loss}(W_{t-1})$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\left(\nabla \text{Loss}(W_{t-1})\right)^2$$

Adam also incorporates a bias correction step, which is particularly important during the initial stages of training when the moment estimates are initialized at zero:

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

The final parameter update uses these bias-corrected estimates:

$$W_t = W_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$

Common default values are $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. Adam often works well with minimal hyperparameter tuning, making it a good default choice for many problems.
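The following sketch puts the Adam equations together on the same assumed toy loss, using the default hyperparameters listed above. The step counter starts at 1 so the bias-correction denominators are well defined.

```python
import numpy as np

# Same toy loss as above: Loss(w) = 5*w1^2 + 0.5*w2^2.
def grad(w):
    return np.array([10.0 * w[0], 1.0 * w[1]])

w = np.array([1.0, 1.0])
m = np.zeros(2)                           # first moment estimate
v = np.zeros(2)                           # second moment estimate
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):                  # t starts at 1 for bias correction
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)            # bias-corrected first moment
    v_hat = v / (1 - beta2**t)            # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # converges close to the origin
```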
The best optimizer can be problem-dependent. While Adam is a great starting point, experimenting with others like SGD with Momentum or RMSprop might yield better performance for specific network architectures or datasets. Understanding these variants allows you to make informed choices when training your neural networks, moving beyond the basic gradient descent introduced earlier towards more practical and powerful training techniques.