While Batch Gradient Descent offers a direct path towards the minimum by averaging the gradient over the entire dataset, it becomes computationally expensive, sometimes prohibitively so, when dealing with very large datasets. Imagine calculating the gradient contribution from millions or even billions of data points just to make a single update step! Stochastic Gradient Descent (SGD) provides a practical alternative by dramatically reducing the computation required for each parameter update.
Instead of using the entire dataset, SGD updates the parameters using the gradient calculated from only one randomly chosen training example $(x^{(i)}, y^{(i)})$ in each iteration.
The core idea is simple: take a single example, compute the cost and gradient just for that example, and update the parameters immediately. Then, pick another random example and repeat.
Mathematically, the update rule for a single parameter $\theta_j$ looks like this:

$$\theta_j := \theta_j - \alpha \, \nabla_{\theta_j} J(\theta; x^{(i)}, y^{(i)})$$

Here, $J(\theta; x^{(i)}, y^{(i)})$ represents the cost function calculated for the single training example $(x^{(i)}, y^{(i)})$, and $\nabla_{\theta_j} J(\theta; x^{(i)}, y^{(i)})$ is the gradient of that cost with respect to the parameter $\theta_j$, evaluated using only that single example. Notice the absence of the summation and the $\frac{1}{m}$ term seen in Batch Gradient Descent.
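To make the update loop concrete, here is a minimal Python sketch, assuming a simple linear regression model with a squared-error cost on synthetic data (the model, data, learning rate, and epoch count are illustrative choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: y = 2 + 3*x plus a little noise
m = 1000
X = np.c_[np.ones(m), rng.uniform(-1, 1, m)]   # column of 1s for the bias term
y = X @ np.array([2.0, 3.0]) + 0.1 * rng.standard_normal(m)

theta = np.zeros(2)     # parameters to learn
alpha = 0.05            # learning rate
epochs = 5

for epoch in range(epochs):
    for i in rng.permutation(m):          # visit examples in random order
        x_i, y_i = X[i], y[i]
        error = x_i @ theta - y_i          # prediction error for this one example
        grad = error * x_i                 # gradient of (1/2)*(x_i . theta - y_i)^2
        theta -= alpha * grad              # immediate update from a single example

print(theta)   # should land close to [2, 3]
```

Note that the parameter update happens inside the inner loop, once per example, which is what makes each individual step so cheap compared to a full pass over the dataset.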
The term "stochastic" refers to the randomness involved in selecting a single data point for each update. Because each update is based on only one example, the gradient calculation ∇θjJ(θ;x(i),y(i)) is a "noisy" estimate of the true gradient ∇θjJ(θ) (which would be averaged over all examples).
This means the path taken by SGD towards the minimum is not as smooth as that of Batch Gradient Descent. Instead of marching directly downhill, SGD tends to zig-zag and fluctuate. While this might seem inefficient, it is precisely this noisy behavior that offers some advantages.
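To see this noise concretely, the short sketch below (again using an assumed linear regression setup, not one defined in the text) compares individual single-example gradients to the full-batch gradient at one fixed parameter value; the spread across examples is exactly the variance that makes the SGD path fluctuate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same illustrative data-generating process as before
m = 1000
X = np.c_[np.ones(m), rng.uniform(-1, 1, m)]
y = X @ np.array([2.0, 3.0]) + 0.1 * rng.standard_normal(m)

theta = np.array([0.0, 0.0])                  # a fixed point in parameter space
errors = X @ theta - y
per_example_grads = errors[:, None] * X        # one gradient per training example
full_grad = per_example_grads.mean(axis=0)     # the "true" batch gradient

print("batch gradient:          ", full_grad)
print("one stochastic gradient:  ", per_example_grads[0])
print("std dev across examples:  ", per_example_grads.std(axis=0))
```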
Comparison of optimization paths on a simple quadratic cost function. Batch Gradient Descent takes a smoother, more direct route, while Stochastic Gradient Descent exhibits a noisier, fluctuating path towards the minimum.
SGD represents a trade-off: it sacrifices the smooth convergence of Batch Gradient Descent for much faster individual updates, making large-scale machine learning feasible. Its noisy updates, while sometimes problematic near the minimum, can also help the algorithm escape shallow local minima. However, the high variance of these updates motivates a natural middle ground: Mini-batch Gradient Descent, which we will discuss next.