To train a deep learning model, our objective is to find the set of parameters (weights and biases) that minimizes a specific loss function for our given data. Optimization algorithms are the tools we use to perform this search efficiently. The most fundamental algorithm, forming the bedrock upon which many others are built, is Gradient Descent.
You'll likely recall Gradient Descent from introductory machine learning concepts. Its core idea is straightforward: iteratively adjust the model's parameters in the direction that maximally reduces the loss function. How do we know which direction reduces the loss the most? We use the gradient.
The loss function, let's call it J(θ), measures how poorly our model performs with its current parameters θ. Its domain is a high-dimensional space (one dimension for each parameter), and we want to find the point in this space where the loss is lowest.
The gradient of the loss function with respect to the parameters, denoted as ∇J(θ), is a vector pointing in the direction of the steepest ascent of the loss function at point θ. Since we want to minimize the loss, we should move in the direction opposite to the gradient.
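To make this concrete, consider a one-dimensional loss J(θ) = θ². Its gradient is ∇J(θ) = 2θ. At θ = 3 the gradient is 6, which points uphill (toward larger loss), so a descent step moves in the opposite direction, toward θ = 0, where the loss is smallest.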
Imagine you are standing on a foggy hill and want to get to the valley floor (minimum elevation, or minimum loss). You can't see the whole landscape, but you can feel the slope under your feet. The gradient tells you the direction of the steepest uphill path. To go down, you take a step in the exact opposite direction.
Gradient Descent formalizes this intuition with a simple update rule. In each step, we update the parameters θ as follows:
θ_new = θ_old − α ∇J(θ_old)

Let's break this down:

- θ_old is the current set of parameter values.
- ∇J(θ_old) is the gradient of the loss function evaluated at θ_old, pointing in the direction of steepest ascent.
- α is the learning rate, a small positive scalar that controls how large a step we take in the downhill direction.
- θ_new is the updated set of parameters after one step.
We repeat this update process iteratively, calculating the gradient and updating the parameters, hoping to converge to a set of parameters θ that results in a low value for the loss function J(θ).
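The sketch below shows this loop on a toy two-parameter quadratic loss. The functions `J` and `grad_J` are illustrative stand-ins for a real model's loss and gradient, chosen so the minimum is known in advance:

```python
import numpy as np

TARGET = np.array([1.0, -2.0])  # the known minimizer of this toy loss

def J(theta):
    # Toy quadratic loss, smallest when theta == TARGET
    return np.sum((theta - TARGET) ** 2)

def grad_J(theta):
    # Analytic gradient of the toy loss above
    return 2.0 * (theta - TARGET)

alpha = 0.1                    # learning rate
theta = np.array([4.0, 3.0])   # arbitrary starting point

for step in range(50):
    theta = theta - alpha * grad_J(theta)  # step opposite the gradient

print(theta)    # approaches [1, -2]
print(J(theta)) # approaches 0
```

After 50 steps the parameters have moved almost exactly to the minimum; with a real model the gradient comes from backpropagation rather than a closed form, but the update loop is identical.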
*Figure: A simplified 2D loss surface (contours represent equal loss values). The red line shows the path taken by Gradient Descent, starting from an initial point and taking steps opposite to the gradient towards the minimum (darkest blue area).*
The "standard" version of Gradient Descent, often referred to as Batch Gradient Descent (BGD), has a specific characteristic: to calculate the gradient ∇J(θ) in each update step, it processes the entire training dataset. It computes the average loss and the average gradient over all training examples before making a single parameter update.
This gives a very accurate estimate of the true gradient for the entire dataset, leading to a smooth convergence path towards a minimum. However, as mentioned in the chapter introduction, computing the gradient over the entire dataset can be computationally prohibitive for the massive datasets commonly used in deep learning. Imagine calculating predictions and gradients for millions of images just to make one tiny adjustment to the model's weights!
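The sketch below makes this cost visible using a toy linear-regression loss (the dataset, model, and variable names here are hypothetical, chosen only for illustration). Note that every single parameter update requires a full pass over all N examples:

```python
import numpy as np

# Hypothetical dataset: N examples, d features, linear model y ≈ X @ w
rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=N)

w = np.zeros(d)
alpha = 0.1  # learning rate

for step in range(100):
    # Batch Gradient Descent: each update touches ALL N examples
    predictions = X @ w                  # forward pass over the full dataset
    errors = predictions - y
    grad = (2.0 / N) * (X.T @ errors)    # average MSE gradient over all N examples
    w = w - alpha * grad                 # one parameter update per full pass

print(w)       # close to true_w
```

With N = 10,000 this is fast, but scale the same loop to millions of examples and an expensive forward pass, and paying a full sweep through the data for every single update quickly becomes impractical.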
This computational burden is a primary motivation for the variations of Gradient Descent we will explore next, such as Stochastic Gradient Descent (SGD) and Mini-batch Gradient Descent. These methods offer ways to get faster updates, albeit with some trade-offs, making the optimization process feasible for large-scale deep learning. Understanding the mechanics of Batch Gradient Descent provides the necessary foundation for appreciating why these variations were developed and how they work.