Having established the general principle of gradient descent, iteratively adjusting parameters θ to minimize a cost function J(θ) by moving in the direction opposite the gradient ∇J(θ), we now examine the first specific variant: Batch Gradient Descent (BGD). The name "Batch" refers to its defining characteristic: it processes the entire training dataset as a single batch in each step of the algorithm.
How Batch Gradient Descent Works
Imagine you have a dataset containing m training examples. To perform one update of the parameters θ using Batch Gradient Descent, you follow these steps:
- Compute Predictions: For the current parameter values θ, calculate the model's prediction for every example in the training set.
- Calculate Errors: Determine the error (the difference between the prediction and the actual target value) for each training example.
- Compute the Gradient: Calculate the gradient of the cost function, ∇J(θ). This crucial step involves averaging the contributions to the gradient from all m training examples. If Jᵢ(θ) represents the cost associated with the i-th training example, the overall gradient is typically computed as:
∇J(θ) = (1/m) ∑ᵢ₌₁ᵐ ∇Jᵢ(θ)
This gradient represents the average direction of steepest ascent across the entire dataset for the current parameters.
- Update Parameters: Adjust the parameters θ by taking a step downhill, scaled by the learning rate α:
θ := θ − α ∇J(θ)
This entire cycle, processing all m examples to compute one gradient and perform one parameter update, constitutes a single iteration. Because every iteration makes a full pass over the training data, one BGD iteration is also exactly one epoch. The algorithm repeats these steps until a convergence criterion is met, such as the change in the cost function becoming very small or a maximum number of iterations being reached.
A single step in Batch Gradient Descent involves computing the gradient based on the entire training dataset before updating the model parameters.
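To make these four steps concrete, the sketch below applies batch gradient descent to linear regression with a mean-squared-error cost. It is a minimal NumPy illustration, not a reference implementation; the names batch_gradient_descent, alpha, max_iters, and tol are assumptions chosen for readability.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, max_iters=1000, tol=1e-8):
    """Minimal batch gradient descent for linear regression with an MSE cost.

    X : (m, n) feature matrix, y : (m,) targets.
    Returns the fitted parameter vector theta and the cost history.
    """
    m, n = X.shape
    theta = np.zeros(n)                        # initial parameters
    history = []

    def cost(theta):
        errors = X @ theta - y                 # prediction minus target, all m examples
        return (errors @ errors) / (2 * m)     # J(theta) = (1/2m) * sum of squared errors

    prev_cost = cost(theta)
    for _ in range(max_iters):
        # Steps 1-2: predictions and errors for the ENTIRE training set.
        errors = X @ theta - y
        # Step 3: average gradient over all m examples, grad = (1/m) * X^T (X theta - y).
        grad = X.T @ errors / m
        # Step 4: move against the gradient, scaled by the learning rate.
        theta -= alpha * grad

        current_cost = cost(theta)
        history.append(current_cost)
        # Convergence check: stop when the cost barely changes between iterations.
        if abs(prev_cost - current_cost) < tol:
            break
        prev_cost = current_cost

    return theta, history

# Example usage on a small synthetic dataset (illustrative values only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])   # bias column + one feature
y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)   # true parameters (3, 2)
theta, history = batch_gradient_descent(X, y, alpha=0.1)
print(theta)   # should approach [3.0, 2.0]
```

Note that each pass through the loop touches every one of the m rows exactly once before theta changes, which is precisely the "one iteration equals one epoch" behaviour described above.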
Characteristics of Batch Gradient Descent
Advantages:
- Stable Convergence: Because the gradient is computed over the entire dataset, it is the exact gradient of the overall cost function rather than a noisy estimate. This generally leads to a smoother convergence path towards the minimum, and for convex cost functions BGD is guaranteed to reach the global minimum given an appropriately chosen learning rate. The updates are far less noisy than those of the other variants.
- Parallelization: The calculation of the gradient contribution for each example (∇Jᵢ(θ)) can often be parallelized across examples, speeding up the gradient computation step on suitable hardware.
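As a rough illustration of this point, the snippet below computes the averaged gradient for the same mean-squared-error cost in two ways: an explicit loop over per-example contributions and a single vectorized matrix product. The names and data are illustrative assumptions; the point is only that the per-example terms are independent, so a vectorized library (or a GPU) can evaluate them in parallel.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))          # m = 1000 examples, n = 5 features
y = rng.normal(size=1000)
theta = np.zeros(5)

# Per-example contributions for the MSE cost: grad_i = x_i * (x_i . theta - y_i).
grad_loop = np.zeros(5)
for x_i, y_i in zip(X, y):
    grad_loop += x_i * (x_i @ theta - y_i)
grad_loop /= len(y)

# The same average as one matrix product; the independent per-example terms
# can be evaluated in parallel by the underlying linear-algebra routines.
grad_vec = X.T @ (X @ theta - y) / len(y)

print(np.allclose(grad_loop, grad_vec))   # True
```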
Disadvantages:
- Computational Cost: The primary drawback is the computational expense per iteration. Calculating the gradient requires processing every single training example before making even one update to the parameters. This becomes prohibitively slow and resource-intensive for very large datasets common in modern machine learning.
- Memory Requirements: The algorithm might need to load the entire dataset into memory to compute the gradient, which can be infeasible for datasets that exceed available RAM.
- No Online Learning: BGD requires the entire dataset upfront and cannot easily incorporate new data points as they arrive (online learning). The model needs to be retrained from scratch with the augmented dataset.
When to Use Batch Gradient Descent
Despite its limitations with large datasets, BGD is conceptually simple and can be effective when:
- The dataset is small enough to fit comfortably in memory and be processed within a reasonable time frame.
- A smooth and stable convergence path is preferred over faster, potentially noisier updates.
- The cost function is known to be convex, ensuring convergence to the global minimum.
Batch Gradient Descent provides a foundational understanding of how gradients guide optimization. However, its computational demands paved the way for more scalable alternatives like Stochastic Gradient Descent and Mini-batch Gradient Descent, which we will explore next. These methods trade the stability of the full-batch gradient for faster, noisier updates.