To build models additively, where each new component corrects the errors of the ensemble so far, we need a systematic way to determine what correction to make. This is where optimization techniques come into play, and one of the most fundamental is gradient descent. While you have likely encountered gradient descent when training models such as linear regression or neural networks by adjusting weight parameters, its application in gradient boosting is related but subtly different. Understanding the core mechanism is essential before we see how it's adapted for boosting.
At the heart of supervised learning lies the objective of minimizing a loss function, often denoted as $L(y, F(x))$. This function quantifies the discrepancy between the true target values ($y$) and the predictions made by our model ($F(x)$). Common examples include Mean Squared Error (MSE) for regression or Log Loss for classification. Our goal is to find a model $F(x)$ that makes this loss value as small as possible across our training data.
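To make these losses concrete, here is a small NumPy sketch with made-up numbers that computes MSE for a regression example and log loss for a binary classification example; the values are illustrative only:

```python
import numpy as np

# Illustrative values, made up for this example
y_true = np.array([3.0, -0.5, 2.0])   # regression targets
y_pred = np.array([2.5, 0.0, 2.1])    # model predictions

# Mean Squared Error: average squared difference between targets and predictions
mse = np.mean((y_true - y_pred) ** 2)

# Log Loss for binary classification: penalizes confident wrong probabilities
labels = np.array([1, 0, 1])          # true binary labels
probs = np.array([0.9, 0.2, 0.7])     # predicted P(y = 1)
log_loss = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(f"MSE: {mse:.3f}, Log Loss: {log_loss:.3f}")
```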
Imagine the loss function as a landscape with hills and valleys, where the altitude at any point represents the loss value for a given model configuration. We want to find the lowest point in this landscape, the minimum of the loss function. Gradient descent provides a simple iterative strategy:

1. Start at some initial point (an initial parameter configuration).
2. Determine the direction of steepest descent from the current position.
3. Take a small step in that direction.
4. Repeat from the new position until the loss stops decreasing appreciably.
How do we find the direction of steepest descent? Calculus tells us that the gradient of the loss function, denoted $\nabla L$, points in the direction of the steepest ascent. Therefore, to move downhill, we simply move in the opposite direction of the gradient, i.e., along $-\nabla L$.
In the context of optimizing model parameters $\theta$, the basic gradient descent update rule at iteration $t$ is:
$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

Let's break this down:

- $\theta_t$: the model parameters at iteration $t$.
- $\nabla L(\theta_t)$: the gradient of the loss function with respect to the parameters, evaluated at the current values $\theta_t$.
- $\eta$ (eta): the learning rate, a small positive scalar that controls the size of each step.
The process is repeated, iteratively adjusting the parameters to progressively lower the loss.
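To make this concrete, the following is a minimal sketch of batch gradient descent fitting a simple linear model $F(x) = wx + b$ under MSE loss; the synthetic data, learning rate, and iteration count are illustrative choices, not recommendations:

```python
import numpy as np

# Synthetic data: y = 2.5 * x + 1.0 plus noise (values chosen for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 2.5 * X + 1.0 + rng.normal(0, 1, size=100)

w, b = 0.0, 0.0   # initial parameters theta_0
eta = 0.01        # learning rate

for t in range(2000):
    preds = w * X + b
    # Gradients of the MSE loss L = mean((y - preds)^2) with respect to w and b
    grad_w = -2 * np.mean((y - preds) * X)
    grad_b = -2 * np.mean(y - preds)
    # The update rule: theta_{t+1} = theta_t - eta * grad L(theta_t)
    w -= eta * grad_w
    b -= eta * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")  # should approach w ≈ 2.5, b ≈ 1.0
```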
The choice of the learning rate $\eta$ is significant for the algorithm's performance:

- If $\eta$ is too small, convergence is very slow, requiring many iterations to approach the minimum.
- If $\eta$ is too large, the steps can overshoot the minimum, causing oscillation or even divergence.
- A well-chosen $\eta$ moves toward the minimum efficiently.
Finding an appropriate learning rate often involves experimentation and tuning. In the context of gradient boosting, this learning rate is often referred to as shrinkage, and it plays a dual role in controlling the learning process and acting as a form of regularization, which we will explore later.
A simplified 1D example ($L = x^2$) showing how different learning rates ($\eta$) affect the steps taken by gradient descent starting from $x = 4$. The goal is to reach $x = 0$. A good rate converges efficiently, a small rate converges slowly, a large rate overshoots, and a rate that is too large can diverge.
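The figure's behavior is easy to reproduce in code. This short sketch applies the update rule to $L(x) = x^2$ (gradient $2x$) starting from $x = 4$; the specific learning rates are chosen only to illustrate the four regimes:

```python
# Gradient descent on L(x) = x^2, whose gradient is 2x; the minimum is at x = 0
def descend(eta, x=4.0, steps=10):
    path = [x]
    for _ in range(steps):
        x = x - eta * 2 * x   # update: x <- x - eta * dL/dx
        path.append(x)
    return path

for eta in (0.05, 0.4, 0.9, 1.1):
    steps = ", ".join(f"{v:.2f}" for v in descend(eta)[:6])
    print(f"eta = {eta}: {steps}, ...")

# eta = 0.05 creeps toward 0 slowly; eta = 0.4 converges quickly;
# eta = 0.9 overshoots past 0 each step but still converges;
# eta = 1.1 overshoots so far that it diverges.
```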
While the basic update rule uses the gradient calculated on the entire dataset (Batch Gradient Descent), variants exist. Stochastic Gradient Descent (SGD) calculates the gradient based on a single data point, and Mini-batch Gradient Descent uses a small subset of the data. These can be computationally cheaper and sometimes help escape shallow local minima, although standard GBM often computes gradients over the full dataset unless stochastic subsampling is explicitly used.
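For comparison, here is a minimal sketch of the mini-batch variant applied to the same linear model as before; the batch size and other settings are arbitrary illustrations, and setting `batch_size` to 1 recovers pure SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 2.5 * X + 1.0 + rng.normal(0, 1, size=100)

w, b = 0.0, 0.0
eta, batch_size = 0.01, 16   # batch_size = 1 would give pure SGD

for t in range(2000):
    # Estimate the gradient from a random mini-batch instead of the full dataset
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    preds = w * Xb + b
    w -= eta * (-2 * np.mean((yb - preds) * Xb))
    b -= eta * (-2 * np.mean(yb - preds))

print(f"w = {w:.2f}, b = {b:.2f}")  # noisier per step, but similar end result
```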
Now, how does this relate to building an ensemble of trees additively? Instead of optimizing a fixed set of parameters $\theta$, Gradient Boosting performs optimization in function space. We start with an initial simple model (e.g., the mean of the target values for regression). Then, at each iteration $m$, we want to find a new function $h_m(x)$ (our base learner, typically a decision tree) to add to the current ensemble $F_{m-1}(x)$ such that the overall loss is reduced:
$$F_m(x) = F_{m-1}(x) + \eta\, h_m(x)$$

Gradient Boosting uses the gradient descent idea here: it calculates the negative gradient of the loss function $L(y, F(x))$ with respect to the current model's predictions $F_{m-1}(x)$, evaluated for each training instance $i$:
$$r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}$$

These negative gradients, $r_{im}$, often called pseudo-residuals, represent the "direction" in function space where the loss is decreasing most rapidly for each data point, given the current ensemble's predictions. The algorithm then fits the new base learner $h_m(x)$ to approximate these pseudo-residuals. Essentially, we are using gradient descent to guide the sequential construction of our ensemble model, telling us what errors (as defined by the negative gradient of the loss) the next tree should focus on correcting.
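The following is a minimal sketch of this loop for squared-error loss, where the pseudo-residuals reduce to ordinary residuals $y_i - F_{m-1}(x_i)$. It uses scikit-learn's DecisionTreeRegressor as the base learner; the synthetic data and hyperparameters are illustrative only:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

eta = 0.1                       # learning rate (shrinkage)
F = np.full_like(y, y.mean())   # F_0: initial constant model
trees = []

for m in range(100):
    # Pseudo-residuals: for L = (y - F)^2 / 2, the negative gradient is y - F
    r = y - F
    # Fit the base learner h_m to the pseudo-residuals
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)
    trees.append(h)
    # Update the ensemble: F_m = F_{m-1} + eta * h_m
    F += eta * h.predict(X)

print(f"Training MSE after boosting: {np.mean((y - F) ** 2):.4f}")
```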
This view of boosting as gradient descent in function space is a central concept that underpins the algorithms discussed in this course. We will formalize this further in the next chapter when we derive the generic Gradient Boosting Machine algorithm. For now, the main takeaway is that gradient descent provides the optimization machinery for iteratively improving the ensemble model by adding base learners that target the remaining errors, as indicated by the gradient of the loss function.