In standard machine learning algorithms, like linear regression or neural networks with fixed architectures, optimization often involves finding the best set of parameters $\theta$ for a predefined model structure $f(x;\theta)$. We use techniques like gradient descent to iteratively update these parameters to minimize a loss function $L(y, f(x;\theta))$ averaged over the training data. The update rule typically looks like $\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla_\theta L$, where $\nabla_\theta L$ is the gradient of the loss with respect to the parameters.
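To ground the comparison, here is a minimal sketch of that parameter-space update for a one-parameter model $f(x;\theta) = \theta x$ under squared error loss (the data and variable names are made up for illustration):

```python
import numpy as np

# Toy data: y is roughly 3*x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

theta = 0.0   # single parameter of the model f(x; theta) = theta * x
eta = 0.1     # step size (learning rate)

for _ in range(200):
    preds = theta * x
    # Gradient of the mean squared error loss with respect to theta
    grad = np.mean(-2.0 * (y - preds) * x)
    theta = theta - eta * grad   # theta_new = theta_old - eta * grad

print(theta)  # approaches 3.0
```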
Gradient Boosting Machines (GBMs) approach optimization from a different perspective. Instead of optimizing parameters within a fixed model structure, GBMs build the model itself iteratively. The "parameter" we are optimizing is the entire function $F(x)$ representing the ensemble. We are searching through the space of possible functions to find the one that best minimizes the overall loss on the training data. This optimization process can be understood as performing gradient descent in function space.
Imagine a high-dimensional space where each point corresponds to a specific function $F$. Our goal is to find the function $F^*$ in this space that minimizes the total loss over our dataset $(x_i, y_i)$ for $i = 1, \ldots, N$:
$$L_{\text{total}}(F) = \sum_{i=1}^{N} L(y_i, F(x_i))$$

Just like in parameter space, we want to iteratively improve our current estimate of the optimal function. Let $F_{m-1}(x)$ be our current ensemble model after $m-1$ boosting iterations. We want to find a new function $h_m(x)$ (our base learner, typically a decision tree) such that adding it to the current model moves us closer to the minimum loss. That is, we want:
$$F_m(x) = F_{m-1}(x) + \eta\, h_m(x)$$

where $\eta$ is a step size (the learning rate), and $h_m(x)$ points in a direction that reduces the loss $L_{\text{total}}$.
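To make these objects concrete, the short sketch below evaluates $L_{\text{total}}$ for a toy dataset under squared error loss, treating the current model $F_{m-1}$ simply as its vector of training-set predictions (`squared_error` and `F_prev` are illustrative names, not library functions):

```python
import numpy as np

def squared_error(y, pred):
    """Per-sample loss L(y, F(x)) = 0.5 * (y - F(x))**2."""
    return 0.5 * (y - pred) ** 2

y = np.array([1.0, 2.0, 3.0, 4.0])       # targets y_i
F_prev = np.array([2.5, 2.5, 2.5, 2.5])  # current predictions F_{m-1}(x_i)

print(np.sum(squared_error(y, F_prev)))  # total loss of the current ensemble: 2.5

# Taking a small step toward the targets reduces the total loss
h = y - F_prev                 # a candidate direction (here: the residuals)
eta = 0.1
F_new = F_prev + eta * h
print(np.sum(squared_error(y, F_new)))   # 2.025, smaller than before
```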
In standard gradient descent, the negative gradient $-\nabla_\theta L$ tells us the direction in parameter space that leads to the steepest decrease in loss. What is the equivalent in function space?
We can think of the predictions $F(x_i)$ for each training sample $i$ as the "coordinates" defining our current position in function space. The gradient of the total loss $L_{\text{total}}$ with respect to these coordinates is a vector whose $i$-th component is the partial derivative of the individual loss term $L(y_i, F(x_i))$ with respect to the model's prediction $F(x_i)$, evaluated at the current model $F_{m-1}$:
$$g_{im} = \left[\frac{\partial L(y_i, F(x))}{\partial F(x)}\right]_{F(x) = F_{m-1}(x_i)}$$

This vector $(g_{1m}, g_{2m}, \ldots, g_{Nm})$ represents the direction of steepest ascent for the total loss in function space at the point defined by $F_{m-1}$.
To minimize the loss, we need to move in the direction opposite to the gradient. We define the negative gradient components, often called pseudo-residuals, for each sample $i$ at iteration $m$:
$$r_{im} = -g_{im} = -\left[\frac{\partial L(y_i, F(x))}{\partial F(x)}\right]_{F(x) = F_{m-1}(x_i)}$$

These pseudo-residuals $r_{im}$ represent the target values that our next base learner $h_m(x)$ should approximate. Why "pseudo-residuals"?
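In code, computing pseudo-residuals just means evaluating the negative derivative of the chosen loss at the current predictions. Here is a small sketch assuming squared error and absolute error losses, with the derivatives written out by hand:

```python
import numpy as np

def pseudo_residuals_squared_error(y, F_prev):
    # L = 0.5*(y - F)^2  ->  dL/dF = -(y - F)  ->  r = y - F
    return y - F_prev

def pseudo_residuals_absolute_error(y, F_prev):
    # L = |y - F|  ->  dL/dF = -sign(y - F)  ->  r = sign(y - F)
    return np.sign(y - F_prev)

y = np.array([1.0, 4.0, 3.0])
F_prev = np.array([2.0, 2.0, 2.0])

print(pseudo_residuals_squared_error(y, F_prev))   # [-1.  2.  1.]
print(pseudo_residuals_absolute_error(y, F_prev))  # [-1.  1.  1.]
```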
Consider the case of regression with squared error loss: $L(y, F(x)) = \frac{1}{2}(y - F(x))^2$. The gradient component is:
$$\frac{\partial L(y_i, F(x))}{\partial F(x)} = \frac{\partial}{\partial F(x)}\left(\frac{1}{2}(y_i - F(x))^2\right) = -(y_i - F(x))$$

Evaluating this at $F_{m-1}(x_i)$ gives $-(y_i - F_{m-1}(x_i))$. The negative gradient is therefore:
$$r_{im} = -\left[-(y_i - F_{m-1}(x_i))\right] = y_i - F_{m-1}(x_i)$$

In this specific case, the pseudo-residuals are exactly the ordinary residuals (the difference between the true value and the current model's prediction). For other loss functions, $r_{im}$ is not the simple residual, but it still represents the direction (in function space, evaluated pointwise) in which the current model $F_{m-1}$ needs the most improvement.
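If you want to convince yourself of this identity numerically, a finite-difference approximation of the derivative gives the same answer; the snippet below is only a sanity check, not part of the boosting algorithm:

```python
def loss(y, F):
    return 0.5 * (y - F) ** 2

y_i, F_prev_i = 3.0, 1.2
eps = 1e-6

# Finite-difference estimate of dL/dF at F = F_{m-1}(x_i)
grad = (loss(y_i, F_prev_i + eps) - loss(y_i, F_prev_i - eps)) / (2 * eps)

print(-grad)           # ~1.8, the pseudo-residual
print(y_i - F_prev_i)  # 1.8, the ordinary residual
```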
We cannot directly "add" the vector of pseudo-residuals $(r_{1m}, \ldots, r_{Nm})$ to our function $F_{m-1}(x)$. Instead, we fit a base learner $h_m(x)$ to predict these pseudo-residuals based on the input features $x_i$:
$$h_m = \arg\min_{h} \sum_{i=1}^{N} \left(r_{im} - h(x_i)\right)^2$$

This base learner $h_m(x)$ provides an approximation of the desired negative gradient step across the entire input domain, not just at the training points.
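In practice the base learner is typically a shallow regression tree fit to the pseudo-residuals. A minimal sketch, assuming scikit-learn's `DecisionTreeRegressor` plays the role of $h_m$:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

F_prev = np.full(200, y.mean())   # predictions of the current ensemble F_{m-1}
r = y - F_prev                    # pseudo-residuals for squared error loss

# h_m: a shallow tree fit to approximate the pseudo-residuals from the features
h_m = DecisionTreeRegressor(max_depth=2)
h_m.fit(X, r)

print(h_m.predict(X[:5]))         # approximate negative gradient step at these inputs
```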
Finally, we update our ensemble model by adding this new learner, scaled by the learning rate $\eta$:
$$F_m(x) = F_{m-1}(x) + \eta\, h_m(x)$$

This update performs an approximate gradient descent step in function space. By iteratively computing pseudo-residuals based on the current ensemble's errors and fitting new base learners to these pseudo-residuals, the GBM algorithm progressively minimizes the overall loss function. This functional gradient view provides a powerful unifying framework for understanding how gradient boosting works across different loss functions and base learners.
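Putting the pieces together, a deliberately simplified gradient boosting loop for squared error loss could look like the sketch below; it illustrates the functional update, not the exact implementation used by any particular library:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_rounds=100, eta=0.1, max_depth=2):
    """Minimal gradient boosting sketch for squared error loss."""
    F0 = float(y.mean())          # initial model: a constant prediction
    F = np.full(len(y), F0)       # current predictions F_{m-1}(x_i)
    trees = []
    for m in range(n_rounds):
        r = y - F                                 # pseudo-residuals (negative gradient)
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        F = F + eta * h.predict(X)                # F_m = F_{m-1} + eta * h_m
        trees.append(h)
    return F0, eta, trees

def predict_gbm(model, X):
    F0, eta, trees = model
    return F0 + eta * sum(h.predict(X) for h in trees)

# Usage on a toy regression problem
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

model = fit_gbm(X, y)
print(np.mean((y - predict_gbm(model, X)) ** 2))  # training MSE after 100 rounds
```

Swapping in a different loss only changes how the pseudo-residuals `r` are computed; the rest of the loop stays the same.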