Standard machine learning algorithms, like linear regression or neural networks with fixed architectures, often optimize by finding the best set of parameters $\theta$ for a predefined model structure. Techniques like gradient descent are used to iteratively update these parameters to minimize a loss function averaged over the training data. The update rule typically looks like $\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$, where $\nabla_\theta L(\theta_t)$ is the gradient of the loss with respect to the parameters.
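For reference, here is a minimal sketch of that parameter-space update for a toy one-parameter linear model; the data, step size, and variable names are illustrative only.

```python
import numpy as np

# Toy data: y is roughly 3*x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 3.0 * x + rng.normal(scale=0.1, size=50)

theta = 0.0   # single parameter of the model f(x) = theta * x
eta = 0.1     # step size (learning rate)

for _ in range(100):
    # Mean squared error loss: L(theta) = mean((y - theta*x)^2)
    # Gradient of the loss with respect to theta:
    grad = np.mean(-2.0 * x * (y - theta * x))
    # Parameter-space gradient descent step: theta <- theta - eta * grad
    theta -= eta * grad

print(theta)  # close to 3.0
```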
Gradient Boosting Machines (GBMs) approach optimization from a different perspective. Instead of optimizing parameters within a fixed model structure, GBMs build the model itself iteratively. The "parameter" we are optimizing is the entire function representing the ensemble. We are searching through the space of possible functions to find the one that best minimizes the overall loss on the training data. This optimization process can be understood as performing gradient descent in function space.
Imagine a high-dimensional space where each point corresponds to a specific function $F(x)$. Our goal is to find the function $F^*$ in this space that minimizes the total loss over our dataset $\{(x_i, y_i)\}$ for $i = 1, \dots, N$:

$$F^* = \underset{F}{\arg\min} \sum_{i=1}^{N} L(y_i, F(x_i))$$
Just like in parameter space, we want to iteratively improve our current estimate of the optimal function. Let $F_m(x)$ be our current ensemble model after $m$ boosting iterations. We want to find a new function $h_{m+1}(x)$ (our base learner, typically a decision tree) such that adding it to the current model moves us closer to the minimum loss. That is, we want:

$$F_{m+1}(x) = F_m(x) + \nu \, h_{m+1}(x)$$
where $\nu$ is a step size (the learning rate), and $h_{m+1}(x)$ points in a direction that reduces the loss $\sum_{i=1}^{N} L(y_i, F_{m+1}(x_i))$.
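To make "searching through the space of functions" concrete, the toy sketch below (made-up data, two hand-picked candidate functions, squared-error loss) compares candidate functions purely by the total training loss they achieve; that total is the quantity each functional gradient step will push down.

```python
import numpy as np

# Toy regression dataset
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.5, 4.0, 5.5])

def total_loss(F, x, y):
    """Total squared-error loss sum_i L(y_i, F(x_i)) for a candidate function F."""
    return np.sum(0.5 * (y - F(x)) ** 2)

def F_constant(x):
    # One "point" in function space: predict the training mean everywhere
    return np.full_like(x, 3.25)

def F_linear(x):
    # Another "point": a hand-picked linear function
    return 1.0 + 1.5 * x

print(total_loss(F_constant, x, y))   # 5.625 (a worse point in function space)
print(total_loss(F_linear, x, y))     # 0.0   (a better point for this toy data)
```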
In standard gradient descent, the negative gradient tells us the direction in parameter space that leads to the steepest decrease in loss. What is the equivalent in function space?
We can think of the predictions $F(x_1), F(x_2), \dots, F(x_N)$ for each training sample as the "coordinates" defining our current position in function space. The gradient of the total loss with respect to these coordinates is a vector where the $i$-th component is the partial derivative of the individual loss term $L(y_i, F(x_i))$ with respect to the model's prediction $F(x_i)$, evaluated at the current model $F_m$:

$$\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_m(x)}$$
This vector represents the direction of steepest ascent for the total loss in function space at the point defined by $F_m$.
To minimize the loss, we need to move in the direction opposite to the gradient. We define the negative gradient components, often called pseudo-residuals, for each sample $i$ at iteration $m$:

$$r_{i,m} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_m(x)}$$
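A sketch of this computation, assuming we have the analytic derivative of the per-sample loss with respect to the prediction; the helper function and loss choice here are illustrative, not from any particular library.

```python
import numpy as np

def pseudo_residuals(y, current_preds, loss_grad):
    """Negative gradient of the loss with respect to each prediction F_m(x_i).

    loss_grad(y, f) must return dL(y, f)/df, evaluated elementwise.
    """
    return -loss_grad(y, current_preds)

# Example: absolute-error loss L(y, f) = |y - f|; its derivative w.r.t. f is -sign(y - f)
def abs_loss_grad(y, f):
    return -np.sign(y - f)

y   = np.array([3.0, -1.0, 2.0])
F_m = np.array([2.5,  0.0, 2.0])   # current model's predictions on the training set
print(pseudo_residuals(y, F_m, abs_loss_grad))   # [ 1. -1.  0.] : the signs of the residuals
```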
These pseudo-residuals represent the target values that our next base learner should approximate. Why "pseudo-residuals"?
Consider the case of regression with Squared Error loss: $L(y_i, F(x_i)) = \frac{1}{2}\left(y_i - F(x_i)\right)^2$. The gradient component is:

$$\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} = -(y_i - F(x_i)) = F(x_i) - y_i$$
Evaluating this at $F(x) = F_m(x)$ gives $F_m(x_i) - y_i$. The negative gradient is therefore:

$$r_{i,m} = -\left(F_m(x_i) - y_i\right) = y_i - F_m(x_i)$$
In this specific case, the pseudo-residuals are exactly the ordinary residuals (the difference between the true value and the current model's prediction). For other loss functions, $r_{i,m}$ is not the simple residual but still represents the direction (in function space, evaluated pointwise) where the current model needs the most improvement.
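A quick numerical check of this identity for the squared-error case, using toy numbers and NumPy only:

```python
import numpy as np

y   = np.array([3.0, -1.0, 2.0])
F_m = np.array([2.5,  0.5, 2.0])   # current ensemble predictions

# Derivative of L(y, f) = 0.5 * (y - f)^2 with respect to f is (f - y)
grad = F_m - y

# Pseudo-residuals are the negative gradient ...
r = -grad
# ... which is exactly the ordinary residual y - F_m(x)
assert np.allclose(r, y - F_m)
print(r)   # [ 0.5 -1.5  0. ]
```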
We cannot directly "add" the vector of pseudo-residuals to our function $F_m(x)$. Instead, we fit a base learner $h_{m+1}(x)$ to predict these pseudo-residuals based on the input features $x_i$:

$$h_{m+1} = \underset{h}{\arg\min} \sum_{i=1}^{N} \left( r_{i,m} - h(x_i) \right)^2$$
This base learner provides an approximation of the desired negative gradient step across the entire input domain, not just at the training points.
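For instance, with scikit-learn's DecisionTreeRegressor as the base learner (a common but not mandatory choice), the fitting step looks roughly like the sketch below; the toy data and the choice of a constant initial model are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

current_preds = np.full(len(y), y.mean())   # e.g., F_0(x) = mean of y

# Squared-error pseudo-residuals: r_i = y_i - F_m(x_i)
residuals = y - current_preds

# Fit a shallow regression tree h_{m+1}(x) to approximate the pseudo-residuals
h = DecisionTreeRegressor(max_depth=2)
h.fit(X, residuals)

# h.predict(x) now approximates the negative gradient anywhere in the input space,
# not only at the training points
print(h.predict(np.array([[0.5], [2.0]])))
```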
Finally, we update our ensemble model by adding this new learner, scaled by the learning rate $\nu$:

$$F_{m+1}(x) = F_m(x) + \nu \, h_{m+1}(x)$$
This update performs an approximate gradient descent step in function space. By iteratively computing pseudo-residuals based on the current ensemble's errors and fitting new base learners to these pseudo-residuals, the GBM algorithm progressively minimizes the overall loss function. This functional gradient view provides a powerful unifying framework for understanding how gradient boosting works across different loss functions and base learners.
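Putting the pieces together, here is a minimal end-to-end sketch of the functional gradient descent loop for squared-error loss, using shallow regression trees as base learners. It is a bare-bones illustration (no subsampling, no regularization beyond the learning rate, no line search), not a substitute for library implementations such as sklearn.ensemble.GradientBoostingRegressor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_iterations=100, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared-error loss via functional gradient descent."""
    # F_0(x): a constant prediction (the mean minimizes squared error)
    f0 = y.mean()
    preds = np.full(len(y), f0)
    learners = []

    for _ in range(n_iterations):
        # 1. Pseudo-residuals = negative gradient of the loss at the current model
        residuals = y - preds
        # 2. Fit a base learner h_{m+1} to the pseudo-residuals
        h = DecisionTreeRegressor(max_depth=max_depth)
        h.fit(X, residuals)
        # 3. Functional gradient step: F_{m+1} = F_m + nu * h_{m+1}
        preds += learning_rate * h.predict(X)
        learners.append(h)

    def predict(X_new):
        out = np.full(len(X_new), f0)
        for h in learners:
            out += learning_rate * h.predict(X_new)
        return out

    return predict

# Toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

predict = fit_gbm(X, y)
print(np.mean((y - predict(X)) ** 2))   # training MSE shrinks as iterations are added
```

A smaller learning rate takes more conservative steps in function space, typically requiring more boosting iterations in exchange for better generalization.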