Additive modeling frameworks build models sequentially. The standard Gradient Boosting Machine (GBM) algorithm constructs such sequences by minimizing a loss function using a process analogous to gradient descent, but operating in function space.
Our objective is to find a function $F(x)$ that minimizes the expected value of a specified loss function $L(y, F(x))$, where $y$ is the true target value and $F(x)$ is our model's prediction. Since we work with a finite training set $\{(x_i, y_i)\}_{i=1}^{N}$, we aim to minimize the empirical loss:

$$\hat{F} = \arg\min_{F} \sum_{i=1}^{N} L(y_i, F(x_i))$$
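As a concrete illustration, here is how the empirical loss could be evaluated for the squared error case. This is only a sketch: the arrays `y` and `f` are illustrative stand-ins for the true targets and the current model's predictions.

```python
import numpy as np

def squared_error_loss(y, f):
    """Per-sample squared error loss L(y, F(x)) = 0.5 * (y - F(x))**2."""
    return 0.5 * (y - f) ** 2

# Illustrative targets and current model predictions
y = np.array([3.0, -0.5, 2.0, 7.0])
f = np.array([2.5, 0.0, 2.0, 8.0])

# Empirical loss: average per-sample loss over the training set
# (minimizing the mean or the sum is equivalent)
empirical_loss = squared_error_loss(y, f).mean()
print(empirical_loss)  # 0.1875
```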
Gradient Boosting builds the model in an additive manner:

$$F_M(x) = F_0(x) + \sum_{m=1}^{M} \gamma_m h_m(x)$$
Here, $F_0(x)$ is an initial guess for the model (often the mean of the target values for regression, or the log-odds for classification), and $h_m(x)$ are the base learners (typically decision trees) added sequentially at each iteration $m$. The term $\gamma_m$ representing the step size or weight is often absorbed into the learner or handled by a separate learning rate parameter, as we will see.
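A minimal sketch of how such an additive ensemble would produce predictions, assuming the fitted base learners are kept in a plain Python list of objects exposing a `predict` method (the names `f0`, `trees`, and `learning_rate` are illustrative):

```python
import numpy as np

def predict_additive(X, f0, trees, learning_rate):
    """Additive ensemble prediction: F_M(x) = F_0(x) + nu * sum_m h_m(x)."""
    pred = np.full(X.shape[0], f0)            # constant initial guess F_0(x)
    for tree in trees:                        # base learners added one by one
        pred += learning_rate * tree.predict(X)
    return pred
```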
Suppose we have already built the model up to iteration $m-1$, resulting in $F_{m-1}(x)$. We want to find the next base learner $h_m(x)$ such that adding it to the current model improves the fit and reduces the overall loss:

$$F_m(x) = F_{m-1}(x) + h_m(x)$$
Ideally, we want $h_m(x)$ to point in the direction that maximally reduces the loss function $L$. Consider the loss at iteration $m$:

$$\sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + h_m(x_i)\big)$$
Think of this as taking a step from the current position $F_{m-1}(x)$ in function space. In standard gradient descent, we update parameters by moving in the direction of the negative gradient of the loss function with respect to those parameters. In Gradient Boosting, we do something analogous: we find a function $h_m(x)$ that points in the direction of the negative gradient of the loss function $L$, evaluated with respect to the current model's predictions $F_{m-1}(x_i)$ for each data point $x_i$.
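To make the analogy concrete, the sketch below treats the vector of training predictions as if it were a set of free parameters and takes one plain gradient-descent step on it under squared error loss; the arrays and step size are illustrative. GBM approximates exactly this step with a base learner fitted to the negative gradient values, which is what lets the update generalize to new inputs.

```python
import numpy as np

# Toy targets and current predictions: think of the vector of predictions
# F(x_1), ..., F(x_N) as the "parameters" being optimized
y = np.array([3.0, -0.5, 2.0, 7.0])
f = np.array([2.5, 0.0, 2.0, 8.0])

# Gradient of the squared error loss with respect to each prediction F(x_i)
grad = -(y - f)  # dL/dF(x_i) for L = 0.5 * (y_i - F(x_i))**2

# One gradient-descent step in "function space": nudge every prediction
# against its own gradient
step_size = 0.1
f_updated = f - step_size * grad
```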
Let's calculate the gradient of the loss function with respect to the function values $F(x_i)$ at each point $x_i$, evaluated at the current model $F_{m-1}(x)$:

$$g_{im} = \left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$
The direction we want to move in for each observation $i$ is the negative gradient, $-g_{im}$. These negative gradient values are often called pseudo-residuals:

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$
Why pseudo-residuals? If we use the squared error loss $L(y, F(x)) = \frac{1}{2}(y - F(x))^2$, the gradient is $\frac{\partial L}{\partial F(x)} = -(y - F(x))$. The negative gradient is then $y - F(x)$, which is exactly the residual (the difference between the true value and the prediction). For other loss functions, $r_{im}$ behaves like a residual, indicating the direction and magnitude of the error for each point according to that specific loss function.
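A short sketch of pseudo-residual computation for two common losses: for squared error they are the plain residuals, and for binary log loss (with $F$ on the log-odds scale) they are the label minus the predicted probability. The function names and arrays are illustrative.

```python
import numpy as np

def pseudo_residuals_squared_error(y, f):
    # L = 0.5 * (y - F)**2  ->  negative gradient is y - F, the plain residual
    return y - f

def pseudo_residuals_log_loss(y, f):
    # Binary log loss with F on the log-odds scale:
    # negative gradient is y - sigmoid(F), i.e. label minus predicted probability
    p = 1.0 / (1.0 + np.exp(-f))
    return y - p
```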
The core idea is to fit the next base learner, $h_m(x)$, to approximate these pseudo-residuals $r_{im}$. That is, we train $h_m$ using the original features $x_i$ but with $r_{im}$ as the target variable:

$$h_m = \arg\min_{h} \sum_{i=1}^{N} \big(r_{im} - h(x_i)\big)^2$$
Usually, $h_m(x)$ is constrained to be a shallow decision tree (e.g., CART). The tree is built to predict the pseudo-residuals $r_{im}$ based on the input features $x_i$.
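In practice this step is just a standard regression-tree fit on the pseudo-residuals. A sketch using scikit-learn, assuming `X`, `y`, and the current predictions `f_prev` are NumPy arrays that already exist; under squared error loss the pseudo-residuals are simply `y - f_prev`.

```python
from sklearn.tree import DecisionTreeRegressor

# Pseudo-residuals of the current model under squared error loss
residuals = y - f_prev

# A shallow CART tree fit to the residuals: the features stay the same,
# but the regression target is now the negative gradient, not y itself
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X, residuals)
```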
Once the structure of the tree (i.e., the splits) is determined by fitting to the pseudo-residuals, the optimal values for the terminal nodes (leaves) of the tree need to be determined. Instead of simply using the average pseudo-residual in each leaf $R_{jm}$, we can find the constant value $\gamma_{jm}$ for each leaf that minimizes the original loss function for the samples falling into that leaf:

$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, F_{m-1}(x_i) + \gamma\big)$$
This step ensures that the contribution from the new tree directly minimizes the loss function, given the current model $F_{m-1}(x)$. For some loss functions (like squared error), this simplifies to the average of the pseudo-residuals in the leaf, but for others (like Log Loss or Absolute Error) it requires a different calculation (e.g., the median for Absolute Error). The base learner is then defined such that $h_m(x_i) = \gamma_{jm}$ if $x_i \in R_{jm}$.
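As an example of a loss whose leaf values are not the mean pseudo-residual, consider Absolute Error: the pseudo-residuals are the signs of the residuals, but the optimal constant in each leaf is the median of the raw residuals in that leaf. The sketch below reuses the illustrative `X`, `y`, and `f_prev` arrays and relies on scikit-learn's `tree.apply` to recover leaf membership.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Under absolute error loss the pseudo-residuals are the signs of the residuals
pseudo_res = np.sign(y - f_prev)
tree = DecisionTreeRegressor(max_depth=3).fit(X, pseudo_res)

# Identify which leaf each training sample falls into
leaf_ids = tree.apply(X)

# The gamma minimizing sum |y_i - (F_{m-1}(x_i) + gamma)| over a leaf
# is the median of the raw residuals of the samples in that leaf
leaf_values = {
    leaf: np.median(y[leaf_ids == leaf] - f_prev[leaf_ids == leaf])
    for leaf in np.unique(leaf_ids)
}
```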
Finally, we update the overall model by adding the newly trained base learner $h_m(x)$, scaled by a learning rate $\nu$ (also known as shrinkage):

$$F_m(x) = F_{m-1}(x) + \nu \, h_m(x)$$
The learning rate $\nu$ (typically a small value between 0.01 and 0.3) scales the contribution of each new tree. This reduces the impact of individual trees and helps prevent overfitting, forcing the model to build its prediction more gradually over many iterations. It effectively provides regularization.
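Continuing the sketch above, the update step simply looks up each sample's optimized leaf value, shrinks it by the learning rate, and adds it to the current predictions (the value 0.1 is illustrative):

```python
import numpy as np

learning_rate = 0.1  # illustrative value of the shrinkage parameter nu

# Each sample receives its leaf's optimized value, scaled by the learning rate
tree_contribution = np.array([leaf_values[leaf] for leaf in leaf_ids])
f_new = f_prev + learning_rate * tree_contribution
```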
We can now outline the steps of the generic Gradient Boosting algorithm:
Initialize Model: Start with an initial constant prediction:
$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$$
Iterate for $m = 1$ to $M$ (Number of Trees):
a. Compute Pseudo-Residuals: For each sample $i = 1, \dots, N$:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$
b. Fit Base Learner: Fit a base learner $h_m(x)$ (e.g., a regression tree) to the pseudo-residuals $r_{im}$. This determines the regions (leaves) $R_{jm}$, $j = 1, \dots, J_m$, of the tree.
c. Compute Optimal Leaf Values: For each terminal node (leaf) $R_{jm}$ of the tree:
$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, F_{m-1}(x_i) + \gamma\big)$$
d. Update Model: Update the ensemble model using the learning rate $\nu$:
$$F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} \, I(x \in R_{jm})$$
(Where $I(\cdot)$ is the indicator function. Effectively, $h_m(x)$ now represents the tree with optimal leaf values $\gamma_{jm}$.)
Output Final Model: The final prediction is given by $F_M(x)$. For classification, this might be converted to probabilities (e.g., using the sigmoid function for binary classification). A minimal end-to-end sketch of these steps follows below.
Iterative process of the Gradient Boosting Machine (GBM) algorithm. Each iteration involves computing gradients (pseudo-residuals), fitting a base learner to these gradients, optimizing the learner's contribution, and updating the ensemble model.
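Putting the steps together, here is a minimal gradient boosting regressor for squared error loss built on scikit-learn decision trees. It is a sketch for illustration rather than a production implementation; the class name and hyperparameters are invented to mirror the quantities defined above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGBMRegressor:
    """Minimal gradient boosting for squared error loss (illustrative sketch)."""

    def __init__(self, n_trees=100, learning_rate=0.1, max_depth=3):
        self.n_trees = n_trees
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        # Initialize: F_0(x) is a constant; for squared error this is the mean of y
        self.f0 = np.mean(y)
        self.trees = []
        f = np.full(len(y), self.f0)

        for _ in range(self.n_trees):
            # (a) Pseudo-residuals: negative gradient of squared error loss
            residuals = y - f
            # (b)/(c) Fit a shallow tree to the residuals; for squared error the
            # mean residual in each leaf is already the optimal leaf value
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            # (d) Shrink the new tree's contribution and update the model
            f += self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        f = np.full(X.shape[0], self.f0)
        for tree in self.trees:
            f += self.learning_rate * tree.predict(X)
        return f
```

It is used like any scikit-learn-style estimator, e.g. `SimpleGBMRegressor(n_trees=200, learning_rate=0.05).fit(X_train, y_train).predict(X_test)`, where `X_train` and `y_train` are illustrative NumPy arrays.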
This step-by-step derivation shows how GBM iteratively refines its predictions by sequentially adding base learners trained to correct the errors (as defined by the negative gradient of the loss function) of the existing ensemble. The choice of loss function dictates the specific form of the pseudo-residuals and potentially the leaf optimization step, allowing GBM to be adapted to various regression and classification tasks, which we will explore next.