Additive modeling frameworks build models sequentially. The standard Gradient Boosting Machine (GBM) algorithm constructs such sequences by minimizing a loss function using a process analogous to gradient descent, but operating in function space.
Our objective is to find a function $F(x)$ that minimizes the expected value of a specified loss function $L(y, F(x))$, where $y$ is the true target value and $F(x)$ is our model's prediction. Since we work with a finite training set $\{(x_i, y_i)\}_{i=1}^{N}$, we aim to minimize the empirical loss:

$$\hat{F} = \arg\min_{F} \sum_{i=1}^{N} L(y_i, F(x_i))$$
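As a concrete illustration, here is how the empirical loss could be evaluated for the squared error case. This is only a sketch: the arrays `y` and `f` are illustrative stand-ins for the true targets and the current model's predictions.

```python
import numpy as np

def squared_error_loss(y, f):
    """Per-sample squared error loss L(y, F(x)) = 0.5 * (y - F(x))**2."""
    return 0.5 * (y - f) ** 2

# Illustrative targets and current model predictions
y = np.array([3.0, -0.5, 2.0, 7.0])
f = np.array([2.5, 0.0, 2.0, 8.0])

# Empirical loss: average per-sample loss over the training set
# (minimizing the mean or the sum is equivalent)
empirical_loss = squared_error_loss(y, f).mean()
print(empirical_loss)  # 0.1875
```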
Gradient Boosting builds the model in an additive manner:

$$F_M(x) = F_0(x) + \sum_{m=1}^{M} \gamma_m h_m(x)$$
Here, $F_0(x)$ is an initial guess for the model (often the mean of the target values for regression, or the log-odds for classification), and $h_m(x)$ are the base learners (typically decision trees) added sequentially at each iteration $m$. The term $\gamma_m$ representing the step size or weight is often absorbed into the learner or handled by a separate learning rate parameter, as we will see.
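A minimal sketch of how such an additive ensemble would produce predictions, assuming the fitted base learners are kept in a plain Python list of objects exposing a `predict` method (the names `f0`, `trees`, and `learning_rate` are illustrative):

```python
import numpy as np

def predict_additive(X, f0, trees, learning_rate):
    """Additive ensemble prediction: F_M(x) = F_0(x) + nu * sum_m h_m(x)."""
    pred = np.full(X.shape[0], f0)            # constant initial guess F_0(x)
    for tree in trees:                        # base learners added one by one
        pred += learning_rate * tree.predict(X)
    return pred
```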
Suppose we have already built the model up to iteration $m-1$, resulting in $F_{m-1}(x)$. We want to find the next base learner $h_m(x)$ such that adding it to the current model improves the fit and reduces the overall loss:

$$F_m(x) = F_{m-1}(x) + h_m(x)$$
Ideally, we want $h_m(x)$ to point in the direction that maximally reduces the loss function $L$. Consider the loss at iteration $m$:

$$\sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + h_m(x_i)\big)$$
Think of this as taking a step from the current position $F_{m-1}(x)$ in function space. In standard gradient descent, we update parameters by moving in the direction of the negative gradient of the loss function with respect to those parameters. In Gradient Boosting, we do something analogous: we find a function $h_m(x)$ that points in the direction of the negative gradient of the loss function $L$, evaluated with respect to the current model's predictions $F_{m-1}(x_i)$ for each data point $x_i$.
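To make the analogy concrete, the sketch below treats the vector of training predictions as if it were a set of free parameters and takes one plain gradient-descent step on it under squared error loss; the arrays and step size are illustrative. GBM approximates exactly this step with a base learner fitted to the negative gradient values, which is what lets the update generalize to new inputs.

```python
import numpy as np

# Toy targets and current predictions: think of the vector of predictions
# F(x_1), ..., F(x_N) as the "parameters" being optimized
y = np.array([3.0, -0.5, 2.0, 7.0])
f = np.array([2.5, 0.0, 2.0, 8.0])

# Gradient of the squared error loss with respect to each prediction F(x_i)
grad = -(y - f)  # dL/dF(x_i) for L = 0.5 * (y_i - F(x_i))**2

# One gradient-descent step in "function space": nudge every prediction
# against its own gradient
step_size = 0.1
f_updated = f - step_size * grad
```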
Let's calculate the gradient of the loss function with respect to the function values $F(x_i)$ at each point $x_i$, evaluated at the current model $F_{m-1}(x)$:

$$g_{im} = \left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$
The direction we want to move in for each observation $i$ is the negative gradient, $-g_{im}$. These negative gradient values are often called pseudo-residuals:

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$
Why pseudo-residuals? If we use the squared error loss $L(y, F(x)) = \frac{1}{2}(y - F(x))^2$, the gradient is $\frac{\partial L}{\partial F(x)} = -(y - F(x))$. The negative gradient is then $y - F(x)$, which is exactly the residual (the difference between the true value and the prediction). For other loss functions, $r_{im}$ behaves like a residual, indicating the direction and magnitude of the error for each point according to that specific loss function.
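A short sketch of pseudo-residual computation for two common losses: for squared error they are the plain residuals, and for binary log loss (with $F$ on the log-odds scale) they are the label minus the predicted probability. The function names and arrays are illustrative.

```python
import numpy as np

def pseudo_residuals_squared_error(y, f):
    # L = 0.5 * (y - F)**2  ->  negative gradient is y - F, the plain residual
    return y - f

def pseudo_residuals_log_loss(y, f):
    # Binary log loss with F on the log-odds scale:
    # negative gradient is y - sigmoid(F), i.e. label minus predicted probability
    p = 1.0 / (1.0 + np.exp(-f))
    return y - p
```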
The core idea is to fit the next base learner, $h_m(x)$, to approximate these pseudo-residuals $r_{im}$. That is, we train $h_m$ using the original features $x_i$ but with $r_{im}$ as the target variable:

$$h_m = \arg\min_{h} \sum_{i=1}^{N} \big(r_{im} - h(x_i)\big)^2$$
Usually, $h_m(x)$ is constrained to be a shallow decision tree (e.g., CART). The tree is built to predict the pseudo-residuals $r_{im}$ based on the input features $x_i$.
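In practice this step is just a standard regression-tree fit on the pseudo-residuals. A sketch using scikit-learn, assuming `X`, `y`, and the current predictions `f_prev` are NumPy arrays that already exist; under squared error loss the pseudo-residuals are simply `y - f_prev`.

```python
from sklearn.tree import DecisionTreeRegressor

# Pseudo-residuals of the current model under squared error loss
residuals = y - f_prev

# A shallow CART tree fit to the residuals: the features stay the same,
# but the regression target is now the negative gradient, not y itself
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X, residuals)
```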
Once the structure of the tree (i.e., the splits) is determined by fitting to the pseudo-residuals, the optimal values for the terminal nodes (leaves) of the tree need to be determined. Instead of simply using the average pseudo-residual in each leaf $R_{jm}$, we can find the constant value $\gamma_{jm}$ for each leaf that minimizes the original loss function for the samples falling into that leaf:

$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, F_{m-1}(x_i) + \gamma\big)$$
This step ensures that the contribution from the new tree directly minimizes the loss function, given the current model $F_{m-1}(x)$. For some loss functions (like squared error), this simplifies to the average of the pseudo-residuals in the leaf, but for others (like Log Loss or Absolute Error) it requires a different calculation (e.g., the median for Absolute Error). The base learner is then defined such that $h_m(x_i) = \gamma_{jm}$ if $x_i \in R_{jm}$.
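As an example of a loss whose leaf values are not the mean pseudo-residual, consider Absolute Error: the pseudo-residuals are the signs of the residuals, but the optimal constant in each leaf is the median of the raw residuals in that leaf. The sketch below reuses the illustrative `X`, `y`, and `f_prev` arrays and relies on scikit-learn's `tree.apply` to recover leaf membership.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Under absolute error loss the pseudo-residuals are the signs of the residuals
pseudo_res = np.sign(y - f_prev)
tree = DecisionTreeRegressor(max_depth=3).fit(X, pseudo_res)

# Identify which leaf each training sample falls into
leaf_ids = tree.apply(X)

# The gamma minimizing sum |y_i - (F_{m-1}(x_i) + gamma)| over a leaf
# is the median of the raw residuals of the samples in that leaf
leaf_values = {
    leaf: np.median(y[leaf_ids == leaf] - f_prev[leaf_ids == leaf])
    for leaf in np.unique(leaf_ids)
}
```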
Finally, we update the overall model by adding the newly trained base learner $h_m(x)$, scaled by a learning rate $\nu$ (also known as shrinkage):

$$F_m(x) = F_{m-1}(x) + \nu \, h_m(x)$$
The learning rate $\nu$ (typically a small value between 0.01 and 0.3) scales the contribution of each new tree. This reduces the impact of individual trees and helps prevent overfitting, forcing the model to build its prediction more gradually over many iterations. It effectively provides regularization.
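Continuing the sketch above, the update step simply looks up each sample's optimized leaf value, shrinks it by the learning rate, and adds it to the current predictions (the value 0.1 is illustrative):

```python
import numpy as np

learning_rate = 0.1  # illustrative value of the shrinkage parameter nu

# Each sample receives its leaf's optimized value, scaled by the learning rate
tree_contribution = np.array([leaf_values[leaf] for leaf in leaf_ids])
f_new = f_prev + learning_rate * tree_contribution
```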
We can now outline the steps of the generic Gradient Boosting algorithm:
Initialize Model: Start with an initial constant prediction:
$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$$
Iterate for $m = 1$ to $M$ (Number of Trees):
a. Compute Pseudo-Residuals: For each sample $i = 1, \dots, N$:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$
b. Fit Base Learner: Fit a base learner $h_m(x)$ (e.g., a regression tree) to the pseudo-residuals $r_{im}$. This determines the regions (leaves) $R_{jm}$, $j = 1, \dots, J_m$, of the tree.
c. Compute Optimal Leaf Values: For each terminal node (leaf) $R_{jm}$ of the tree:
$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, F_{m-1}(x_i) + \gamma\big)$$
d. Update Model: Update the ensemble model using the learning rate $\nu$:
$$F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} \, I(x \in R_{jm})$$
(Where $I(\cdot)$ is the indicator function. Effectively, $h_m(x)$ now represents the tree with optimal leaf values $\gamma_{jm}$.)
Output Final Model: The final prediction is given by $F_M(x)$. For classification, this might be converted to probabilities (e.g., using the sigmoid function for binary classification). A minimal end-to-end sketch of these steps follows below.
Iterative process of the Gradient Boosting Machine (GBM) algorithm. Each iteration involves computing gradients (pseudo-residuals), fitting a base learner to these gradients, optimizing the learner's contribution, and updating the ensemble model.
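Putting the steps together, here is a minimal gradient boosting regressor for squared error loss built on scikit-learn decision trees. It is a sketch for illustration rather than a production implementation; the class name and hyperparameters are invented to mirror the quantities defined above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGBMRegressor:
    """Minimal gradient boosting for squared error loss (illustrative sketch)."""

    def __init__(self, n_trees=100, learning_rate=0.1, max_depth=3):
        self.n_trees = n_trees
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        # Initialize: F_0(x) is a constant; for squared error this is the mean of y
        self.f0 = np.mean(y)
        self.trees = []
        f = np.full(len(y), self.f0)

        for _ in range(self.n_trees):
            # (a) Pseudo-residuals: negative gradient of squared error loss
            residuals = y - f
            # (b)/(c) Fit a shallow tree to the residuals; for squared error the
            # mean residual in each leaf is already the optimal leaf value
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            # (d) Shrink the new tree's contribution and update the model
            f += self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        f = np.full(X.shape[0], self.f0)
        for tree in self.trees:
            f += self.learning_rate * tree.predict(X)
        return f
```

It is used like any scikit-learn-style estimator, e.g. `SimpleGBMRegressor(n_trees=200, learning_rate=0.05).fit(X_train, y_train).predict(X_test)`, where `X_train` and `y_train` are illustrative NumPy arrays.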
This step-by-step derivation shows how GBM iteratively refines its predictions by sequentially adding base learners trained to correct the errors (as defined by the negative gradient of the loss function) of the existing ensemble. The choice of loss function dictates the specific form of the pseudo-residuals and potentially the leaf optimization step, allowing GBM to be adapted to various regression and classification tasks, which we will explore next.