Having revisited the concepts of ensemble methods, the role of decision trees, additive modeling, and gradient descent, we can now synthesize these ideas into the original Gradient Boosting Machine (GBM) algorithm, as proposed by Jerome Friedman. GBM provides a specific and powerful framework for constructing additive models iteratively.
At its heart, GBM is an application of gradient descent optimization, but instead of optimizing parameters in a fixed model structure, it optimizes in the space of functions. Imagine you have an existing model $F_{m-1}(x)$ built after $m-1$ iterations. To improve it, we want to add a new function (a base learner, typically a decision tree) $h_m(x)$ such that the new model $F_m(x) = F_{m-1}(x) + h_m(x)$ minimizes the overall loss $L(y, F(x))$.
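To make the analogy concrete, it helps to place the familiar parameter-space update next to the function-space update that GBM performs (here $\eta$ denotes a step size, or learning rate, introduced for illustration):

$$\theta_m = \theta_{m-1} - \eta \, \nabla_{\theta} L(\theta_{m-1}) \qquad \text{versus} \qquad F_m(x) = F_{m-1}(x) + \eta \, h_m(x),$$

where $h_m$ is fitted to approximate the negative gradient of the loss with respect to the current predictions $F_{m-1}(x_i)$, as formalized next.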
The core insight of GBM is to choose the new function $h_m(x)$ to point in the negative gradient direction of the loss function $L$ with respect to the current model's predictions $F_{m-1}(x)$. For a given observation $(x_i, y_i)$, the negative gradient, often called the pseudo-residual, is calculated as:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$

This $r_{im}$ represents the "direction" in which the prediction $F(x_i)$ should move to best decrease the loss for observation $i$. The algorithm then fits a base learner $h_m(x)$ (e.g., a CART regression tree) to predict these pseudo-residuals based on the input features $x_i$. In essence, the new tree learns the errors made by the current ensemble $F_{m-1}(x)$, framed in terms of the loss function's gradient.
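As a concrete illustration of a single boosting step, the sketch below computes pseudo-residuals for squared-error loss (where they reduce to ordinary residuals) and fits a shallow regression tree to them. The synthetic data, the `learning_rate` value, and the tree depth are arbitrary choices for illustration, not prescriptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One boosting step for squared-error loss L(y, F) = 0.5 * (y - F)^2,
# where the negative gradient (pseudo-residual) is simply y - F_{m-1}(x).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

F_prev = np.full_like(y, y.mean())   # current ensemble predictions F_{m-1}(x_i)
pseudo_residuals = y - F_prev        # r_{im} for squared-error loss

# Fit a shallow regression tree h_m(x) to the pseudo-residuals.
h_m = DecisionTreeRegressor(max_depth=3)
h_m.fit(X, pseudo_residuals)

# Add the new tree's (scaled) predictions to form the updated model F_m(x).
learning_rate = 0.1
F_new = F_prev + learning_rate * h_m.predict(X)
```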
The overall GBM algorithm proceeds as follows:

1. Initialize the model with a constant prediction: $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$.
2. For $m = 1$ to $M$:
    a. Compute the pseudo-residuals $r_{im}$ for every observation $i = 1, \dots, N$.
    b. Fit a base learner $h_m(x)$ (typically a regression tree) to the pseudo-residuals, using the pairs $\{(x_i, r_{im})\}$.
    c. Find the multiplier $\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + \gamma \, h_m(x_i))$; with trees, a separate value is usually computed for each terminal node.
    d. Update the model: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$, often scaled by a learning rate.
3. Output the final model $F_M(x)$.

The iterative process of the Gradient Boosting Machine algorithm.
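The loop above can be sketched compactly in code. Below is a minimal from-scratch version for squared-error loss, using scikit-learn regression trees as base learners; the class name `SimpleGBM` and its parameters are illustrative choices, not part of any library, and the sketch omits refinements such as subsampling and per-leaf line searches for other losses.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGBM:
    """Minimal gradient boosting for squared-error loss (illustrative only)."""

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []

    def fit(self, X, y):
        # Step 1: initialize with the constant that minimizes squared error (the mean).
        self.f0 = np.mean(y)
        F = np.full(len(y), self.f0)
        for m in range(self.n_estimators):
            # Step 2a: pseudo-residuals; for 0.5 * (y - F)^2 these are y - F.
            residuals = y - F
            # Step 2b: fit a regression tree to the pseudo-residuals.
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            # Steps 2c-2d: update the ensemble with the scaled tree predictions.
            F += self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        # Step 3: sum the initial constant and all scaled tree predictions.
        F = np.full(X.shape[0], self.f0)
        for tree in self.trees:
            F += self.learning_rate * tree.predict(X)
        return F
```

For squared-error loss, the leaf means computed by `DecisionTreeRegressor` already coincide with the optimal per-leaf step from step 2c, which is why the update only applies the learning rate; other losses would require a separate per-leaf optimization.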
The flexibility of GBM comes from the choice of the loss function $L$ and the type of base learner $h_m$. While other learners are possible, regression trees are overwhelmingly used because they naturally handle the task of predicting the pseudo-residuals (which are continuous values) and can capture complex non-linearities and interactions in the data. The structure of the tree itself (splits, depth) is determined during the fitting process in step 2b.
Common loss functions include Mean Squared Error ($L(y, F) = \frac{1}{2}(y - F)^2$) for regression, leading to standard residuals ($r_{im} = y_i - F_{m-1}(x_i)$), and Log Loss (Deviance) for binary classification. The choice of loss function directly impacts the calculation of the pseudo-residuals and shapes how the model prioritizes correcting different types of errors.
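To see how the loss choice changes the targets each new tree is trained on, here is a small sketch of the negative-gradient (pseudo-residual) computations for these two losses; the function names are illustrative rather than taken from any particular library.

```python
import numpy as np

def negative_gradient_squared_error(y, F):
    """Pseudo-residuals for L(y, F) = 0.5 * (y - F)^2: simply y - F."""
    return y - F

def negative_gradient_log_loss(y, F):
    """Pseudo-residuals for binomial deviance with labels y in {0, 1}.

    Here F is the raw score (log-odds); the negative gradient is
    y - sigmoid(F), i.e. label minus predicted probability.
    """
    prob = 1.0 / (1.0 + np.exp(-F))
    return y - prob

# The same raw scores produce different pseudo-residuals depending
# on the loss, which is what steers each new tree.
y_reg = np.array([3.0, -1.0, 2.5])
y_clf = np.array([1.0, 0.0, 1.0])
F = np.array([2.0, 0.5, -1.0])

print(negative_gradient_squared_error(y_reg, F))  # residuals y - F
print(negative_gradient_log_loss(y_clf, F))       # label minus predicted probability
```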
This foundational GBM algorithm provides a powerful and general approach to boosting. However, it has limitations, particularly regarding computational efficiency, regularization, and the handling of specific data types. These limitations motivated the development of advanced algorithms such as XGBoost, LightGBM, and CatBoost, which we will explore in detail in subsequent chapters. They build upon this core GBM framework by introducing sophisticated regularization techniques, optimized tree-building algorithms, and specialized handling for various data characteristics.