In the previous section, we established that Gradient Boosting Machines (GBMs) iteratively add base learners (typically decision trees) to minimize a loss function $L(y, F(x))$, where $y$ is the true target and $F(x)$ is the current ensemble prediction. This minimization is framed as functional gradient descent, where each new tree $h_m(x)$ is trained to approximate the negative gradient of the loss function with respect to the current prediction $F_{m-1}(x)$.
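Before looking at specific losses, it may help to see where these pseudo-residuals enter the algorithm. The sketch below is a minimal, illustrative boosting loop, not any library's implementation: the names gradient_boost, neg_gradient, and the shrinkage rate nu are hypothetical, and the loop skips the per-leaf line search that full implementations perform.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, neg_gradient, n_estimators=100, nu=0.1, max_depth=3):
    """Minimal functional gradient descent loop (illustrative sketch only)."""
    F0 = y.mean()                          # constant start (L2-optimal; L1 would use the median)
    F = np.full(len(y), F0)
    trees = []
    for m in range(n_estimators):
        r = neg_gradient(y, F)             # pseudo-residuals at F_{m-1}
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        F = F + nu * tree.predict(X)       # shrunken update of the ensemble
        trees.append(tree)
    return F0, trees
```

The loss functions below differ only in what the neg_gradient callback returns.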
The specific form of these negative gradients, often called pseudo-residuals, depends directly on the chosen loss function. For regression problems, where the goal is to predict a continuous value, several common loss functions are used, each influencing the model's behavior, particularly its sensitivity to outliers. Let's examine the most frequent ones.
The most standard loss function for regression is the Squared Error, also known as L2 loss. It measures the squared difference between the true value $y_i$ and the predicted value $F(x_i)$. For a single data point $(x_i, y_i)$, it is defined as:
$$L(y_i, F(x_i)) = \frac{1}{2}\,(y_i - F(x_i))^2$$
The factor of $\frac{1}{2}$ is included for mathematical convenience, simplifying the derivative.
To find the pseudo-residual $r_{im}$ for the $m$-th iteration, we compute the negative gradient of this loss function with respect to the prediction $F(x_i)$, evaluated at the previous step's prediction $F_{m-1}(x_i)$:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = -\Big[-(y_i - F(x_i))\Big]_{F(x_i) = F_{m-1}(x_i)} = y_i - F_{m-1}(x_i)$$
This result is quite intuitive. When using squared error loss, the pseudo-residual is simply the actual residual of the current model $F_{m-1}(x_i)$. Therefore, each new tree $h_m(x)$ added to the ensemble is trained to predict the errors made by the ensemble so far.
Properties:
- Large errors are penalized quadratically, which makes the model sensitive to outliers: a few extreme points can dominate the gradients.
- The loss is smooth and differentiable everywhere, so the pseudo-residuals are simple and well behaved.
- The optimal constant prediction under this loss is the mean of the targets, so the ensemble effectively estimates the conditional mean.
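The derivation above can be checked with a few lines of NumPy. The helper name squared_error_pseudo_residuals below is just an illustrative choice; it returns the negative gradient of the $\frac{1}{2}$-scaled squared error, which is exactly the plain residual $y - F(x)$.

```python
import numpy as np

def squared_error_pseudo_residuals(y, F):
    # Negative gradient of L = 0.5 * (y - F)^2 with respect to F
    return y - F

y = np.array([3.0, -1.0, 2.5, 0.0])
F = np.array([2.0,  0.5, 2.5, 1.0])   # current ensemble predictions F_{m-1}(x_i)

print(squared_error_pseudo_residuals(y, F))   # [ 1.  -1.5  0.  -1. ]
```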
An alternative is the Absolute Error, or L1 loss, which measures the absolute difference between the true and predicted values:
$$L(y_i, F(x_i)) = |y_i - F(x_i)|$$
The negative gradient for this loss function is:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = -\Big[-\operatorname{sign}(y_i - F(x_i))\Big]_{F(x_i) = F_{m-1}(x_i)} = \operatorname{sign}(y_i - F_{m-1}(x_i))$$
Here, $\operatorname{sign}(z)$ is $1$ if $z > 0$, $-1$ if $z < 0$, and $0$ if $z = 0$.
When using absolute error loss, the pseudo-residuals are simply the sign of the actual residuals. Each new tree $h_m(x)$ is trained to predict whether the previous ensemble's prediction was too low ($+1$) or too high ($-1$) for each data point.
Properties:
- Robust to outliers: every point contributes a gradient of the same magnitude, so extreme errors do not dominate the fit.
- Not differentiable at zero; in practice the (sub)gradient there is taken to be $0$.
- The optimal constant prediction under this loss is the median of the targets, so the ensemble effectively estimates the conditional median.
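The corresponding sketch for absolute error is equally short; the helper name is again hypothetical. Notice that the magnitude of each error is discarded and only its direction survives, which is precisely what makes this loss insensitive to outliers.

```python
import numpy as np

def absolute_error_pseudo_residuals(y, F):
    # Negative gradient of L = |y - F|: the sign of the residual
    return np.sign(y - F)

y = np.array([3.0, -1.0, 2.5, 0.0])
F = np.array([2.0,  0.5, 2.5, 1.0])

print(absolute_error_pseudo_residuals(y, F))   # [ 1. -1.  0. -1.]
```

Compare this with the squared error example above on the same data: the point with residual $-1.5$ now contributes no more to the gradient than the point with residual $-1$.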
Huber loss provides a compromise between the sensitivity of Squared Error and the robustness of Absolute Error. It behaves quadratically for small errors and linearly for large errors. It introduces a hyperparameter, $\delta$, which defines the threshold where the behavior changes.
$$L_\delta(y_i, F(x_i)) = \begin{cases} \dfrac{1}{2}(y_i - F(x_i))^2 & \text{for } |y_i - F(x_i)| \le \delta \\[6pt] \delta\left(|y_i - F(x_i)| - \dfrac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$
The term $-\frac{1}{2}\delta^2$ in the linear part ensures the function is continuously differentiable at the points where $|y_i - F(x_i)| = \delta$.
The corresponding negative gradient (pseudo-residual) is:
$$r_{im} = -\left[\frac{\partial L_\delta(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = \begin{cases} y_i - F_{m-1}(x_i) & \text{for } |y_i - F_{m-1}(x_i)| \le \delta \\[6pt] \delta \cdot \operatorname{sign}(y_i - F_{m-1}(x_i)) & \text{otherwise} \end{cases}$$
For errors smaller than $\delta$, the pseudo-residual is the actual residual (like L2 loss). For errors larger than $\delta$, the pseudo-residual is capped at $\pm\delta$ (similar to L1 loss, but scaled by $\delta$).
Properties:
- Combines the outlier robustness of L1 loss for large errors with the smooth, residual-sized gradients of L2 loss for small errors.
- Continuously differentiable everywhere, including at $|y_i - F(x_i)| = \delta$.
- Requires choosing the threshold $\delta$, either from domain knowledge about what counts as a "large" error or by treating it as a hyperparameter to tune.
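Since the pseudo-residual is the plain residual inside the threshold and $\pm\delta$ outside it, it can be computed by simply clipping the residual, as the sketch below shows (again with an illustrative helper name).

```python
import numpy as np

def huber_pseudo_residuals(y, F, delta=1.0):
    # Residual inside the threshold, capped at +/- delta outside it
    r = y - F
    return np.clip(r, -delta, delta)

y = np.array([3.0, -1.0, 2.5, 0.0])
F = np.array([2.0,  0.5, 2.5, 1.0])

print(huber_pseudo_residuals(y, F, delta=1.0))   # [ 1. -1.  0. -1.]
```

On this data, only the point with residual $-1.5$ exceeds the threshold and gets capped; the other gradients are identical to the squared error case.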
The choice of loss function is an important modeling decision that depends on the specific characteristics of the data and the desired properties of the model.
The following plot illustrates the shape of these three loss functions as a function of the residual $(y - F(x))$. We use $\delta = 1$ for the Huber loss in this example.
Comparison of L2 Loss, L1 Loss, and Huber Loss ($\delta = 1$). Note the quadratic growth of L2, the linear growth of L1, and the transition of Huber from quadratic to linear.
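If you want to reproduce a plot like this yourself, a short matplotlib sketch along the following lines works; the residual range and styling are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

r = np.linspace(-3, 3, 400)          # residual y - F(x)
delta = 1.0

l2 = 0.5 * r**2
l1 = np.abs(r)
huber = np.where(np.abs(r) <= delta,
                 0.5 * r**2,
                 delta * (np.abs(r) - 0.5 * delta))

plt.plot(r, l2, label="Squared error (L2)")
plt.plot(r, l1, label="Absolute error (L1)")
plt.plot(r, huber, label=f"Huber (delta={delta})")
plt.xlabel("Residual y - F(x)")
plt.ylabel("Loss")
plt.legend()
plt.show()
```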
Understanding how these loss functions translate into pseudo-residuals is fundamental to grasping how GBMs learn. The choice impacts which aspects of the error the model focuses on correcting at each iteration. In practice, libraries like Scikit-learn allow you to specify the desired loss function easily via parameters (e.g., the loss parameter in GradientBoostingRegressor, which accepts values such as 'squared_error', 'absolute_error', and 'huber').
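As a concrete illustration, here is a small sketch that fits GradientBoostingRegressor with each of the three losses on synthetic data; the dataset and hyperparameters are arbitrary choices for demonstration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for loss in ["squared_error", "absolute_error", "huber"]:
    model = GradientBoostingRegressor(loss=loss, n_estimators=200,
                                      learning_rate=0.1, random_state=0)
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{loss:>15}: test MAE = {mae:.2f}")
```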