Gradient Boosting Machines (GBMs) iteratively add base learners, such as decision trees, to minimize a loss function $L(y, F(x))$. In this context, $y$ represents the true target and $F(x)$ denotes the current ensemble prediction. This minimization process is conceptualized as functional gradient descent, where each new tree $h_m(x)$ is trained to approximate the negative gradient of the loss function with respect to the current prediction $F_{m-1}(x)$.
The specific form of these negative gradients, often called pseudo-residuals, depends directly on the chosen loss function. For regression problems, where the goal is to predict a continuous value, several common loss functions are used, each influencing the model's behavior, particularly its sensitivity to outliers. Let's examine the most common ones.
Squared Error (L2 Loss)
The most common loss function for regression is the Squared Error, also known as L2 loss. It measures the squared difference between the true value $y_i$ and the predicted value $F(x_i)$. For a single data point $(x_i, y_i)$, it is defined as:
$$L(y_i, F(x_i)) = \frac{1}{2}\,(y_i - F(x_i))^2$$
The factor of $\frac{1}{2}$ is included for mathematical convenience, simplifying the derivative.
To find the pseudo-residual $r_{im}$ for the $m$-th iteration, we compute the negative gradient of this loss function with respect to the prediction $F(x_i)$, evaluated at the previous step's prediction $F_{m-1}(x_i)$:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = -\left[-(y_i - F(x_i))\right]_{F(x_i) = F_{m-1}(x_i)} = y_i - F_{m-1}(x_i)$$
This result is quite intuitive. When using squared error loss, the pseudo-residual is simply the actual residual of the current model $F_{m-1}(x_i)$. Therefore, each new tree $h_m(x)$ added to the ensemble is trained to predict the errors made by the ensemble so far.
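As a quick sketch of this idea, the L2 pseudo-residuals can be computed directly with NumPy. The arrays y_true and f_pred below are illustrative stand-ins for the targets and the current ensemble predictions, not part of any particular library:

```python
import numpy as np

# True targets and current ensemble predictions F_{m-1}(x_i) (illustrative values)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
f_pred = np.array([2.5, 0.0, 2.0, 8.0])

# For L2 loss, the pseudo-residuals are just the ordinary residuals y_i - F_{m-1}(x_i)
pseudo_residuals_l2 = y_true - f_pred
print(pseudo_residuals_l2)  # [ 0.5 -0.5  0.  -1. ]
```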
Properties:
- Focus on Mean: Minimizing squared error corresponds to modeling the conditional mean of the target variable.
- Sensitivity to Outliers: Because the error is squared, large errors (outliers) have a disproportionately large influence on the loss and consequently on the gradients. A single outlier can significantly impact the training of subsequent trees.
- Smoothness: The loss function is smooth and continuously differentiable, making optimization straightforward.
Absolute Error (L1 Loss)
An alternative is the Absolute Error, or L1 loss, which measures the absolute difference between the true and predicted values:
$$L(y_i, F(x_i)) = |y_i - F(x_i)|$$
The negative gradient for this loss function is:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = -\left[-\operatorname{sign}(y_i - F(x_i))\right]_{F(x_i) = F_{m-1}(x_i)} = \operatorname{sign}(y_i - F_{m-1}(x_i))$$
Here, sign(z) is 1 if z>0, -1 if z<0, and 0 if z=0.
When using absolute error loss, the pseudo-residuals are simply the sign of the actual residuals. Each new tree $h_m(x)$ is trained to predict whether the previous ensemble's prediction was too low ($+1$) or too high ($-1$) for each data point.
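Using the same illustrative arrays as above, the L1 pseudo-residuals reduce to the sign of each residual:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
f_pred = np.array([2.5, 0.0, 2.0, 8.0])

# For L1 loss, the pseudo-residuals are the signs of the residuals: +1, -1, or 0
pseudo_residuals_l1 = np.sign(y_true - f_pred)
print(pseudo_residuals_l1)  # [ 1. -1.  0. -1.]
```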
Properties:
- Focus on Median: Minimizing absolute error corresponds to modeling the conditional median of the target variable.
- Robustness to Outliers: Since the error is not squared, large errors contribute linearly to the total loss. This makes the L1 loss much less sensitive to outliers compared to L2 loss.
- Non-Smoothness: The loss function has a discontinuity in its derivative at $y_i = F(x_i)$ (where the residual is zero). While technically requiring subgradient methods, implementations often handle this by assigning a gradient of 0 or using approximations.
Huber Loss
Huber loss provides a compromise between the sensitivity of Squared Error and the robustness of Absolute Error. It behaves quadratically for small errors and linearly for large errors. It introduces a hyperparameter, δ, which defines the threshold where the behavior changes.
$$L_\delta(y_i, F(x_i)) = \begin{cases} \frac{1}{2}(y_i - F(x_i))^2 & \text{for } |y_i - F(x_i)| \le \delta \\ \delta\left(|y_i - F(x_i)| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$
The term $-\frac{1}{2}\delta^2$ in the linear part ensures the function is continuously differentiable at the points where $|y_i - F(x_i)| = \delta$.
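To check this, let $r = y_i - F(x_i)$ and compare the two branches at the boundary $|r| = \delta$:

$$\tfrac{1}{2}\delta^2 = \delta\left(\delta - \tfrac{1}{2}\delta\right), \qquad \left.\tfrac{d}{dr}\,\tfrac{1}{2}r^2\right|_{r=\delta} = \delta = \left.\tfrac{d}{dr}\,\delta\left(|r| - \tfrac{1}{2}\delta\right)\right|_{r=\delta}$$

Both the value and the slope of the loss therefore agree where the quadratic and linear pieces meet (and symmetrically at $r = -\delta$).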
The corresponding negative gradient (pseudo-residual) is:
$$r_{im} = -\left[\frac{\partial L_\delta(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)} = \begin{cases} y_i - F_{m-1}(x_i) & \text{for } |y_i - F_{m-1}(x_i)| \le \delta \\ \delta \cdot \operatorname{sign}(y_i - F_{m-1}(x_i)) & \text{otherwise} \end{cases}$$
For errors smaller than δ, the pseudo-residual is the actual residual (like L2 loss). For errors larger than δ, the pseudo-residual is capped at ±δ (similar to L1 loss, but scaled by δ).
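In code, this amounts to clipping the residuals to the interval [−δ, δ]. The sketch below uses δ = 1 and illustrative values (the last prediction is shifted slightly so that the clipping is visible):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
f_pred = np.array([2.5, 0.0, 2.0, 8.5])
delta = 1.0

# Huber pseudo-residuals: the raw residual where |residual| <= delta,
# and delta * sign(residual) otherwise -- i.e. the residual clipped to [-delta, delta]
residuals = y_true - f_pred
pseudo_residuals_huber = np.clip(residuals, -delta, delta)
print(pseudo_residuals_huber)  # [ 0.5 -0.5  0.  -1. ]
```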
Properties:
- Hybrid Behavior: Combines the good properties of both L2 (smoothness near the minimum) and L1 (robustness to outliers).
- Tunable Parameter: Requires tuning the δ parameter. A smaller δ makes the loss function behave more like L1 loss, increasing robustness but potentially slowing convergence. A larger δ makes it behave more like L2 loss. δ effectively defines which points are considered outliers.
Comparing Regression Loss Functions
The choice of loss function is an important modeling decision that depends on the specific characteristics of the data and the desired properties of the model.
- Use Squared Error (L2) if your data is relatively clean without significant outliers, or if predicting the mean is the primary goal. It's often the default and computationally efficient.
- Use Absolute Error (L1) if your dataset contains significant outliers and you want a model robust to their influence. This focuses the model on predicting the median.
- Use Huber Loss when you want a balance between L2 and L1, providing robustness to outliers while maintaining smoothness for smaller errors. This requires tuning the δ parameter, often via cross-validation.
The following plot illustrates the shape of these three loss functions as a function of the residual (y − F(x)). We use δ=1 for the Huber loss in this example.
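A minimal matplotlib sketch that could generate such a figure (variable names are arbitrary, and the residual range is chosen purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Residual values y - F(x) over which to evaluate each loss
r = np.linspace(-4, 4, 400)
delta = 1.0

l2_loss = 0.5 * r**2                     # squared error
l1_loss = np.abs(r)                      # absolute error
huber_loss = np.where(                   # quadratic inside [-delta, delta], linear outside
    np.abs(r) <= delta,
    0.5 * r**2,
    delta * (np.abs(r) - 0.5 * delta),
)

plt.plot(r, l2_loss, label="L2 (squared error)")
plt.plot(r, l1_loss, label="L1 (absolute error)")
plt.plot(r, huber_loss, label=f"Huber (delta={delta})")
plt.xlabel("Residual y - F(x)")
plt.ylabel("Loss")
plt.legend()
plt.show()
```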
Comparison of L2 Loss, L1 Loss, and Huber Loss (δ=1). Note the quadratic growth of L2, linear growth of L1, and the transition of Huber from quadratic to linear.
Understanding how these loss functions translate into pseudo-residuals is fundamental to grasping how GBMs learn. The choice impacts which aspects of the error the model focuses on correcting at each iteration. In practice, libraries like Scikit-learn let you specify the desired loss function via parameters (e.g., the loss parameter in GradientBoostingRegressor, which accepts values like 'squared_error', 'absolute_error', and 'huber').
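As a minimal sketch of this, the snippet below fits the same model under each of the three losses on a synthetic dataset; the data and hyperparameters are arbitrary placeholders, and note that for loss='huber' Scikit-learn controls the quadratic-to-linear transition indirectly through its alpha quantile parameter rather than an explicit δ:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, purely for illustration
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same model, three different regression losses
for loss in ["squared_error", "absolute_error", "huber"]:
    model = GradientBoostingRegressor(loss=loss, n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    print(loss, model.score(X_test, y_test))
```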