The flexibility of the Gradient Boosting Machine comes from its ability to optimize any differentiable loss function. For regression problems, this means we can choose a function that best reflects our performance objective and the nature of our data's errors. The loss function directly guides the algorithm by defining the "mistakes" that each subsequent tree should try to correct.
The most common choice for regression, and often the default in many libraries, is the Mean Squared Error (MSE), also known as L2 loss. It measures the average of the squares of the errors. For a single prediction $F(x)$ of a true value $y$, the loss is calculated as:

$$L(y, F(x)) = \frac{1}{2}\left(y - F(x)\right)^2$$

The factor of $\frac{1}{2}$ is included for mathematical convenience, as it simplifies the derivative.
The primary characteristic of MSE is that it penalizes larger errors much more severely than smaller ones. An error of 4 is penalized 16 times more than an error of 1. This property makes the model focus intensely on reducing large errors, which is often desirable.
The connection to the core GBM algorithm becomes clear when we find the negative gradient of this loss function with respect to the prediction $F(x)$:

$$-\frac{\partial L(y, F(x))}{\partial F(x)} = y - F(x)$$
This reveals a wonderfully intuitive result: when using MSE, the negative gradient is exactly the residual error. This is why the standard GBM for regression is often described as an algorithm where each new tree is trained to predict the residuals of the preceding model's predictions.
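A small NumPy sketch (with illustrative data) shows this directly: under MSE, the pseudo-targets handed to each new tree are exactly the residuals.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # L2 loss with the 1/2 factor: 0.5 * (y - F(x))^2
    return 0.5 * (y_true - y_pred) ** 2

def mse_negative_gradient(y_true, y_pred):
    # Negative gradient of the L2 loss: simply the residual y - F(x)
    return y_true - y_pred

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

print(mse_negative_gradient(y_true, y_pred))  # [ 0.5 -0.5  0.  -1. ], exactly the residuals
```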
However, MSE's sensitivity to large errors can also be a disadvantage. If your dataset contains significant outliers, the model might dedicate too much of its capacity to correcting for these few anomalous points, potentially at the expense of its performance on the rest of the data.
As an alternative, we can use the Mean Absolute Error (MAE), or L1 loss. This function measures the average of the absolute differences between the true values and the predictions:

$$L(y, F(x)) = \left|y - F(x)\right|$$
Unlike MSE, MAE penalizes errors linearly. An error of 4 is penalized only 4 times more than an error of 1. This makes models trained with MAE more resistant to outliers, as a single large error will not dominate the gradient calculation.
The negative gradient for MAE is the sign of the residual:

$$-\frac{\partial L(y, F(x))}{\partial F(x)} = \operatorname{sign}\left(y - F(x)\right)$$
The negative gradient is either +1 or -1 (depending on whether the prediction was too low or too high), indicating the direction of the error but not its magnitude. This prevents outliers from having an outsized influence on the training of subsequent weak learners.
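A matching sketch for MAE (again with made-up numbers) shows how an outlier's influence is capped: the pseudo-targets carry only the sign of each residual.

```python
import numpy as np

def mae_loss(y_true, y_pred):
    # L1 loss: |y - F(x)|
    return np.abs(y_true - y_pred)

def mae_negative_gradient(y_true, y_pred):
    # Only the direction of the error survives, not its magnitude
    return np.sign(y_true - y_pred)

y_true = np.array([3.0, -0.5, 2.0, 100.0])  # the last target is an outlier
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

print(mae_negative_gradient(y_true, y_pred))  # [ 1. -1.  0.  1.], the outlier gets no extra weight
```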
Huber loss provides a compromise between the sensitivity of MSE and the robustness of MAE. It behaves quadratically (like MSE) for small errors and linearly (like MAE) for large errors. This allows it to be less sensitive to outliers while still finely tuning predictions close to the true value.
The function is defined piecewise:

$$L_\delta(y, F(x)) = \begin{cases} \frac{1}{2}\left(y - F(x)\right)^2 & \text{for } \left|y - F(x)\right| \le \delta \\ \delta\left(\left|y - F(x)\right| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$

The parameter $\delta$ (delta) is a threshold that you can tune. It defines the point at which the loss function transitions from quadratic to linear. Errors smaller than $\delta$ are minimized with a squared term, while larger errors are minimized with an absolute term.
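The negative gradient follows directly from the piecewise definition: it equals the residual for small errors and is clipped to $\pm\delta$ for large ones. A minimal sketch, assuming a hand-picked $\delta$:

```python
import numpy as np

def huber_negative_gradient(y_true, y_pred, delta=1.0):
    residual = y_true - y_pred
    return np.where(
        np.abs(residual) <= delta,
        residual,                    # quadratic region: behaves like MSE
        delta * np.sign(residual),   # linear region: behaves like a scaled MAE
    )

y_true = np.array([3.0, -0.5, 2.0, 100.0])  # the last target is an outlier
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

print(huber_negative_gradient(y_true, y_pred, delta=1.0))
# [ 0.5 -0.5  0.   1. ], small errors keep their magnitude, the outlier is capped at delta
```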
Plotting these functions against the prediction error makes the comparison clear: MSE grows quadratically, heavily penalizing large errors; MAE grows linearly, making it less sensitive to outliers; and Huber loss combines these properties, acting quadratically for small errors and linearly for large ones.
Your choice of loss function should be guided by your dataset and your modeling goals. If large errors are genuinely costly and your data is clean, MSE's strong penalty on big mistakes works in your favor; if the targets contain outliers that you do not want to dominate training, MAE or Huber loss are safer choices.
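In scikit-learn, for instance, the loss is a single constructor argument of GradientBoostingRegressor. The option names below match recent releases (older versions used 'ls' and 'lad'), and for Huber loss the alpha parameter sets the quantile used to determine the transition point rather than $\delta$ directly:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Squared error (MSE) is the default
gbm_l2 = GradientBoostingRegressor(loss="squared_error")

# Absolute error (MAE) for targets with heavy outliers
gbm_l1 = GradientBoostingRegressor(loss="absolute_error")

# Huber loss; alpha is the quantile that controls where the quadratic/linear switch happens
gbm_huber = GradientBoostingRegressor(loss="huber", alpha=0.9)
```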
Ultimately, the loss function is another tool at your disposal for building a model that aligns with the specific characteristics of your data. Experimenting with different loss functions can sometimes lead to meaningful improvements in model performance.