The AdaBoost algorithm focuses on re-weighting data points to force subsequent models to pay more attention to the ones that were previously misclassified. Gradient Boosting takes a different and more direct approach to correcting errors. It reframes the boosting problem as an optimization task where the goal is to minimize a specified loss function. The mechanism for this optimization is gradient descent, but with a twist. Instead of updating parameters in a single complex model, we are sequentially adding simple models (weak learners) that move our total prediction in the right direction.
The "right direction" is important. In optimization, the steepest and most direct path toward a minimum is the direction of the negative gradient. Gradient Boosting applies this principle by training each new weak learner to predict the negative gradient of the loss function, calculated with respect to the predictions of the existing ensemble.
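In symbols, one boosting round can be written compactly. Using common notation (where $F_{m-1}$ is the current ensemble, $h_m$ is the weak learner added at round $m$, and $\nu$ is a learning rate that scales each step, a standard refinement rather than a requirement of the idea itself):

$$
r_i^{(m)} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}},
\qquad
F_m(x) = F_{m-1}(x) + \nu\, h_m(x)
$$

The weak learner $h_m$ is trained to predict the targets $r_i^{(m)}$, which are exactly the negative gradients described above.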
This might sound abstract, so let's make it concrete with the most common loss function for regression: Mean Squared Error (MSE). The MSE loss for a single observation is defined as:

$$
L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2
$$
Here, $y$ is the true value and $\hat{y}$ is our current model's prediction. (The factor of $\frac{1}{2}$ is a convention that keeps the gradient tidy; it does not change where the minimum is.) In the context of Gradient Boosting, $\hat{y}$ is the prediction from the current ensemble of trees.
To improve our model, we need to know how to adjust the prediction $\hat{y}$ to reduce the loss. We find this direction by calculating the gradient (the derivative, in this single-variable case) of the loss function with respect to the prediction $\hat{y}$:

$$
\frac{\partial L(y, \hat{y})}{\partial \hat{y}} = -(y - \hat{y})
$$
The gradient tells us the direction of steepest ascent. To minimize the loss, we must move in the opposite direction, which is the negative gradient:

$$
-\frac{\partial L(y, \hat{y})}{\partial \hat{y}} = y - \hat{y}
$$
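As a quick numerical sanity check, a finite-difference estimate of this derivative matches the formula above. The sketch below uses NumPy, and the sample values are made up purely for illustration:

```python
import numpy as np

# Illustrative values: a true target and the current ensemble's prediction.
y, y_hat = 3.0, 2.2

def mse_loss(y, y_hat):
    # One-half squared error, so the derivative works out to -(y - y_hat).
    return 0.5 * (y - y_hat) ** 2

# Finite-difference approximation of dL/d(y_hat).
eps = 1e-6
grad_numeric = (mse_loss(y, y_hat + eps) - mse_loss(y, y_hat - eps)) / (2 * eps)

print(grad_numeric)   # approx -0.8, matching -(y - y_hat)
print(y - y_hat)      # the negative gradient, i.e. the residual: 0.8
```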
This result is remarkably simple and intuitive. The negative gradient of the MSE loss function is simply the residual error, the difference between the true value and the current prediction. This means that for a regression problem with MSE, each new tree is trained to predict the errors made by all the preceding trees. The algorithm is literally chasing its own mistakes, fitting a model to the remaining error at each step.
Figure: The iterative process in Gradient Boosting for regression with MSE. Each new learner ($h$) is trained on the residual errors ($r$) of the previous ensemble's prediction ($F$), and the ensemble is updated.
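The loop below sketches this process directly, using shallow regression trees from scikit-learn as the weak learners. The toy data, the number of rounds, the tree depth, and the learning_rate value are all illustrative choices, not fixed parts of the algorithm:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_rounds = 50
learning_rate = 0.1

# F starts as a constant prediction: the mean of the targets.
F = np.full_like(y, y.mean())
trees = []

for m in range(n_rounds):
    # For MSE, the pseudo-residuals are the plain residuals y - F.
    residuals = y - F
    # Fit a shallow tree (the weak learner h) to those residuals.
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    # Update the ensemble: F <- F + learning_rate * h(X).
    F = F + learning_rate * h.predict(X)
    trees.append(h)

print("Training MSE:", np.mean((y - F) ** 2))
```

Each pass through the loop trains a new tree on whatever error remains, exactly as described above.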
The real power of Gradient Boosting comes from the fact that this process works for any differentiable loss function, not just MSE. For other loss functions, the negative gradient might not be the simple residual, but it serves the exact same purpose. It represents the direction and magnitude of the error for each data point that the next tree should try to correct.
For this reason, the targets that we calculate at each step (the negative gradients) are often called pseudo-residuals.
Consider a few examples:
- With squared error (MSE), the negative gradient is the ordinary residual, $y - \hat{y}$, as derived above.
- With absolute error (MAE), the negative gradient is $\operatorname{sign}(y - \hat{y})$. The pseudo-residuals will be either +1 or -1, directing the next model to increase or decrease its prediction.

By framing the problem as minimizing a loss function via gradient descent, Gradient Boosting becomes a highly flexible framework. You can choose a loss function that accurately reflects your specific problem's objective, and the algorithm's mechanics remain the same. The core task at each iteration is always to calculate the pseudo-residuals (the negative gradients) and fit a new weak learner to them. This generalization is what elevates Gradient Boosting from a clever trick for regression into a versatile machine learning powerhouse.
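To make that flexibility concrete, here is a sketch of how the residual-fitting loop from earlier could be parameterized by the loss: only the pseudo-residual (negative gradient) computation changes. The function names and default values here are hypothetical choices for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def pseudo_residuals_mse(y, F):
    # Negative gradient of 0.5 * (y - F)^2 with respect to F: the residual itself.
    return y - F

def pseudo_residuals_mae(y, F):
    # Negative gradient of |y - F| with respect to F: +1 or -1.
    return np.sign(y - F)

def fit_gradient_boosting(X, y, pseudo_residuals, n_rounds=50, learning_rate=0.1):
    # Start from a simple constant prediction. The loop never changes;
    # only the pseudo-residual function passed in does.
    F = np.full_like(y, y.mean())
    trees = []
    for _ in range(n_rounds):
        r = pseudo_residuals(y, F)                        # targets for the next tree
        h = DecisionTreeRegressor(max_depth=2).fit(X, r)  # fit the weak learner to them
        F = F + learning_rate * h.predict(X)              # step the ensemble forward
        trees.append(h)
    return trees, F

# Usage, with the toy X and y from the earlier sketch:
# trees, F = fit_gradient_boosting(X, y, pseudo_residuals_mae)
```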