For regression problems, Gradient Boosting typically fits new trees to the residual errors. Applying this technique to classification, however, requires a different approach. Classification targets are class labels (e.g., 0 or 1), so we cannot simply compute residuals like $y - \hat{y}$. Furthermore, the model's raw output is a continuous score, not a class label, which necessitates conversion into a probability.
To solve this, we adapt the framework to work with probabilities. The core idea is to have the model predict the log-odds of a positive class, a value that ranges from negative to positive infinity. We can then transform this log-odds score into a probability between 0 and 1 using the logistic (or sigmoid) function.
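To see this transformation in action, here is a minimal sketch in plain NumPy (the `sigmoid` helper is my own naming, not from any particular library) that maps a few raw log-odds scores to probabilities:

```python
import numpy as np

def sigmoid(log_odds):
    """Map a raw log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-log_odds))

# A raw score of 0.0 corresponds to a probability of 0.5;
# large positive scores approach 1, large negative scores approach 0.
for score in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f"log-odds {score:+.1f} -> probability {sigmoid(score):.3f}")
```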
This setup allows us to use a loss function better suited to classification: Log Loss, also known as Binary Cross-Entropy.
For a binary classification problem, where the true label $y$ is either 0 or 1, the Log Loss for a single observation is defined as:

$$L(y, p) = -\left[\, y \log(p) + (1 - y) \log(1 - p) \,\right]$$
Here, $p$ is the predicted probability of the positive class ($y = 1$).
Let's analyze this function:
The function assigns a near-zero loss to confident, correct predictions but penalizes confident, incorrect predictions with an unbounded loss, which is a desirable property for a classification loss function.
The Log Loss function penalizes predictions that are confidently wrong. When the true label is 1 (blue line), the loss approaches infinity as the predicted probability approaches 0. Similarly, when the true label is 0 (red line), the loss grows as the probability approaches 1.
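To make this asymmetry concrete, the small sketch below (plain NumPy; the function name is my own) evaluates the Log Loss for a true label of 1 at several predicted probabilities:

```python
import numpy as np

def log_loss_single(y, p, eps=1e-15):
    """Binary cross-entropy for one observation; eps guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# True label is 1: the loss is tiny for confident correct predictions
# and grows without bound as the predicted probability approaches 0.
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:.2f} -> loss = {log_loss_single(1, p):.3f}")
```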
As we learned, Gradient Boosting trains new models on the negative gradient of the loss function. For regression with MSE, this gradient was simply the residual error. Let's find the equivalent for classification using Log Loss.
First, let $F(x)$ be the raw output of our current ensemble model for an observation $x$. This output is in log-odds space. We convert it to a probability $p$ using the logistic function:

$$p = \frac{1}{1 + e^{-F(x)}}$$
Now, we need to find the derivative of the Log Loss function with respect to the model's raw output $F(x)$. Using the chain rule, this is:

$$\frac{\partial L}{\partial F(x)} = \frac{\partial L}{\partial p} \cdot \frac{\partial p}{\partial F(x)}$$
After working through the derivatives (a common exercise in machine learning courses), we arrive at a remarkably simple result:

$$\frac{\partial L}{\partial F(x)} = p - y$$
The negative gradient, which our next tree will be trained on, is therefore:

$$-\frac{\partial L}{\partial F(x)} = y - p$$
This result, $y - p$, is the pseudo-residual for classification. It is the difference between the actual label (0 or 1) and the model's current predicted probability. For example, if the true label is 1 and the model predicts a probability of 0.3, the pseudo-residual is $1 - 0.3 = 0.7$. The next tree will be trained to predict this value, pushing the overall model's prediction closer to the correct answer.
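The sketch below (assuming NumPy, with labels and raw scores chosen purely for illustration) computes these pseudo-residuals for a small batch of observations and, as a sanity check, compares the analytic gradient $p - y$ against a finite-difference estimate:

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def log_loss(y, f):
    """Log Loss expressed directly in terms of the raw log-odds score f."""
    p = sigmoid(f)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])      # true labels
f = np.array([-0.85, 0.4, 2.0, -1.3])   # current raw ensemble scores (log-odds)
p = sigmoid(f)                           # current predicted probabilities

pseudo_residuals = y - p                 # negative gradient: what the next tree fits
print("pseudo-residuals:", np.round(pseudo_residuals, 4))

# Finite-difference check that dL/dF really equals p - y
eps = 1e-6
numeric_grad = (log_loss(y, f + eps) - log_loss(y, f - eps)) / (2 * eps)
print("analytic grad:", np.round(p - y, 4))
print("numeric  grad:", np.round(numeric_grad, 4))
```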
This elegant outcome shows the power of the gradient boosting framework. By choosing the right loss function, we derive a "residual-like" quantity that allows the same sequential, error-correcting process to work for classification just as it did for regression.
The framework extends directly to multiclass classification problems. Instead of a single log-odds value, the model produces a vector of log-odds, one for each class. These are then converted into a probability distribution using the softmax function.
The corresponding loss function is Multinomial Deviance, often called Categorical Cross-Entropy. The process remains the same: calculate the negative gradient of this loss function for each class, resulting in a set of pseudo-residuals. A separate tree is typically trained for each class at each boosting iteration to predict these pseudo-residuals, updating the model's log-odds scores for every class. While the implementation details are more involved, the underlying mechanism of fitting trees to gradients remains identical.
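As a rough sketch of the multiclass case (plain NumPy, with hypothetical labels and scores of my choosing), the per-class pseudo-residuals are simply the one-hot encoded labels minus the softmax probabilities:

```python
import numpy as np

def softmax(scores):
    """Convert per-class log-odds into a probability distribution per row."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

n_classes = 3
y = np.array([0, 2, 1])                     # true class indices for 3 observations
raw_scores = np.array([[ 1.2, -0.3,  0.1],  # current per-class log-odds
                       [-0.5,  0.4,  0.9],
                       [ 0.0,  0.2, -1.1]])

probs = softmax(raw_scores)                 # shape (3, n_classes)
one_hot = np.eye(n_classes)[y]              # one-hot encode the labels

# One column of pseudo-residuals per class; at each boosting iteration,
# a separate tree is fit to each column.
pseudo_residuals = one_hot - probs
print(np.round(pseudo_residuals, 4))
```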