For regression problems, Gradient Boosting typically fits new trees to the residual errors. Applying this technique to classification, however, requires a different approach. Classification targets are class labels (e.g., 0 or 1), so we cannot simply compute residuals like $y - \hat{y}$. Furthermore, the model's raw output is a continuous score, not a class label, which necessitates conversion into a probability.
To solve this, we adapt the framework to work with probabilities. The core idea is to have the model predict the log-odds of a positive class, a value that ranges from negative to positive infinity. We can then transform this log-odds score into a probability between 0 and 1 using the logistic (or sigmoid) function.
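To see this transformation in action, here is a minimal sketch in plain NumPy (the `sigmoid` helper is my own naming, not from any particular library) that maps a few raw log-odds scores to probabilities:

```python
import numpy as np

def sigmoid(log_odds):
    """Map a raw log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-log_odds))

# A raw score of 0.0 corresponds to a probability of 0.5;
# large positive scores approach 1, large negative scores approach 0.
for score in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f"log-odds {score:+.1f} -> probability {sigmoid(score):.3f}")
```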
This setup allows us to use a loss function better suited to classification: Log Loss, also known as Binary Cross-Entropy.
For a binary classification problem, where the true label $y$ is either 0 or 1, the Log Loss for a single observation is defined as:

$$L(y, p) = -\left[\, y \log(p) + (1 - y) \log(1 - p) \,\right]$$
Here, $p$ is the predicted probability of the positive class ($y = 1$).
Let's analyze this function:
The function assigns a near-zero loss to confident, correct predictions but penalizes confident, incorrect predictions with an unbounded loss, which is a desirable property for a classification loss function.
The Log Loss function penalizes predictions that are confidently wrong. When the true label is 1 (blue line), the loss approaches infinity as the predicted probability approaches 0. Similarly, when the true label is 0 (red line), the loss grows as the probability approaches 1.
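To make this asymmetry concrete, the small sketch below (plain NumPy; the function name is my own) evaluates the Log Loss for a true label of 1 at several predicted probabilities:

```python
import numpy as np

def log_loss_single(y, p, eps=1e-15):
    """Binary cross-entropy for one observation; eps guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# True label is 1: the loss is tiny for confident correct predictions
# and grows without bound as the predicted probability approaches 0.
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:.2f} -> loss = {log_loss_single(1, p):.3f}")
```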
As we learned, Gradient Boosting trains new models on the negative gradient of the loss function. For regression with MSE, this gradient was simply the residual error. Let's find the equivalent for classification using Log Loss.
First, let $F(x)$ be the raw output of our current ensemble model for an observation $x$. This output is in log-odds space. We convert it to a probability $p$ using the logistic function:

$$p = \frac{1}{1 + e^{-F(x)}}$$
Now, we need to find the derivative of the Log Loss function with respect to the model's raw output $F(x)$. Using the chain rule, this is:

$$\frac{\partial L}{\partial F(x)} = \frac{\partial L}{\partial p} \cdot \frac{\partial p}{\partial F(x)}$$
After working through the derivatives (a common exercise in machine learning courses), we arrive at a remarkably simple result:

$$\frac{\partial L}{\partial F(x)} = p - y$$
The negative gradient, which our next tree will be trained on, is therefore:

$$-\frac{\partial L}{\partial F(x)} = y - p$$
This result, $y - p$, is the pseudo-residual for classification. It is the difference between the actual label (0 or 1) and the model's current predicted probability. For example, if the true label is 1 and the model predicts a probability of 0.3, the pseudo-residual is $1 - 0.3 = 0.7$. The next tree will be trained to predict this value, pushing the overall model's prediction closer to the correct answer.
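The sketch below (assuming NumPy, with labels and raw scores chosen purely for illustration) computes these pseudo-residuals for a small batch of observations and, as a sanity check, compares the analytic gradient $p - y$ against a finite-difference estimate:

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def log_loss(y, f):
    """Log Loss expressed directly in terms of the raw log-odds score f."""
    p = sigmoid(f)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])      # true labels
f = np.array([-0.85, 0.4, 2.0, -1.3])   # current raw ensemble scores (log-odds)
p = sigmoid(f)                           # current predicted probabilities

pseudo_residuals = y - p                 # negative gradient: what the next tree fits
print("pseudo-residuals:", np.round(pseudo_residuals, 4))

# Finite-difference check that dL/dF really equals p - y
eps = 1e-6
numeric_grad = (log_loss(y, f + eps) - log_loss(y, f - eps)) / (2 * eps)
print("analytic grad:", np.round(p - y, 4))
print("numeric  grad:", np.round(numeric_grad, 4))
```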
This elegant outcome shows the power of the gradient boosting framework. By choosing the right loss function, we derive a "residual-like" quantity that allows the same sequential, error-correcting process to work for classification just as it did for regression.
The framework extends directly to multiclass classification problems. Instead of a single log-odds value, the model produces a vector of log-odds, one for each class. These are then converted into a probability distribution using the softmax function.
The corresponding loss function is Multinomial Deviance, often called Categorical Cross-Entropy. The process remains the same: calculate the negative gradient of this loss function for each class, resulting in a set of pseudo-residuals. A separate tree is typically trained for each class at each boosting iteration to predict these pseudo-residuals, updating the model's log-odds scores for every class. While the implementation details are more involved, the underlying mechanism of fitting trees to gradients remains identical.
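As a rough sketch of the multiclass case (plain NumPy, with hypothetical labels and scores of my choosing), the per-class pseudo-residuals are simply the one-hot encoded labels minus the softmax probabilities:

```python
import numpy as np

def softmax(scores):
    """Convert per-class log-odds into a probability distribution per row."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

n_classes = 3
y = np.array([0, 2, 1])                     # true class indices for 3 observations
raw_scores = np.array([[ 1.2, -0.3,  0.1],  # current per-class log-odds
                       [-0.5,  0.4,  0.9],
                       [ 0.0,  0.2, -1.1]])

probs = softmax(raw_scores)                 # shape (3, n_classes)
one_hot = np.eye(n_classes)[y]              # one-hot encode the labels

# One column of pseudo-residuals per class; at each boosting iteration,
# a separate tree is fit to each column.
pseudo_residuals = one_hot - probs
print(np.round(pseudo_residuals, 4))
```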