While Mean Squared Error (MSE) and Mean Absolute Error (MAE) work well for regression problems where the goal is to predict a continuous value, they aren't the ideal choice for classification tasks. In classification, we're predicting discrete categories (e.g., 'cat' vs. 'dog', 'spam' vs. 'not spam', digit '0' through '9'). The output of our network for a classification problem is typically interpreted as a probability distribution over the possible classes. For instance, for a digit classifier, the output might be [0.05, 0.01, 0.8, 0.04, ..., 0.1], indicating an 80% probability that the input image is the digit '2'.
Measuring the error for classification involves comparing the predicted probability distribution to the true distribution. The true distribution is usually represented as a "one-hot" vector, where the correct class has a probability of 1 and all other classes have a probability of 0. For example, if the true digit is '2', the true distribution is [0, 0, 1, 0, 0, 0, 0, 0, 0, 0].
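As a quick aside, here is a minimal sketch of building such a one-hot vector with PyTorch's one_hot utility (the digit '2' and the class count of 10 simply mirror the example above):

import torch
import torch.nn.functional as F

# Turn a class index (digit '2') into a one-hot vector over 10 classes
true_class = torch.tensor(2)
one_hot_vector = F.one_hot(true_class, num_classes=10)
print(one_hot_vector)  # tensor([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])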
Cross-entropy is the standard loss function for classification problems because it measures the dissimilarity between two probability distributions. A lower cross-entropy value signifies that the predicted distribution is closer to the true distribution.
There are two main variants of cross-entropy loss used in practice: Binary Cross-Entropy and Categorical Cross-Entropy.
Binary Cross-Entropy (also known as Log Loss) is used for binary classification tasks, where there are only two possible output classes (e.g., 0 or 1, True or False, Spam or Not Spam).
Typically, in a binary classification network, the output layer consists of a single neuron with a Sigmoid activation function. The Sigmoid function squashes the output to a value between 0 and 1, which can be interpreted as the probability of the input belonging to the positive class (class 1). Let's call this predicted probability p. The probability of the input belonging to the negative class (class 0) is then 1−p.
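For instance, a minimal sketch of this mapping (the logit value below is an arbitrary example):

import torch

# A raw output (logit) from the single output neuron
logit = torch.tensor([1.5])

# Sigmoid maps the logit to a probability p in (0, 1)
p = torch.sigmoid(logit)
print(f"p (positive class): {p.item():.4f}")          # ~0.8176
print(f"1 - p (negative class): {1 - p.item():.4f}")  # ~0.1824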
Let y be the true label, where y=1 for the positive class and y=0 for the negative class. The Binary Cross-Entropy loss for a single training example is calculated as:
$$ L_{BCE} = -\left[ y \log(p) + (1 - y) \log(1 - p) \right] $$
Let's break this down: when y = 1, the second term vanishes and the loss reduces to −log(p); when y = 0, the first term vanishes and the loss reduces to −log(1−p). In either case, the loss is the negative logarithm of the probability the model assigned to the correct class.
This formula effectively penalizes the model more heavily for confident wrong predictions. The loss increases sharply as the predicted probability for the correct positive class approaches 0, and a similar curve applies for −log(1−p) when the true class is 0.
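A quick numerical sketch of this penalty for the y = 1 case, using a few illustrative probabilities (the specific values are arbitrary):

import torch

# When y = 1, the loss is simply -log(p)
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    loss = -torch.log(torch.tensor(p))
    print(f"p = {p:>4}: loss = {loss.item():.3f}")

# A confident correct prediction (p = 0.99) costs about 0.01,
# while a confident wrong prediction (p = 0.01) costs about 4.6.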
In PyTorch, you typically use torch.nn.BCELoss or, more commonly, torch.nn.BCEWithLogitsLoss. The latter is numerically more stable because it combines the Sigmoid layer and the BCE loss calculation in one step, applying mathematical tricks to avoid potential floating-point issues when probabilities get very close to 0 or 1.
import torch
import torch.nn as nn
# Example setup for Binary Classification
# Assume model output is raw logits (before Sigmoid)
model_output = torch.randn(5, 1) # 5 examples, 1 output logit each
true_labels = torch.randint(0, 2, (5, 1)).float() # 5 labels (0 or 1)
# Use BCEWithLogitsLoss (recommended)
loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(model_output, true_labels)
print(f"BCE With Logits Loss: {loss.item()}")
# Alternatively, apply Sigmoid first then use BCELoss (less stable)
# predicted_probs = torch.sigmoid(model_output)
# loss_fn_bce = nn.BCELoss()
# loss_bce = loss_fn_bce(predicted_probs, true_labels)
# print(f"BCE Loss: {loss_bce.item()}")
Categorical Cross-Entropy is used for multi-class classification tasks, where there are more than two possible output classes (e.g., classifying handwritten digits 0-9, identifying different types of objects in an image).
In multi-class classification, the network's final layer typically has one neuron for each class, and a Softmax activation function is applied. Softmax converts the raw outputs (logits) for each class into a probability distribution, where each probability is between 0 and 1, and all probabilities sum to 1.
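As a small sketch, here is Softmax applied to an arbitrary logit vector (the values are made up for illustration):

import torch

# Raw logits for a 4-class problem (arbitrary example values)
logits = torch.tensor([2.0, 1.0, 0.1, -1.0])

# Softmax converts the logits into a probability distribution
probs = torch.softmax(logits, dim=0)
print(probs)               # each entry lies between 0 and 1
print(probs.sum().item())  # sums to 1.0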
Let C be the number of classes. The network outputs a vector of predicted probabilities p = [p_1, p_2, ..., p_C], where p_i is the predicted probability for class i. The true label is usually represented as a one-hot encoded vector y = [y_1, y_2, ..., y_C], where y_i = 1 if i is the true class and y_i = 0 otherwise.
The Categorical Cross-Entropy loss for a single training example is calculated as:
$$ L_{CCE} = -\sum_{i=1}^{C} y_i \log(p_i) $$
Since only one element in the true label vector y is 1 (let's say y_k = 1 for the true class k) and all others are 0, the sum simplifies to:
$$ L_{CCE} = -y_k \log(p_k) = -\log(p_k) $$
The loss is simply the negative logarithm of the predicted probability for the correct class. To minimize the loss, the network must learn to assign the highest possible probability to the true class for each input example.
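A minimal sketch of this simplification, using made-up example distributions:

import torch

# Predicted distribution over 4 classes and a one-hot target for class 2
p = torch.tensor([0.1, 0.2, 0.6, 0.1])
y = torch.tensor([0.0, 0.0, 1.0, 0.0])

# The full sum over all classes...
full_sum = -(y * torch.log(p)).sum()
# ...equals the negative log-probability of the true class alone
true_class_only = -torch.log(p[2])

print(full_sum.item(), true_class_only.item())  # both ~0.511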
In PyTorch, the standard way to implement this is using torch.nn.CrossEntropyLoss. This module conveniently combines the Softmax activation and the Categorical Cross-Entropy calculation: specifically, it computes LogSoftmax followed by NLLLoss (Negative Log Likelihood Loss), which is equivalent but often more numerically stable. It expects the raw outputs (logits) from the model and the true labels as class indices (e.g., 0, 1, 2, ...) rather than one-hot encoded vectors.
import torch
import torch.nn as nn
# Example setup for Multi-class Classification
# Assume model output is raw logits (before Softmax)
# 5 examples, 10 classes (e.g., MNIST digits)
model_output_logits = torch.randn(5, 10)
# True labels as class indices (0 to 9)
true_labels_indices = torch.randint(0, 10, (5,))
# Use CrossEntropyLoss (combines LogSoftmax and NLLLoss)
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(model_output_logits, true_labels_indices)
print(f"Categorical Cross Entropy Loss: {loss.item()}")
# Manual calculation for illustration (less stable):
# softmax = nn.Softmax(dim=1)
# predicted_probs = softmax(model_output_logits)
# nll_loss_fn = nn.NLLLoss()
# # NLLLoss expects log-probabilities
# loss_nll = nll_loss_fn(torch.log(predicted_probs + 1e-9), true_labels_indices) # Add small epsilon for stability
# print(f"Manual NLL Loss (approx): {loss_nll.item()}")
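To sanity-check that description, the following sketch (reusing the tensors defined above) compares nn.CrossEntropyLoss against an explicit LogSoftmax followed by NLLLoss; the two results should agree up to floating-point precision:

# Verify that CrossEntropyLoss matches LogSoftmax + NLLLoss
log_softmax = nn.LogSoftmax(dim=1)
nll_loss_fn = nn.NLLLoss()
loss_manual = nll_loss_fn(log_softmax(model_output_logits), true_labels_indices)
print(torch.allclose(loss, loss_manual))  # True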
Using cross-entropy aligns the loss function directly with the goal of classification: maximizing the probability assigned to the correct class. Its mathematical properties also generally lead to more stable and efficient training compared to using regression losses like MSE for classification problems. Understanding how to choose and apply the correct cross-entropy variant is a fundamental aspect of training effective classification models.