Okay, we've seen that neural networks are composed of layers of interconnected neurons, using activation functions to introduce non-linearity. But how does a network actually learn? How does it go from random initial guesses to making useful predictions? The process starts with quantifying how wrong its predictions are. This is where loss functions come in.
Imagine you're learning to play darts. You throw a dart (the network makes a prediction), and it lands somewhere on the board. A loss function is like a scoring rule that tells you how far your dart ($\hat{y}$, the prediction) is from the bullseye ($y$, the actual target value). The goal, naturally, is to minimize this "distance" or error score over many throws.
In deep learning, a loss function, also known as a cost function or objective function, computes a single scalar value representing the discrepancy between the network's predicted outputs ($\hat{y}$) and the true target values ($y$) for a given set of input data. The entire training process revolves around minimizing this loss value. By systematically adjusting the network's weights and biases (which we'll cover with gradient descent and backpropagation shortly), we try to find the parameter settings that yield the lowest possible loss, meaning the predictions are as close to the actual values as possible according to our chosen metric.
The choice of loss function is not arbitrary; it depends fundamentally on the type of problem you are trying to solve, primarily whether it's a regression or a classification task.
Regression problems involve predicting continuous numerical values. Examples include predicting house prices, stock values, or temperature. For these tasks, the loss function measures the difference between the predicted number and the actual number.
Mean Squared Error is perhaps the most common loss function for regression. It calculates the average of the squared differences between the predicted values and the actual values.
The formula for MSE is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where:

- $n$ is the number of samples,
- $y_i$ is the actual target value for sample $i$,
- $\hat{y}_i$ is the network's predicted value for sample $i$.
Why square the difference? Squaring serves two purposes: it makes every error contribution positive, so that positive and negative errors don't cancel each other out, and it penalizes large errors disproportionately more than small ones. An error of 10 contributes 100 to the sum, while an error of 2 contributes only 4.
This strong penalty for large errors makes MSE sensitive to outliers. If your dataset contains a few points with unusually large errors, they can dominate the loss value and significantly influence the resulting model parameters during training. However, MSE is mathematically convenient, particularly because its derivative is easy to compute, which is helpful for gradient-based optimization methods.
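To make the computation concrete, here is a minimal NumPy sketch of MSE (deep learning frameworks provide built-in, optimized versions of this loss; the function name `mse` and the sample values are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
# Squared errors: 0.25, 0.0, 2.25, 1.0 -> mean = 0.875
print(mse(y_true, y_pred))  # 0.875
```

Note how the sample with error 1.5 contributes 2.25 to the sum, already more than the other three samples combined.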
Mean Absolute Error provides an alternative perspective on regression loss. Instead of squaring the differences, it takes their absolute value.
The formula for MAE is:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

Here, the error contribution scales linearly with the difference. A difference of 10 contributes 10 to the sum, and a difference of 2 contributes 2.
Because MAE doesn't square the errors, it is less sensitive to outliers compared to MSE. A single large error won't dominate the total loss quite as much. This makes MAE a potentially better choice if your dataset contains significant outliers that you don't want to overly influence the model. The main drawback is that the gradient of the absolute value function is undefined at zero and constant elsewhere, which can sometimes make optimization slightly less straightforward than with the smooth gradient of MSE, especially near the minimum.
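The effect of an outlier on the two losses can be seen directly with a small NumPy sketch (the data values are made up for illustration):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average of the absolute differences."""
    return np.mean(np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)))

def mse(y_true, y_pred):
    """Mean Squared Error, for comparison."""
    return np.mean((np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2)

# Same data as before, except the last target is now an outlier
y_true = [3.0, 5.0, 2.5, 100.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Absolute errors: 0.5, 0.0, 1.5, 92.0 -> MAE = 23.5
print(mae(y_true, y_pred))  # 23.5
# Squared errors: 0.25, 0.0, 2.25, 8464.0 -> MSE = 2116.625
print(mse(y_true, y_pred))  # 2116.625
```

The single outlier accounts for roughly 98% of the MAE sum but over 99.9% of the MSE sum, illustrating why MSE-trained models can be pulled strongly toward outliers.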
The plot below shows how the loss contribution increases with the absolute error ($|y - \hat{y}|$) for both MSE and MAE. Notice how MSE (blue line) curves upwards much faster than MAE (orange line), reflecting its stronger penalization of larger errors.
Comparison of loss values for Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the prediction error increases. MSE grows quadratically, while MAE grows linearly.
Classification problems involve assigning input data to one of several discrete categories or classes. Examples include identifying spam emails (spam/not spam), recognizing handwritten digits (0-9), or classifying images (cat/dog/bird).
For classification, we typically work with probabilities. The network outputs a probability distribution over the possible classes (e.g., 80% chance it's a cat, 15% dog, 5% bird). We need a loss function that measures how well the predicted probability distribution matches the actual distribution (where the true class has 100% probability and others have 0%). MSE and MAE are generally not suitable here because they don't effectively capture the notion of "distance" between probability distributions. The standard choice is Cross-Entropy Loss.
Cross-Entropy originates from information theory and provides a way to measure the difference between two probability distributions. In our context, these are the predicted probability distribution (from the network's output layer, often after a Sigmoid or Softmax activation) and the true probability distribution (representing the actual label). A lower cross-entropy value indicates that the predicted distribution is closer to the true distribution.
The specific form of cross-entropy loss depends on the number of classes.
Used for binary classification tasks where there are only two possible outcome classes (e.g., 0 or 1, True or False, Spam or Not Spam). This loss function is typically used in conjunction with a Sigmoid activation function in the output layer, as Sigmoid squashes the output to a value between 0 and 1, representing the probability of the positive class (class 1).
The formula for Binary Cross-Entropy (also called Log Loss) is:

$$\mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

where:

- $n$ is the number of samples,
- $y_i$ is the true label for sample $i$ (0 or 1),
- $\hat{y}_i$ is the predicted probability that sample $i$ belongs to class 1.
Let's break this down for a single sample ($n=1$):

- If the true label is $y = 1$, the second term vanishes and the loss reduces to $-\log(\hat{y})$. The loss approaches 0 as $\hat{y} \to 1$ and grows without bound as $\hat{y} \to 0$.
- If the true label is $y = 0$, the first term vanishes and the loss reduces to $-\log(1 - \hat{y})$. The loss approaches 0 as $\hat{y} \to 0$ and grows without bound as $\hat{y} \to 1$.
BCE effectively penalizes predictions that are confidently wrong.
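This penalty for confident mistakes is easy to see numerically. Below is a minimal NumPy sketch of BCE; the small `eps` clipping is a common practical guard against taking $\log(0)$ (the function name and sample probabilities are illustrative):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary Cross-Entropy (log loss), averaged over samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from exactly 0 or 1 to keep log() finite
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# Confident and correct: small loss
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # -log(0.9) ~ 0.105
# Confident but wrong: much larger loss
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # -log(0.1) ~ 2.303
```

Moving the predicted probability from 0.9 to 0.1 on the wrong side multiplies the loss by more than 20, which is exactly the behavior that pushes the network toward calibrated confidence.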
Used for multi-class classification tasks where each sample belongs to one of C possible classes (C>2). This loss function is typically paired with a Softmax activation function in the output layer. Softmax converts the network's raw output scores (logits) for each class into a probability distribution where all probabilities sum to 1.
The formula for Categorical Cross-Entropy is:

$$\mathrm{CCE} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{C} y_{ij}\log(\hat{y}_{ij})$$

where:

- $n$ is the number of samples and $C$ is the number of classes,
- $y_{ij}$ is 1 if sample $i$ truly belongs to class $j$ and 0 otherwise (the one-hot encoded label, e.g. [0, 0, 1, 0] if the true class is the 3rd out of 4),
- $\hat{y}_{ij}$ is the predicted probability that sample $i$ belongs to class $j$.

For a single sample $i$, because only one $y_{ij}$ value is 1 (for the true class $k$) and the rest are 0, the inner sum simplifies to just $-\log(\hat{y}_{ik})$. That is, the loss for that sample is simply the negative logarithm of the probability the network assigned to the correct class. The network is penalized based on how low the probability assigned to the correct class is.
Note: Sometimes you might encounter "Sparse Categorical Cross-Entropy". This is mathematically the same as CCE but accepts integer labels (e.g., 2 for the 3rd class) instead of requiring one-hot encoded labels, which can be more memory-efficient.
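Both variants can be sketched in a few lines of NumPy, which also makes their equivalence explicit (the function names, sample probabilities, and `eps` guard are illustrative; the probability rows are assumed to come from a Softmax output):

```python
import numpy as np

def categorical_cross_entropy(y_true_onehot, y_pred, eps=1e-12):
    """CCE with one-hot labels: -mean over samples of sum_j y_ij * log(y_hat_ij)."""
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return -np.mean(np.sum(np.asarray(y_true_onehot) * np.log(y_pred), axis=1))

def sparse_categorical_cross_entropy(y_true_int, y_pred, eps=1e-12):
    """Same loss, but labels are class indices instead of one-hot vectors."""
    y_true_int = np.asarray(y_true_int)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    # Pick out the predicted probability of the true class for each sample
    rows = np.arange(len(y_true_int))
    return -np.mean(np.log(y_pred[rows, y_true_int]))

# Two samples, three classes; each row sums to 1 (Softmax-style output)
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8]])
onehot = np.array([[1, 0, 0],
                   [0, 0, 1]])
sparse = np.array([0, 2])

print(categorical_cross_entropy(onehot, y_pred))         # mean of -log(0.7), -log(0.8)
print(sparse_categorical_cross_entropy(sparse, y_pred))  # same value
```

The sparse version never materializes the one-hot matrix, which is why it is preferred when the number of classes is large.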
Selecting the appropriate loss function is fundamental for successful model training. The primary guideline is the nature of your task: for regression, use MSE (or MAE when outliers should carry less weight); for binary classification, use Binary Cross-Entropy with a Sigmoid output; for multi-class classification, use Categorical Cross-Entropy (or its sparse variant) with a Softmax output.
Now that we have a way to precisely measure how well our network is performing using a loss function, the next step is to figure out how to adjust the network's parameters (weights and biases) to minimize this loss. This optimization process is the subject of the rest of this chapter, starting with the core algorithm: Gradient Descent.
© 2025 ApX Machine Learning