Before a neural network can learn, we need a way to precisely measure how well it's performing on a given task. Is it predicting house prices accurately? Is it correctly classifying images of cats and dogs? This measurement is the role of the loss function, also sometimes called a cost function or objective function.
Think of the loss function as grading the network's predictions. It takes the network's outputs (the predictions, often denoted $\hat{y}$) and compares them to the true target values ($y$) from our dataset. The result is a single scalar value representing the "error" or "loss" for those predictions. A perfect prediction would result in zero loss, while larger errors lead to higher loss values.
The goal of training is to minimize this loss value. The entire machinery of gradient descent and backpropagation, which we'll discuss next, is geared towards adjusting the network's weights and biases in a way that systematically reduces the output of the chosen loss function.
The specific loss function you choose is significant and depends heavily on the type of problem you're trying to solve, primarily whether it's a regression or a classification task.
Regression problems involve predicting continuous numerical values, such as the price of a house, tomorrow's temperature, or the value of a stock.
The most common loss function for regression is the Mean Squared Error (MSE). It calculates the average of the squared differences between the predicted values and the actual values.
For a dataset with N examples, the MSE is calculated as:
$$L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

Where:
- $N$ is the number of examples in the dataset,
- $y_i$ is the true target value for example $i$,
- $\hat{y}_i$ is the network's predicted value for example $i$.
Why square the difference? Squaring serves two purposes: it makes every error positive, so positive and negative errors cannot cancel each other out, and it penalizes large errors far more heavily than small ones. It also gives the loss a smooth gradient, which works well with gradient descent.
Consideration: Because MSE squares the error term, it can be sensitive to outliers. A single data point with a very large error can dominate the loss value and potentially skew the training process.
Larger prediction errors (positive or negative) therefore contribute quadratically increasing amounts to the MSE loss.
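To make the calculation concrete, here is a minimal NumPy sketch of MSE for a small batch of predictions; the array values are made up purely for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Illustrative house-price targets and predictions (made-up values)
y_true = [200.0, 310.0, 150.0, 420.0]
y_pred = [210.0, 300.0, 155.0, 380.0]

print(mean_squared_error(y_true, y_pred))  # 456.25
```

Notice how the last example, with an error of 40, contributes 1600 to the sum, dwarfing the other three errors combined.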
An alternative for regression is the Mean Absolute Error (MAE). It calculates the average of the absolute differences between predictions and targets.
$$L_{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

MAE measures the average magnitude of the errors without considering their direction. Unlike MSE, it penalizes errors linearly. This makes MAE less sensitive to outliers compared to MSE, which might be beneficial in datasets with significant noise or anomalies. However, its gradient is constant (except at zero, where it is undefined), which can sometimes lead to issues with convergence when using gradient descent, especially when the loss is close to the minimum.
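A matching sketch for MAE, using the same made-up values as before, shows how the linear penalty reduces the influence of the one large error:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Same illustrative values as the MSE example
y_true = [200.0, 310.0, 150.0, 420.0]
y_pred = [210.0, 300.0, 155.0, 380.0]

print(mean_absolute_error(y_true, y_pred))  # 16.25
```

Here the outlier error of 40 contributes only 40 to the sum rather than 1600, so it no longer dominates the loss.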
Classification problems involve predicting a discrete category or class label, such as identifying spam emails, classifying images, or predicting customer churn (yes/no).
When dealing with binary classification (two possible output classes, typically labeled 0 and 1), the standard loss function is Binary Cross-Entropy, often called Log Loss. It's typically used when the output layer of the network has a single neuron with a Sigmoid activation function, which squashes the output to a probability between 0 and 1.
The formula for Binary Cross-Entropy for N examples is:
$$L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

Where:
- $y_i$ is the true label (0 or 1) for example $i$,
- $\hat{y}_i$ is the predicted probability that example $i$ belongs to class 1.
Let's break this down:
- When the true label $y_i = 1$, the second term vanishes and the loss for that example is $-\log(\hat{y}_i)$. A prediction close to 1 gives a loss near zero, while a prediction close to 0 drives the loss towards infinity.
- When the true label $y_i = 0$, the first term vanishes and the loss is $-\log(1 - \hat{y}_i)$. A prediction close to 0 gives a loss near zero, while a prediction close to 1 drives the loss towards infinity.

In other words, Binary Cross-Entropy heavily penalizes predictions that are confidently wrong.
Binary Cross-Entropy loss increases dramatically as the predicted probability moves away from the true label (0 or 1).
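As an illustration, here is a small NumPy sketch of Binary Cross-Entropy. The predicted probabilities are assumed to come from a Sigmoid output, and the values are made up; a small epsilon guards against taking the log of exactly 0 or 1.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average negative log-likelihood for binary labels (0 or 1)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so log() never receives exactly 0 or 1
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Made-up labels and Sigmoid outputs
y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.1, 0.6, 0.8]  # the last prediction is confidently wrong

print(binary_cross_entropy(y_true, y_pred))  # ~0.583
```

The single confidently wrong prediction (probability 0.8 for a true label of 0) contributes $-\log(0.2) \approx 1.61$, far more than the three better predictions combined.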
For multi-class classification problems (more than two classes), we use Categorical Cross-Entropy. This requires the true labels $y$ to be in a one-hot encoded format (e.g., with three classes, a label for the third class would be encoded as [0, 0, 1]). The network's output layer typically uses a Softmax activation function, which produces a probability distribution across all $C$ classes, ensuring the predicted probabilities sum to 1.
The formula for Categorical Cross-Entropy for N examples and C classes is:
$$L_{CCE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij})$$

Where:
- $y_{ij}$ is 1 if example $i$ belongs to class $j$ and 0 otherwise (the one-hot encoded true label),
- $\hat{y}_{ij}$ is the predicted probability that example $i$ belongs to class $j$.
Because yij is 1 for only the single correct class and 0 for all others, the inner sum simplifies to just −log(y^ik) where k is the index of the true class for data point i. The loss function effectively measures how low the predicted probability is for the actual correct class. A higher predicted probability for the true class results in a lower loss.
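The following NumPy sketch shows this simplification in action for two made-up examples with three classes; the one-hot labels and Softmax-style probability rows are assumptions chosen only for illustration.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average cross-entropy between one-hot labels and per-example probability distributions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    # Only the log-probability of the true class survives the inner sum
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Made-up one-hot labels (3 classes) and Softmax-style outputs that sum to 1 per row
y_true = np.array([[0, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[0.1, 0.2, 0.7],   # fairly confident and correct
                   [0.3, 0.4, 0.3]])  # correct class, but low confidence

print(categorical_cross_entropy(y_true, y_pred))  # ~0.636
```

The loss per example reduces to $-\log(0.7) \approx 0.357$ and $-\log(0.4) \approx 0.916$; the low-confidence prediction is penalized more even though it picks the correct class.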
Selecting the appropriate loss function is fundamental to successful model training (see the framework sketch after this list):
- For regression tasks, Mean Squared Error is the usual default; Mean Absolute Error is a more robust alternative when outliers are a concern.
- For binary classification with a single Sigmoid output, use Binary Cross-Entropy.
- For multi-class classification with a Softmax output and one-hot encoded labels, use Categorical Cross-Entropy.
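In practice, deep learning frameworks provide these losses built in, so the choice often amounts to a single argument. As a minimal sketch assuming the Keras API (the architecture here is hypothetical and only for illustration):

```python
from tensorflow import keras

# Hypothetical model: 8 input features, one Sigmoid output for binary classification
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Swap the loss string to match the task:
#   "mse" or "mae"              -> regression
#   "binary_crossentropy"       -> binary classification (Sigmoid output)
#   "categorical_crossentropy"  -> multi-class classification (Softmax output, one-hot labels)
model.compile(optimizer="adam", loss="binary_crossentropy")
```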
The loss function provides the error signal that drives learning. By calculating how far the network's predictions deviate from the true targets, it sets the stage for the optimization process. The next step is understanding how algorithms like gradient descent use this error signal to adjust the network's parameters.