Before a neural network can learn, it needs a way to measure how wrong its predictions are. This measurement is the job of the loss function (also known as the cost function or objective function). Think of it as a score that quantifies the difference between the model's predictions and the actual target values in the training data. The goal of training is to adjust the network's weights to make this score as low as possible.
During the training process, the loss function's value for the current batch of data is calculated after the forward pass (making predictions). This loss value is then used by the optimization algorithm (which we'll discuss next) during backpropagation to calculate the gradients, essentially determining the direction and magnitude of weight adjustments needed to reduce the error. A well-chosen loss function guides the network effectively towards making better predictions.
The choice of loss function is not arbitrary; it directly depends on the type of problem you are trying to solve. Using an inappropriate loss function can lead to poor model performance, even if the network architecture is sound. Let's look at the standard loss functions for common machine learning tasks.
Regression problems involve predicting continuous numerical values, such as predicting house prices, temperature, or stock values.
Mean Squared Error is perhaps the most common loss function for regression. It calculates the average of the squared differences between the predicted values ($\hat{y}_i$) and the true values ($y_i$):

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

Here, $N$ is the number of samples in the batch. Squaring the difference has two main effects: it keeps every error contribution positive, so positive and negative errors do not cancel out, and it penalizes large errors far more heavily than small ones, which makes MSE sensitive to outliers.
In Keras, you specify MSE using the string identifier 'mean_squared_error' or 'mse', or by importing and using the class keras.losses.MeanSquaredError.
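To make the formula concrete, the short sketch below (with made-up numbers for a batch of four predictions) computes MSE by hand with NumPy and with keras.losses.MeanSquaredError; the two results agree.

import numpy as np
import keras

# Hypothetical true values and predictions for a batch of 4 samples
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# Manual MSE: average of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)

# Keras built-in MSE (averaged over the batch by default)
mse_keras = keras.losses.MeanSquaredError()(y_true, y_pred)

print(mse_manual)        # 0.375
print(float(mse_keras))  # 0.375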
Mean Absolute Error calculates the average of the absolute differences between predictions and true values:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

MAE measures the average magnitude of the errors without considering their direction. Unlike MSE, it does not disproportionately penalize large errors, making it more robust to outliers. If your dataset contains significant outliers that you don't want to dominate the loss calculation, MAE might be a better choice.
In Keras, use the identifier 'mean_absolute_error' or 'mae', or the class keras.losses.MeanAbsoluteError.
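To see the robustness difference in numbers, the sketch below (the same hypothetical batch, but with the last target turned into an outlier) compares the two losses: the single outlier inflates MSE by orders of magnitude, while MAE grows only linearly.

import numpy as np
import keras

# Same predictions as before, but the last target is an extreme outlier
y_true = np.array([3.0, -0.5, 2.0, 100.0])
y_pred = np.array([2.5,  0.0, 2.0,   8.0])

mae = keras.losses.MeanAbsoluteError()(y_true, y_pred)
mse = keras.losses.MeanSquaredError()(y_true, y_pred)

print(float(mae))  # 23.25    -- the outlier adds |92| / 4 = 23 to the average
print(float(mse))  # 2116.125 -- the outlier adds 92^2 / 4 = 2116 to the average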
Comparison of MSE and MAE loss for predicting a target value of 2. Note how the loss increases much more steeply for MSE as the prediction moves further away (e.g., prediction = 10), highlighting its sensitivity to large errors or outliers compared to MAE.
Classification problems involve predicting a discrete category or class label. The choice of loss function depends on whether it's binary (two classes) or multi-class (more than two classes), and how the labels are formatted.
Used for binary classification problems where there are only two possible outcome classes (e.g., spam/not spam, cat/dog). Typically, the network's final layer for this task has a single output unit with a sigmoid activation function, producing a probability between 0 and 1. The target labels should be 0 or 1.
Binary crossentropy measures the dissimilarity between the true label ($y$, which is 0 or 1) and the predicted probability ($\hat{y}$). The formula for a single prediction is:

$$Loss = -\left(y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})\right)$$

The loss is low when the predicted probability $\hat{y}$ is close to the true label $y$, and high otherwise. For example, if the true label $y = 1$, the loss simplifies to $-\log(\hat{y})$. This value approaches 0 as $\hat{y}$ approaches 1, and approaches infinity as $\hat{y}$ approaches 0. Conversely, if $y = 0$, the loss is $-\log(1 - \hat{y})$, which penalizes predictions close to 1.
In Keras, use the identifier 'binary_crossentropy' or the class keras.losses.BinaryCrossentropy.
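As a quick numerical check (a minimal sketch with made-up labels and probabilities), the snippet below evaluates the formula directly and with keras.losses.BinaryCrossentropy; confident correct predictions contribute a loss near zero, while poor predictions are penalized heavily.

import numpy as np
import keras

# Hypothetical true labels (0 or 1) and sigmoid-output probabilities
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.2, 0.8])

# Manual binary crossentropy, averaged over the batch
bce_manual = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Keras built-in version (clips probabilities slightly for numerical safety)
bce_keras = keras.losses.BinaryCrossentropy()(y_true, y_pred)

# The two confident, correct predictions contribute -log(0.9) ~= 0.105 each;
# the two poor ones contribute -log(0.2) ~= 1.609 each, so the mean is ~0.857
print(bce_manual)
print(float(bce_keras))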
Used for multi-class classification problems where each sample belongs to exactly one of three or more classes (e.g., classifying images of digits 0-9). This loss function expects the target labels to be one-hot encoded. A one-hot encoded label is a vector the same length as the number of classes, containing all zeros except for a single 1 at the index corresponding to the true class. For example, if classes are 'cat', 'dog', 'bird', the label for 'dog' would be [0, 1, 0].
The network's final layer for this task typically has C output units (where C is the number of classes) and uses a softmax activation function. Softmax ensures the outputs are probabilities that sum to 1 across all classes.
Categorical crossentropy compares the distribution of the predicted probabilities ($\hat{y}$) against the true distribution ($y$, the one-hot vector). The formula for a single sample is:

$$Loss = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

Here, $C$ is the number of classes, $y_c$ is 1 if $c$ is the true class and 0 otherwise, and $\hat{y}_c$ is the predicted probability for class $c$. Since only one $y_c$ is 1 (the true class), the sum effectively reduces to $-\log(\hat{y}_{\text{true class}})$. It penalizes the model heavily if it assigns a low probability to the correct class.
In Keras, use 'categorical_crossentropy' or keras.losses.CategoricalCrossentropy.
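The sketch below (hypothetical one-hot labels and softmax outputs for the three-class 'cat'/'dog'/'bird' example above) shows that the per-sample loss is simply the negative log of the probability assigned to the true class.

import numpy as np
import keras

# Classes: 'cat' = index 0, 'dog' = index 1, 'bird' = index 2
# One-hot labels for two samples: a dog and a bird
y_true = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])

# Hypothetical softmax outputs (each row sums to 1)
y_pred = np.array([[0.1, 0.8, 0.1],    # confident and correct
                   [0.3, 0.5, 0.2]])   # only 0.2 on the true class

cce = keras.losses.CategoricalCrossentropy()(y_true, y_pred)

# Per-sample losses are -log(0.8) ~= 0.223 and -log(0.2) ~= 1.609,
# so the batch-averaged loss is about 0.916
print(float(cce))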
This loss function serves the same purpose as categorical crossentropy (multi-class classification) but is used when the target labels are provided as integers rather than one-hot encoded vectors. For example, instead of [0, 1, 0] for 'dog', the label would simply be the integer 1 (assuming 'cat' is 0, 'dog' is 1, 'bird' is 2).
Using sparse categorical crossentropy is often more convenient as it avoids the need to explicitly convert integer labels to one-hot vectors. The calculation performed is mathematically equivalent to categorical crossentropy; Keras handles the conversion internally. Like categorical crossentropy, it requires the model's final layer to use a softmax activation.
In Keras, use 'sparse_categorical_crossentropy' or keras.losses.SparseCategoricalCrossentropy. Choose between categorical_crossentropy and sparse_categorical_crossentropy based purely on the format of your target labels: one-hot vectors require the former, while integer labels require the latter.
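The equivalence is easy to verify: the sketch below reuses the predictions from the one-hot example but supplies the labels as integers, and keras.losses.SparseCategoricalCrossentropy returns the same value.

import numpy as np
import keras

# Same two samples as before, but labels given as integers: dog = 1, bird = 2
y_true_int = np.array([1, 2])
y_pred = np.array([[0.1, 0.8, 0.1],
                   [0.3, 0.5, 0.2]])

sparse_cce = keras.losses.SparseCategoricalCrossentropy()(y_true_int, y_pred)

# Matches the categorical crossentropy value from the one-hot example (~0.916)
print(float(sparse_cce))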
You select the loss function when you compile your model using the compile() method. You can specify it using its string identifier (most common) or by instantiating a loss class from keras.losses.
import keras

# In practice, `model` would be a Keras model whose output layer matches the
# task (a single linear unit for regression, a sigmoid unit for binary
# classification, or a softmax layer for multi-class classification).
# A minimal placeholder model so the snippet runs as written:
model = keras.Sequential([keras.layers.Dense(1)])

# Example for a regression model
model.compile(optimizer='adam',
              loss='mean_squared_error')  # Using string identifier for MSE

# Example for a binary classification model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',  # Using string identifier
              metrics=['accuracy'])

# Example for multi-class classification with integer labels
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(),  # Using class instance
              metrics=['accuracy'])

# Example for multi-class classification with one-hot labels
model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # Using string identifier
              metrics=['accuracy'])
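One practical reason to instantiate a loss class rather than pass a string is that the constructors accept configuration arguments. For example, the Keras crossentropy losses take a from_logits argument: if the model's final layer outputs raw scores with no sigmoid or softmax activation, passing from_logits=True tells the loss to apply that transformation internally, which is numerically more stable. A brief sketch, assuming a model whose final layer has no activation:

# Assumes `model` ends in a Dense layer with no activation (raw logits)
model.compile(optimizer='adam',
              loss=keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])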
Choosing the correct loss function is a foundational step in configuring the training process. It defines the quantity the model will strive to minimize, directly influencing how the model learns from the data. With the loss function defined, the next piece of the puzzle is the optimization algorithm, which dictates how the model's weights are updated based on the calculated loss.