Once you've designed the basic architecture of your encoder and decoder and decided on the dimensionality of your latent space, the next important step is to choose an appropriate loss function. The loss function, also known as a cost function or objective function, quantifies how far the autoencoder's reconstruction is from the original input. During training, the autoencoder's weights are adjusted to minimize this loss, effectively learning to create better and better reconstructions, and consequently, more meaningful latent representations.
The choice of loss function is not arbitrary; it primarily depends on the type and characteristics of the data you are working with and the desired properties of the reconstruction. Let's look at the most common choices.
When your input data consists of continuous, real-valued numbers, such as pixel intensities in an image (after normalization) or measurements in a tabular dataset, the Mean Squared Error (MSE) is a very common and effective choice.
MSE measures the average squared difference between the original input values and the reconstructed values. For a single data instance with D features (e.g., D pixels or D columns in tabular data), if x_j is the j-th feature of the original input and x_j' is the j-th feature of the reconstructed output, the MSE is calculated as:
L_{MSE} = \frac{1}{D} \sum_{j=1}^{D} (x_j - x_j')^2
The total loss for a batch of N data instances would then be the average of these individual MSE values.
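As a quick sanity check, here is a minimal sketch (using PyTorch, with arbitrary numbers) that computes the per-instance MSE both by hand and with the built-in helper:

import torch
import torch.nn.functional as F

# One instance with D = 4 features (values are purely illustrative)
x = torch.tensor([0.2, 0.5, 0.9, 0.1])        # original input
x_hat = torch.tensor([0.25, 0.4, 0.8, 0.15])  # reconstruction

mse_manual = ((x - x_hat) ** 2).mean()        # mean of squared per-feature differences
mse_builtin = F.mse_loss(x_hat, x)            # averages over all elements by default

print(mse_manual.item(), mse_builtin.item())  # both ≈ 0.00625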
Why MSE?
MSE is simple to compute, smoothly differentiable, and its quadratic penalty strongly discourages large reconstruction errors on any single feature, which tends to produce reconstructions that are close to the original on average.
When using MSE, the output layer of your decoder typically uses a linear activation function if the target values are unbounded, or a tanh activation if the inputs are normalized to a range like (-1, 1), since tanh produces outputs in that same range. If your continuous data is normalized to [0, 1], a sigmoid activation can also be used with MSE, though Binary Cross-Entropy is often preferred in that specific [0, 1] scenario for its probabilistic interpretation.
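For example, in Keras the decoder's final layer might be defined as follows, depending on how the inputs were scaled (a minimal sketch; the feature count and bottleneck size are placeholder values):

import tensorflow as tf

D = 784                                # number of input features (placeholder)
h = tf.keras.Input(shape=(32,))        # hypothetical 32-dimensional bottleneck

# Unbounded or standardized inputs: linear output, trained with MSE
out_linear = tf.keras.layers.Dense(D, activation='linear')(h)
# Inputs scaled to [-1, 1]: tanh output matches that range
out_tanh = tf.keras.layers.Dense(D, activation='tanh')(h)
# Inputs scaled to [0, 1]: sigmoid output (MSE works, though BCE is often preferred)
out_sigmoid = tf.keras.layers.Dense(D, activation='sigmoid')(h)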
Another option for continuous data is the Mean Absolute Error (MAE). It measures the average absolute difference between the original and reconstructed values:
L_{MAE} = \frac{1}{D} \sum_{j=1}^{D} |x_j - x_j'|
MSE vs. MAE
MAE is generally less sensitive to outliers compared to MSE. If your dataset contains significant outliers that you don't want to overly influence the training process, MAE might be a better choice. However, MSE's stronger penalization of large errors can sometimes lead to reconstructions that are, on average, closer to the original for well-behaved data. The choice often comes down to empirical testing or specific domain requirements.
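To make the outlier point concrete, here is a small sketch (arbitrary numbers) in which a single badly reconstructed feature dominates MSE far more than MAE:

import torch
import torch.nn.functional as F

x     = torch.tensor([0.1, 0.2, 0.3, 0.4, 5.0])   # last feature is an outlier
x_hat = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5])   # reconstruction misses it badly

print(F.mse_loss(x_hat, x).item())  # ≈ 4.05: the squared error 4.5**2 dominates the mean
print(F.l1_loss(x_hat, x).item())   # ≈ 0.9: the same error contributes only 4.5 to the sum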
The following chart illustrates how MSE applies a much larger penalty to larger errors compared to MAE. It shows the loss value assigned by MSE and MAE as a function of the error between a single predicted value and its target (assumed to be 0 for simplicity). Notice the quadratic increase for MSE versus the linear increase for MAE.
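If you want to reproduce this comparison yourself, a short matplotlib sketch (purely illustrative) does the job:

import numpy as np
import matplotlib.pyplot as plt

error = np.linspace(-3, 3, 200)            # prediction minus target for one feature
plt.plot(error, error ** 2, label='MSE (squared error)')
plt.plot(error, np.abs(error), label='MAE (absolute error)')
plt.xlabel('error (prediction - target)')
plt.ylabel('per-feature loss')
plt.legend()
plt.show()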
When your input data is binary (e.g., black and white images where pixels are 0 or 1) or can be interpreted as probabilities (e.g., pixel intensities normalized strictly between 0 and 1), Binary Cross-Entropy (BCE) is typically the preferred loss function.
Binary Cross-Entropy, also known as log loss, measures the dissimilarity between two probability distributions. In the context of autoencoders, it compares the distribution of the original binary/probabilistic input data with the distribution of the reconstructed output, which is usually passed through a sigmoid activation function to ensure its values are also between 0 and 1.
For a single data instance with D features, where x_j is the original value and x_j' is the reconstructed probability for the j-th feature, the BCE loss is:
L_{BCE} = -\frac{1}{D} \sum_{j=1}^{D} \left[ x_j \log(x_j') + (1 - x_j) \log(1 - x_j') \right]
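As with MSE, the formula is easy to verify directly, assuming the reconstructions have already passed through a sigmoid (the numbers below are arbitrary):

import torch
import torch.nn.functional as F

x     = torch.tensor([1.0, 0.0, 1.0, 0.0])   # binary targets
x_hat = torch.tensor([0.9, 0.2, 0.7, 0.1])   # sigmoid outputs in (0, 1)

# Manual BCE, averaged over the D features
bce_manual = -(x * torch.log(x_hat) + (1 - x) * torch.log(1 - x_hat)).mean()
bce_builtin = F.binary_cross_entropy(x_hat, x)

print(bce_manual.item(), bce_builtin.item())  # both ≈ 0.198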
Why BCE?
BCE treats each reconstructed feature as the parameter of a Bernoulli distribution, making it a natural fit for binary or probability-like data. Paired with a sigmoid output layer, it heavily penalizes confident but wrong reconstructions (such as predicting 0.99 when the true value is 0) and typically provides stronger gradients than MSE in this setting.
Important Note for BCE:
BCE assumes that both the original values and the reconstructed values lie in [0, 1]. In practice, this means the input data must be scaled to that range and the decoder's output layer should use a sigmoid activation; reconstructed values of exactly 0 or 1 would make the logarithms undefined, which frameworks handle by clipping internally.
As highlighted, the choice of loss function is tightly coupled with the activation function used in the output layer of the decoder:

MSE or MAE: pair with a linear output activation for unbounded or standardized data, or with a tanh activation if your target data is normalized to the range [-1, 1], since tanh outputs values in this range.
BCE: pair with a sigmoid output activation so the reconstructed values lie in [0, 1], matching the targets.

Mismatched loss functions and output activations can lead to poor training performance or nonsensical results. For instance, using BCE with a linear output layer that produces values outside [0, 1] will cause errors or lead the model to learn incorrectly. An example of handling this in PyTorch is shown below.
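In PyTorch, for instance, this coupling is visible in which loss class you choose: torch.nn.BCELoss expects probabilities that have already passed through a sigmoid, while torch.nn.BCEWithLogitsLoss applies the sigmoid internally and accepts raw (linear) decoder outputs, which is also more numerically stable. A minimal sketch with placeholder values:

import torch
import torch.nn as nn

targets = torch.tensor([[1.0, 0.0, 1.0]])
logits  = torch.tensor([[2.0, -1.0, 0.5]])   # raw, unbounded decoder outputs

# Option 1: apply the sigmoid yourself, then use BCELoss
probs = torch.sigmoid(logits)
loss_a = nn.BCELoss()(probs, targets)

# Option 2: let BCEWithLogitsLoss apply the sigmoid internally
loss_b = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_a.item(), loss_b.item())          # identical up to floating-point error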
Most deep learning frameworks make it easy to specify your chosen loss function.
In TensorFlow/Keras: You can pass the string identifier or the class instance to the compile method of your model:
import tensorflow as tf  # needed for the class-instance variants below

# 'model' is assumed to be an already-defined tf.keras.Model

# For MSE
model.compile(optimizer='adam', loss='mean_squared_error')
# or
# model.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError())

# For BCE
model.compile(optimizer='adam', loss='binary_crossentropy')
# or
# model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy())
In PyTorch: You instantiate the loss function class and then call it during your training loop:
import torch

# For MSE
criterion = torch.nn.MSELoss()
# ... in training loop:
# loss = criterion(reconstructions, original_inputs)

# For BCE
criterion = torch.nn.BCELoss()
# ... in training loop (ensure reconstructions are sigmoid outputs):
# loss = criterion(reconstructions, original_inputs)
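Putting these pieces together, here is a minimal sketch of one PyTorch training step for an autoencoder trained with MSE; the architecture sizes and the random batch are placeholders:

import torch
import torch.nn as nn

# Tiny illustrative autoencoder: 20 input features, 4-dimensional bottleneck
model = nn.Sequential(
    nn.Linear(20, 4), nn.ReLU(),   # encoder
    nn.Linear(4, 20)               # decoder with a linear output, suited to MSE
)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

original_inputs = torch.randn(32, 20)          # stand-in batch of 32 instances

reconstructions = model(original_inputs)       # forward pass
loss = criterion(reconstructions, original_inputs)

optimizer.zero_grad()                          # standard backpropagation step
loss.backward()
optimizer.step()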
To choose between these options in practice:

Analyze your input data: Is it continuous or binary? Is it normalized, and to what range ([0, 1], [-1, 1], or unbounded)?
Select based on data type and normalization: Use BCE for binary data or data normalized to [0, 1], and MSE (or MAE if outliers are a concern) for general continuous data, typically with a linear or tanh output activation.
Align with output layer activation: BCE pairs with sigmoid, while MSE and MAE pair with linear, tanh, or sometimes sigmoid.

Choosing the right loss function is a foundational step in building an effective autoencoder. It directly influences what the model learns and how well it performs its primary task of reconstruction, which in turn affects the quality of the features you'll extract from the bottleneck layer. In the upcoming hands-on session, you'll see one of these in action.