Building on the components introduced earlier, let's formalize the basic autoencoder mathematically. An autoencoder consists of two main parts: an encoder and a decoder, which are typically implemented as neural networks.
The encoder, denoted by $f_\phi$, takes an input vector $x \in \mathbb{R}^n$ and maps it to a lower-dimensional latent representation (or code) $z \in \mathbb{R}^m$, where $m < n$. This mapping is parameterized by the encoder's parameters $\phi$, which include weights and biases. We can write this as:

$$z = f_\phi(x)$$

For a simple autoencoder with a single hidden layer, this might look like:

$$z = \sigma_e(W_e x + b_e)$$

Here, $W_e$ is the weight matrix, $b_e$ is the bias vector, and $\sigma_e$ is an element-wise activation function (like sigmoid, ReLU, or tanh) for the encoder. In practice, the encoder can be a much deeper network with multiple layers.
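As a concrete sketch, such a single-layer encoder can be written in a few lines of PyTorch. The dimensions below (a 784-dimensional input compressed to a 32-dimensional code) are illustrative assumptions, not values from the text:

```python
import torch
import torch.nn as nn

input_dim, latent_dim = 784, 32  # illustrative sizes: n = 784, m = 32

# Single-layer encoder: z = sigma_e(W_e x + b_e)
encoder = nn.Sequential(
    nn.Linear(input_dim, latent_dim),  # computes W_e x + b_e
    nn.ReLU(),                         # element-wise activation sigma_e
)

x = torch.rand(1, input_dim)  # a dummy input vector
z = encoder(x)                # latent code, shape (1, latent_dim)
```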
The decoder, denoted by $g_\theta$, takes the latent representation $z$ and maps it back to a reconstruction $\hat{x} \in \mathbb{R}^n$. The goal is for $\hat{x}$ to be as close as possible to the original input $x$. This mapping is parameterized by the decoder's parameters $\theta$:

$$\hat{x} = g_\theta(z)$$

Similar to the encoder, a simple single-layer decoder might be formulated as:

$$\hat{x} = \sigma_d(W_d z + b_d)$$

Here, $W_d$ and $b_d$ are the decoder's weights and biases, and $\sigma_d$ is the decoder's activation function. The choice of $\sigma_d$ often depends on the nature of the input data $x$. For instance, if $x$ represents pixel values normalized between 0 and 1, a sigmoid activation is common. If $x$ can take any real value (after normalization), a linear activation might be used.
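A matching single-layer decoder is sketched below. The sigmoid output assumes inputs normalized to [0, 1], as discussed above; the layer sizes mirror the encoder sketch and are again illustrative:

```python
import torch
import torch.nn as nn

input_dim, latent_dim = 784, 32  # same illustrative sizes as the encoder sketch

# Single-layer decoder: x_hat = sigma_d(W_d z + b_d)
decoder = nn.Sequential(
    nn.Linear(latent_dim, input_dim),  # computes W_d z + b_d
    nn.Sigmoid(),                      # sigma_d, keeps outputs in (0, 1)
)

z = torch.rand(1, latent_dim)  # a dummy latent code
x_hat = decoder(z)             # reconstruction, shape (1, input_dim)
```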
The entire autoencoder combines the encoder and decoder. Given an input $x$, the reconstruction $\hat{x}$ is obtained by first encoding $x$ to $z$ and then decoding $z$ back to $\hat{x}$:

$$\hat{x} = g_\theta(f_\phi(x))$$
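Bundling the two mappings into one module makes this composition explicit. The following is a minimal sketch, assuming the same illustrative sizes as above:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder: x_hat = g_theta(f_phi(x))."""

    def __init__(self, input_dim=784, latent_dim=32):  # illustrative sizes
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)     # f_phi(x)
        return self.decoder(z)  # g_theta(z)

x = torch.rand(4, 784)          # a small dummy batch
x_hat = Autoencoder()(x)        # reconstruction, same shape as x
```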
The autoencoder is trained by minimizing the difference between the original input $x$ and its reconstruction $\hat{x}$. This difference is quantified by a reconstruction loss function, $L(x, \hat{x})$. The overall objective function for the autoencoder, where $\phi$ and $\theta$ together represent all trainable parameters, is typically the average loss over a dataset of $N$ samples:

$$J(\phi, \theta) = \frac{1}{N} \sum_{i=1}^{N} L\left(x^{(i)}, \hat{x}^{(i)}\right)$$

Substituting the encoder and decoder functions:

$$J(\phi, \theta) = \frac{1}{N} \sum_{i=1}^{N} L\left(x^{(i)}, g_\theta\left(f_\phi\left(x^{(i)}\right)\right)\right)$$
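In code, this objective is simply the per-sample loss averaged over a batch. The sketch below uses squared error as $L$ and random stand-in data; the model and batch sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                 # a tiny autoencoder, sizes are illustrative
    nn.Linear(784, 32), nn.ReLU(),     # encoder f_phi
    nn.Linear(32, 784), nn.Sigmoid(),  # decoder g_theta
)

X = torch.rand(64, 784)                # stand-in batch of N = 64 samples
X_hat = model(X)                       # g_theta(f_phi(x)) for every sample

# J(phi, theta) = (1/N) * sum_i L(x_i, x_hat_i), with L the squared error
objective = ((X - X_hat) ** 2).sum(dim=1).mean()
```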
The choice of $L$ is important and depends on the characteristics of the input data $x$:
Mean Squared Error (MSE): Used when the input data is continuous, often assumed to be Gaussian. It measures the average squared difference between the elements of $x$ and $\hat{x}$:

$$L_{\text{MSE}}(x, \hat{x}) = \frac{1}{n} \sum_{j=1}^{n} \left(x_j - \hat{x}_j\right)^2$$

Minimizing MSE corresponds to maximizing the log-likelihood under a Gaussian assumption for the reconstruction error. It is suitable for inputs like normalized image pixels or continuous features.

Binary Cross-Entropy (BCE): Used when the input data is binary or can be interpreted as probabilities (e.g., values in the range [0, 1]). This often applies to images like MNIST where pixel values are treated as Bernoulli parameters:

$$L_{\text{BCE}}(x, \hat{x}) = -\sum_{j=1}^{n} \left[x_j \log \hat{x}_j + (1 - x_j) \log\left(1 - \hat{x}_j\right)\right]$$

Minimizing BCE corresponds to maximizing the log-likelihood under a Bernoulli distribution assumption for each element of the input. When using BCE, the decoder's final activation should typically be a sigmoid function so that outputs lie in the range (0, 1) and are interpretable as probabilities. A short comparison of the two criteria follows below.
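The sketch below computes both losses on random stand-in tensors using PyTorch's built-in criteria; note that `nn.MSELoss` and `nn.BCELoss` average over every element by default:

```python
import torch
import torch.nn as nn

x = torch.rand(8, 784)                            # dummy inputs in [0, 1]
x_hat = torch.rand(8, 784).clamp(1e-6, 1 - 1e-6)  # dummy reconstructions, as if from a sigmoid

mse = nn.MSELoss()  # continuous inputs (Gaussian assumption on reconstruction error)
bce = nn.BCELoss()  # binary or [0, 1] inputs (Bernoulli assumption per element)

print("MSE:", mse(x_hat, x).item())
print("BCE:", bce(x_hat, x).item())
```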
The goal of training is to find the optimal parameters $\phi^*$ and $\theta^*$ that minimize the objective function $J(\phi, \theta)$. This is achieved using optimization algorithms based on gradient descent. Backpropagation is used to compute the gradients of the loss with respect to all parameters in $\phi$ and $\theta$. These gradients, $\nabla_\phi J$ and $\nabla_\theta J$, are then used by optimizers like Adam or RMSprop to iteratively update the parameters.
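A minimal training loop sketch tying these pieces together is shown below, using Adam and MSE on random placeholder data; the model sizes, learning rate, batch size, and epoch count are all assumptions made for illustration:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                 # tiny autoencoder, sizes are placeholders
    nn.Linear(784, 32), nn.ReLU(),     # encoder f_phi
    nn.Linear(32, 784), nn.Sigmoid(),  # decoder g_theta
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.rand(256, 784)               # stand-in dataset of N = 256 samples

for epoch in range(5):
    for batch in X.split(64):          # mini-batches of 64 samples
        x_hat = model(batch)           # forward pass: g_theta(f_phi(x))
        loss = loss_fn(x_hat, batch)   # reconstruction loss L(x, x_hat)
        optimizer.zero_grad()
        loss.backward()                # backpropagation: gradients w.r.t. phi and theta
        optimizer.step()               # Adam update of all parameters
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```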
In essence, the mathematical formulation defines a specific optimization problem: learn functions $f_\phi$ and $g_\theta$ such that applying them sequentially ($\hat{x} = g_\theta(f_\phi(x))$) reconstructs the input data as accurately as possible, according to the chosen loss metric $L$. The bottleneck forces the network to learn a compressed representation $z$ that captures the most salient information needed for reconstruction.