Building on the components introduced earlier, let's formalize the basic autoencoder mathematically. An autoencoder consists of two main parts: an encoder and a decoder, which are typically implemented as neural networks.
The encoder, denoted by $f$, takes an input vector $x \in \mathbb{R}^d$ and maps it to a lower-dimensional latent representation (or code) $z \in \mathbb{R}^k$, where $k < d$. This mapping is parameterized by the encoder's parameters $\theta_e$, which include weights and biases. We can write this as:
$$z = f(x; \theta_e)$$
For a simple autoencoder with a single hidden layer, this might look like:
$$z = \sigma_e(W_e x + b_e)$$
Here, $W_e$ is the weight matrix, $b_e$ is the bias vector, and $\sigma_e$ is an element-wise activation function (like sigmoid, ReLU, or tanh) for the encoder. In practice, the encoder can be a much deeper network with multiple layers.
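To make this concrete, a single-layer encoder like the one above can be written in a few lines. This is a minimal sketch assuming PyTorch; the input dimension, code dimension, and ReLU activation are illustrative choices, not part of the formulation.

```python
import torch
import torch.nn as nn

d, k = 784, 32  # illustrative input and latent dimensions (e.g., flattened 28x28 images)

# Single hidden layer encoder: z = sigma_e(W_e x + b_e)
encoder = nn.Sequential(
    nn.Linear(d, k),  # computes W_e x + b_e
    nn.ReLU(),        # element-wise activation sigma_e
)

x = torch.rand(16, d)  # a batch of 16 stand-in inputs
z = encoder(x)         # latent codes, shape (16, k)
```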
The decoder, denoted by $g$, takes the latent representation $z \in \mathbb{R}^k$ and maps it back to a reconstruction $\hat{x} \in \mathbb{R}^d$. The goal is for $\hat{x}$ to be as close as possible to the original input $x$. This mapping is parameterized by the decoder's parameters $\theta_d$:
$$\hat{x} = g(z; \theta_d)$$
Similar to the encoder, a simple single-layer decoder might be formulated as:
$$\hat{x} = \sigma_d(W_d z + b_d)$$
Here, $W_d$ and $b_d$ are the decoder's weights and biases, and $\sigma_d$ is the decoder's activation function. The choice of $\sigma_d$ often depends on the nature of the input data $x$. For instance, if $x$ represents pixel values normalized between 0 and 1, a sigmoid activation is common. If $x$ can take any real value (after normalization), a linear activation might be used.
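A matching single-layer decoder simply runs in the opposite direction. Again a minimal PyTorch sketch, with a sigmoid output under the assumption that inputs are normalized to [0, 1]:

```python
import torch
import torch.nn as nn

d, k = 784, 32  # same illustrative dimensions as the encoder sketch

# Single-layer decoder: x_hat = sigma_d(W_d z + b_d)
decoder = nn.Sequential(
    nn.Linear(k, d),  # computes W_d z + b_d
    nn.Sigmoid(),     # sigma_d, keeps outputs in (0, 1) for [0, 1]-scaled inputs
)

z = torch.rand(16, k)   # stand-in latent codes
x_hat = decoder(z)      # reconstructions, shape (16, d)
```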
The entire autoencoder combines the encoder and decoder. Given an input $x$, the reconstruction $\hat{x}$ is obtained by first encoding $x$ to $z$ and then decoding $z$ back to $\hat{x}$:
$$\hat{x} = g(f(x; \theta_e); \theta_d)$$
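In code, this composition is naturally expressed as a single module whose forward pass chains the encoder and decoder. The sketch below assumes PyTorch and the same illustrative layer sizes as before:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """x_hat = g(f(x; theta_e); theta_d) with single-layer f and g."""

    def __init__(self, d=784, k=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, k), nn.ReLU())     # f(.; theta_e)
        self.decoder = nn.Sequential(nn.Linear(k, d), nn.Sigmoid())  # g(.; theta_d)

    def forward(self, x):
        z = self.encoder(x)     # compress x to the latent code z
        return self.decoder(z)  # reconstruct x_hat from z

model = Autoencoder()
x = torch.rand(16, 784)
x_hat = model(x)  # reconstruction, same shape as x
```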
The autoencoder is trained by minimizing the difference between the original input $x$ and its reconstruction $\hat{x}$. This difference is quantified by a reconstruction loss function, $L(x, \hat{x})$. The overall objective function $J(\theta)$ for the autoencoder, where $\theta = \{\theta_e, \theta_d\}$ represents all trainable parameters, is typically the average loss over a dataset $D = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ of $N$ samples:
$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\left(x^{(i)}, \hat{x}^{(i)}\right)$$
Substituting the encoder and decoder functions:
$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\left(x^{(i)}, g\left(f\left(x^{(i)}; \theta_e\right); \theta_d\right)\right)$$
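Numerically, $J(\theta)$ is just the reconstruction loss averaged over samples. A small sketch, assuming MSE as the loss and a random tensor standing in for the dataset; note that PyTorch's mean reduction averages over both samples and dimensions, matching the $\frac{1}{N}$ and $\frac{1}{d}$ factors in the formulas:

```python
import torch
import torch.nn as nn

# Stand-in autoencoder and dataset (d = 784, k = 32, N = 1000) for illustration.
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(),
                      nn.Linear(32, 784), nn.Sigmoid())
data = torch.rand(1000, 784)

criterion = nn.MSELoss(reduction="mean")  # averages over all samples and dimensions

with torch.no_grad():                     # just evaluating J, not training yet
    x_hat = model(data)                   # x_hat = g(f(x)) for every sample
    J = criterion(x_hat, data)            # J(theta) at the current parameters
```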
The choice of $L(x, \hat{x})$ is important and depends on the characteristics of the input data $x$. The two most common choices are described below and compared in a short code sketch afterward:
Mean Squared Error (MSE): Used when the input data is continuous, often assumed to be Gaussian. It measures the average squared difference between the elements of $x$ and $\hat{x}$:
$$L_{\text{MSE}}(x, \hat{x}) = \frac{1}{d} \sum_{j=1}^{d} (x_j - \hat{x}_j)^2$$
Minimizing MSE corresponds to maximizing the log-likelihood under a Gaussian assumption for the reconstruction error. It's suitable for inputs like normalized image pixels or continuous features.
Binary Cross-Entropy (BCE): Used when the input data is binary or can be interpreted as probabilities (e.g., values in the range [0, 1]). This often applies to images like MNIST where pixel values are treated as Bernoulli parameters:
$$L_{\text{BCE}}(x, \hat{x}) = -\frac{1}{d} \sum_{j=1}^{d} \left[ x_j \log(\hat{x}_j) + (1 - x_j) \log(1 - \hat{x}_j) \right]$$
Minimizing BCE corresponds to maximizing the log-likelihood under a Bernoulli distribution assumption for each element of the input. When using BCE, the decoder's final activation function $\sigma_d$ should typically be a sigmoid function to ensure outputs $\hat{x}_j$ are in the range (0, 1), interpretable as probabilities.
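Both losses are available as standard criteria in PyTorch; the brief sketch below compares them on made-up tensors, with the reconstructions passed through a sigmoid so BCE receives values in (0, 1):

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()  # continuous inputs, e.g., standardized features or pixels
bce = nn.BCELoss()  # inputs in [0, 1], decoder ending in a sigmoid

x = torch.rand(16, 784)                      # targets in [0, 1]
x_hat = torch.sigmoid(torch.randn(16, 784))  # pretend reconstructions in (0, 1)

loss_mse = mse(x_hat, x)  # mean squared difference over all elements
loss_bce = bce(x_hat, x)  # mean negative Bernoulli log-likelihood per element
```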
The goal of training is to find the optimal parameters $\theta^* = \{\theta_e^*, \theta_d^*\}$ that minimize the objective function $J(\theta)$. This is achieved using optimization algorithms based on gradient descent. Backpropagation is used to compute the gradients of the loss $L$ with respect to all parameters in $\theta_e$ and $\theta_d$. These gradients, $\nabla_{\theta_e} J(\theta)$ and $\nabla_{\theta_d} J(\theta)$, are then used by optimizers like Adam or RMSprop to iteratively update the parameters.
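A minimal training loop tying these pieces together might look like the following sketch. The random dataset, batch size, learning rate, BCE loss, and Adam optimizer are all illustrative assumptions rather than requirements of the formulation:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Random data standing in for a real dataset of N = 1000 samples in [0, 1].
data = torch.rand(1000, 784)
loader = DataLoader(TensorDataset(data), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(),     # encoder f
                      nn.Linear(32, 784), nn.Sigmoid())  # decoder g
criterion = nn.BCELoss()                                 # reconstruction loss L
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    for (x,) in loader:             # the input is also the reconstruction target
        x_hat = model(x)            # forward pass: x_hat = g(f(x))
        loss = criterion(x_hat, x)  # L(x, x_hat), averaged over the batch
        optimizer.zero_grad()
        loss.backward()             # backpropagation: gradients w.r.t. theta_e, theta_d
        optimizer.step()            # Adam update of all parameters
```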
In essence, the mathematical formulation defines a specific optimization problem: learn functions $f$ and $g$ such that applying them sequentially ($g \circ f$) reconstructs the input data as accurately as possible, according to the chosen loss metric $L$. The bottleneck $z$ forces the network to learn a compressed representation that captures the most salient information needed for reconstruction.