As highlighted in the chapter introduction, standard autoencoders, while proficient at reconstruction, struggle with generating novel data. Their deterministic nature often results in latent spaces that are not continuous or structured enough for meaningful sampling. Generating realistic samples requires moving beyond mapping an input x to a single point z in the latent space and back to a single reconstruction x′.
Variational Autoencoders introduce a probabilistic perspective to both the encoding and decoding processes, laying the groundwork for effective generative modeling. Instead of deterministic mappings, we work with probability distributions.
In a standard autoencoder, the encoder f maps an input x directly to a latent code z=f(x). In contrast, the VAE encoder doesn't output a single point z. Instead, it outputs parameters for a probability distribution over the latent space, conditioned on the input x. This distribution, denoted as qϕ(z∣x), represents our belief about the likely latent codes z that could have generated the input x. The parameters of the encoder network are denoted by ϕ.
Typically, qϕ(z∣x) is chosen to be a multivariate Gaussian distribution with a diagonal covariance matrix. This simplifies calculations and works well in practice. Therefore, for a given input x, the encoder network (parameterized by ϕ) outputs two vectors: a mean vector μϕ(x) and a vector of variances σϕ²(x), one value per latent dimension.
The approximate posterior distribution is then:
$$
q_\phi(z \mid x) = \mathcal{N}\!\left(z \mid \mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\right)
$$

Here, N(z∣μ,Σ) denotes a Gaussian distribution over z with mean μ and covariance Σ. Using a diagonal covariance matrix means we assume conditional independence between the dimensions of the latent variable z, given the input x.
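To make this concrete, here is a minimal encoder sketch in PyTorch. The class name and the layer sizes (784-dimensional inputs, a 256-unit hidden layer, a 32-dimensional latent space) are illustrative assumptions rather than part of any specific implementation; the point is simply that the network has two output heads, one for the mean and one for the log-variance of qϕ(z∣x).

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps an input x to the parameters (mean, log-variance) of q_phi(z|x)."""

    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # Two heads: one for the mean vector, one for the log-variance vector.
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        mu = self.mu_head(h)
        logvar = self.logvar_head(h)   # predicting log sigma^2 keeps the variance positive
        return mu, logvar

encoder = GaussianEncoder()
x = torch.randn(8, 784)                # a batch of 8 flattened inputs
mu, logvar = encoder(x)
# q_phi(z|x) as a diagonal Gaussian over the latent space.
# (During training, samples are usually drawn via the reparameterization trick, not shown here.)
q_z_given_x = torch.distributions.Normal(mu, torch.exp(0.5 * logvar))
print(q_z_given_x.sample().shape)      # torch.Size([8, 32])
```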
Why is this probabilistic encoding significant? Because each input is mapped to a whole distribution over latent codes rather than a single point, nearby latent codes are encouraged to decode to similar data. Together with the regularization introduced later in the objective, this yields the continuous, structured latent space that deterministic autoencoders lack, which is what makes sampling from the latent space meaningful.
Similar to the encoder, the VAE decoder also adopts a probabilistic view. A standard decoder g takes a latent code z and outputs a single reconstruction x′=g(z). The VAE decoder, parameterized by θ, takes a latent variable z (sampled from qϕ(z∣x) during training, or from the prior p(z) for generation) and outputs the parameters of a probability distribution over the data space x. This distribution is denoted as pθ(x∣z). It represents the likelihood of observing data x given the latent code z.
The choice of the distribution pθ(x∣z) depends on the nature of the data x:
Continuous Data (e.g., normalized image pixels): A common choice is a multivariate Gaussian distribution. The decoder network might output the mean μθ(z), and the covariance could be assumed to be fixed (e.g., σ²I, where σ² is a hyperparameter or learned) or also output by the network.
$$
p_\theta(x \mid z) = \mathcal{N}\!\left(x \mid \mu_\theta(z), \sigma^2 I\right)
$$

Maximizing the log-likelihood log pθ(x∣z) under this assumption corresponds to minimizing the Mean Squared Error (MSE) between the input x and the generated mean μθ(z), up to additive and scaling constants that do not depend on θ (see the sketch after the binary case below).
Binary Data (e.g., black and white images): A common choice is a product of independent Bernoulli distributions. The decoder network outputs a vector of probabilities pθ(z), where each element represents the probability of the corresponding dimension of x being 1.
$$
p_\theta(x \mid z) = \prod_{i=1}^{D} p_{\theta,i}(z)^{x_i} \left(1 - p_{\theta,i}(z)\right)^{1 - x_i}
$$

where D is the dimensionality of x. Maximizing the log-likelihood log pθ(x∣z) in this case corresponds to minimizing the Binary Cross-Entropy (BCE) loss between the input x and the output probabilities pθ(z).
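Both correspondences are easy to check numerically. The following sketch (assuming PyTorch; the dimensionality D = 784 and the fixed variance σ² = 0.1 are arbitrary illustrative choices) evaluates the exact negative log-likelihoods with torch.distributions and compares them to the MSE and BCE losses.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, Bernoulli

torch.manual_seed(0)
D = 784
x_cont = torch.rand(D)                  # "continuous" data in [0, 1]
x_bin = (torch.rand(D) > 0.5).float()   # binary data

# Gaussian p_theta(x|z): the decoder outputs a mean vector; variance sigma^2 is fixed.
mu = torch.rand(D)
sigma2 = 0.1
gauss_nll = -Normal(mu, sigma2 ** 0.5).log_prob(x_cont).sum()
mse = F.mse_loss(mu, x_cont, reduction="sum")
const = 0.5 * D * torch.log(torch.tensor(2 * torch.pi * sigma2))
print(gauss_nll.item(), (mse / (2 * sigma2) + const).item())   # equal up to float error

# Bernoulli p_theta(x|z): the decoder outputs a probability vector.
probs = torch.rand(D).clamp(1e-6, 1 - 1e-6)
bern_nll = -Bernoulli(probs=probs).log_prob(x_bin).sum()
bce = F.binary_cross_entropy(probs, x_bin, reduction="sum")
print(bern_nll.item(), bce.item())                             # BCE equals the Bernoulli NLL
```

The Gaussian negative log-likelihood equals MSE/(2σ²) plus a constant, so with a fixed σ² the two differ only by terms that do not affect which mean μθ(z) is optimal, while the Bernoulli negative log-likelihood and BCE agree exactly.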
This probabilistic decoder provides a principled way to define the reconstruction quality. Instead of just minimizing a distance metric, we aim to maximize the likelihood of the original data x under the distribution generated by the decoder, given the latent code z. This fits naturally into the overall probabilistic framework of VAEs and the derivation of their objective function.
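As one possible realization of pθ(x∣z) for binary data, the sketch below mirrors the earlier encoder example (again assuming PyTorch, with the same illustrative dimensions): the decoder's forward pass returns distribution parameters, here per-pixel Bernoulli probabilities, and the reconstruction term is the log-likelihood of x under that distribution.

```python
import torch
import torch.nn as nn

class BernoulliDecoder(nn.Module):
    """Maps a latent code z to the parameters of p_theta(x|z), here Bernoulli probabilities."""

    def __init__(self, latent_dim=32, hidden_dim=256, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, output_dim), nn.Sigmoid(),   # probabilities in (0, 1)
        )

    def forward(self, z):
        return self.net(z)

decoder = BernoulliDecoder()
z = torch.randn(8, 32)                        # e.g. sampled from q_phi(z|x) or from the prior
probs = decoder(z)                            # parameters of p_theta(x|z)
x = (torch.rand(8, 784) > 0.5).float()        # a batch of binary "data" for illustration
# Reconstruction term: log p_theta(x|z), summed over data dimensions.
recon_log_likelihood = torch.distributions.Bernoulli(probs=probs).log_prob(x).sum(dim=-1)
print(recon_log_likelihood.shape)             # torch.Size([8])
```

For continuous data, the decoder would instead output μθ(z) (and optionally a variance) and use the Gaussian likelihood described above.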
Simplified flow of a Variational Autoencoder, highlighting the probabilistic nature of the encoder (qϕ(z∣x)) and decoder (pθ(x∣z)). The encoder maps input x to parameters of a latent distribution, from which z is sampled. The decoder maps z to parameters of an output distribution, modeling the likelihood of the data.
By defining both the encoder and decoder probabilistically, VAEs establish a formal connection to latent variable models and enable the derivation of a well-principled objective function, the Evidence Lower Bound (ELBO), which balances data reconstruction with latent space regularization. This structure is what empowers VAEs to learn latent spaces suitable for generating new data. We will explore the full model perspective and the ELBO derivation in the subsequent sections.