Variational Autoencoders (VAEs) represent a significant step beyond the autoencoder architectures discussed in previous chapters. While standard autoencoders excel at dimensionality reduction and learning compressed representations for reconstruction, VAEs introduce a probabilistic framework that endows them with powerful generative capabilities. They are designed not just to reconstruct inputs, but to learn a smooth, continuous, and structured latent space from which new data samples can be generated. This section will walk you through the fundamental principles that define VAEs.
Standard autoencoders learn a deterministic encoder function. This means for a given input $x$, the encoder maps it directly to a single, fixed point $z$ in the latent space. Similarly, the decoder deterministically maps that point $z$ back to a reconstruction $\hat{x}$.
VAEs, however, operate differently by incorporating probability. The VAE encoder does not output a single latent vector for an input $x$. Instead, it outputs parameters that describe a probability distribution in the latent space. Typically, this is a Gaussian distribution characterized by a mean vector $\mu_x$ and a variance vector $\sigma_x^2$ (or more commonly, its logarithm, $\log(\sigma_x^2)$, for numerical stability). So, for each input $x$, the encoder defines a distribution $q(z \mid x)$ from which we can sample a latent vector $z$.
Think of it this way: a standard autoencoder might say, "This input image maps to this exact coordinate in my latent feature map." A VAE encoder, in contrast, says, "This input image most likely comes from this general region in my latent feature map, a region described by this specific Gaussian distribution."
The encoder network in a VAE, typically a neural network, is tasked with learning this mapping from an input $x$ to the parameters of a probability distribution. For each input, it produces:
- A mean vector $\mu$, giving the center of the latent distribution for that input.
- A log-variance vector $\log(\sigma^2)$, describing the spread of that distribution along each latent dimension.
These two vectors, $\mu$ and $\log(\sigma^2)$, fully parameterize a diagonal Gaussian distribution for that input in the latent space. This means we can sample a latent vector $z$ for input $x$ as $z \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, where $\mathrm{diag}(\sigma^2)$ is a diagonal covariance matrix with $\sigma^2$ on the diagonal. The term $q(z \mid x)$ is the formal notation for this learned conditional probability distribution of the latent variables $z$ given the input $x$.
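The following PyTorch sketch illustrates one way such an encoder could be structured. The `VAEEncoder` class name, the fully connected layers, and the layer sizes are assumptions chosen for illustration, not a prescribed architecture.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Maps an input x to the parameters (mu, log_var) of q(z|x)."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=20):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)       # mean vector
        self.to_log_var = nn.Linear(hidden_dim, latent_dim)  # log-variance vector

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.to_mu(h), self.to_log_var(h)

encoder = VAEEncoder()
x = torch.randn(8, 784)              # a dummy batch of flattened inputs
mu, log_var = encoder(x)
std = torch.exp(0.5 * log_var)       # sigma = exp(log(sigma^2) / 2)
# Draw one z per input from q(z|x) = N(mu, diag(sigma^2)).
# Sampling this way does not let gradients flow back through mu and std;
# the reparameterization trick (covered later) addresses that.
z = torch.distributions.Normal(mu, std).sample()
```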
The decoder network in a VAE takes a latent vector $z$ and aims to reconstruct the original input $x$ or generate a new sample $\hat{x}$. During training, $z$ is typically sampled from the distribution $q(z \mid x)$ defined by the encoder for a specific input $x$. For generating entirely new data after training, $z$ is sampled from a chosen prior distribution $p(z)$ (often a standard normal distribution).
The decoder itself can also be probabilistic, defining a distribution $p(x \mid z)$ which models the probability of observing data $x$ given the latent vector $z$. For example:
- For real-valued data such as pixel intensities, $p(x \mid z)$ is often modeled as a Gaussian whose mean is the decoder's output, which corresponds to a squared-error reconstruction term.
- For binary data, $p(x \mid z)$ is often modeled as a Bernoulli distribution over each dimension, which corresponds to a binary cross-entropy reconstruction term.
Because the decoder can process any point $z$ sampled from the latent space (either from $q(z \mid x)$ or $p(z)$) to produce a data sample, it effectively functions as a generative model.
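As a complement to the encoder sketch above, here is a minimal decoder in the same style. The `VAEDecoder` name, layer sizes, and the sigmoid output (suited to a Bernoulli likelihood over binary data) are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class VAEDecoder(nn.Module):
    """Maps a latent vector z back to data space, parameterizing p(x|z)."""
    def __init__(self, latent_dim=20, hidden_dim=256, output_dim=784):
        super().__init__()
        self.hidden = nn.Linear(latent_dim, hidden_dim)
        self.to_output = nn.Linear(hidden_dim, output_dim)

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        # Sigmoid outputs in (0, 1): per-dimension Bernoulli parameters for
        # binary data (a Gaussian mean would suit real-valued data instead).
        return torch.sigmoid(self.to_output(h))

decoder = VAEDecoder()
z = torch.randn(8, 20)   # any point in the latent space can be decoded
x_hat = decoder(z)       # shape: (8, 784)
```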
Flow of information in a Variational Autoencoder, highlighting the probabilistic encoder and decoder components alongside the sampling step in the latent space.
A primary objective of VAEs is to learn a latent space that is not merely compressed, but also continuous and meaningfully structured.
To achieve this desirable structure, the VAE's loss function, which we will examine in detail in a subsequent section, incorporates a critical regularization term. This regularizer is the Kullback-Leibler (KL) divergence, denoted $D_{\mathrm{KL}}(q(z \mid x) \,\|\, p(z))$. It measures the difference between the distribution $q(z \mid x)$ learned by the encoder for a given input $x$, and a predefined prior distribution $p(z)$ over the latent space. This prior $p(z)$ is typically chosen to be a standard normal distribution, $\mathcal{N}(0, I)$ (a Gaussian with zero mean and unit variance in all dimensions).
By penalizing deviations of $q(z \mid x)$ from $p(z)$, this KL divergence term encourages the encoder to:
- Keep the means $\mu$ for different inputs near the origin rather than scattered far apart.
- Keep the variances $\sigma^2$ close to one, so each input's distribution covers a meaningful region instead of collapsing to a point.
- Produce distributions for different inputs that overlap, leaving no large gaps in the latent space.
This regularization is what helps create a smooth, dense latent space suitable for sampling and meaningful interpolation. If the encoder were to try to make $\sigma^2$ very small for each input to perfectly pinpoint its latent position (essentially behaving like a standard autoencoder), the KL divergence term would impose a high penalty.
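For a diagonal Gaussian $q(z \mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and the prior $p(z) = \mathcal{N}(0, I)$, this KL divergence has a simple closed form, $D_{\mathrm{KL}} = -\tfrac{1}{2}\sum_j \bigl(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\bigr)$. The short sketch below shows how it is commonly computed; it reuses the `mu` and `log_var` tensors from the encoder example above.

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian q,
    summed over latent dimensions and averaged over the batch."""
    kl_per_example = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)
    return kl_per_example.mean()

# Example with dummy encoder outputs: batch of 8 inputs, 20 latent dimensions.
mu = torch.zeros(8, 20)
log_var = torch.zeros(8, 20)               # sigma^2 = 1 everywhere
print(kl_to_standard_normal(mu, log_var))  # tensor(0.) -- q already matches the prior
```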
Once a VAE has been successfully trained, its generative capability can be used to create new data samples that resemble the training data but are not direct copies. This process is straightforward (and is sketched in code below):
1. Sample a latent vector $z$ from the prior distribution $p(z)$, typically $z \sim \mathcal{N}(0, I)$.
2. Pass $z$ through the trained decoder, which maps it to a new data sample $\hat{x}$.
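A minimal sketch of this two-step procedure, assuming the illustrative `VAEDecoder` defined earlier has already been trained:

```python
import torch

# Step 1: sample latent vectors from the prior p(z) = N(0, I).
num_samples, latent_dim = 16, 20
z = torch.randn(num_samples, latent_dim)

# Step 2: decode them into data space to obtain new samples.
decoder.eval()                    # the (assumed trained) VAEDecoder from above
with torch.no_grad():
    new_samples = decoder(z)      # shape: (16, 784)
```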
This ability to sample from a well-behaved latent space and generate novel, coherent data is a hallmark of VAEs, setting them apart from simpler autoencoder variants. The probabilistic encoding and the KL regularization are the twin pillars supporting this capability. However, the act of sampling from $q(z \mid x)$ during training introduces a stochastic step, which normally prevents gradients from flowing back through the encoder. A clever technique called the "Reparameterization Trick," which will be covered shortly, elegantly addresses this challenge, enabling end-to-end training of VAEs using standard backpropagation.