Our primary goal in generative modeling is to learn the underlying probability distribution p(x) of some observed data x. Variational Autoencoders approach this by introducing latent variables z, which are unobserved variables that capture underlying structure or features of the data. We assume data x is generated from these latent variables z via a generative process pθ(x∣z), where z itself is drawn from a prior distribution p(z). The marginal likelihood of an observation x is then given by:
$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$
The parameters θ of this generative model (often a neural network, the "decoder") are what we aim to learn. Maximizing logpθ(x) directly is challenging because this integral is usually intractable, especially when pθ(x∣z) is complex (like a deep neural network) and z is high-dimensional.
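To make the intractability concrete, here is a minimal sketch of the most naive approach: estimating the marginal likelihood by Monte Carlo, sampling z from the prior and averaging the likelihood p(x∣z). The function and parameter names (`log_marginal_mc`, `decoder_mean`, `obs_std`) and the toy linear "decoder" are illustrative assumptions, not part of the text; the point is that this unbiased estimator becomes hopelessly high-variance as the latent and data dimensions grow, which is exactly why we need a better objective.

```python
import numpy as np

def log_marginal_mc(x, decoder_mean, n_samples=10_000, obs_std=1.0, latent_dim=2):
    """Naive Monte Carlo estimate of log p(x) = log E_{z ~ p(z)}[p(x|z)].

    `decoder_mean` stands in for the decoder network: it maps a latent z to the
    mean of a Gaussian p(x|z) with standard deviation `obs_std`. The estimator is
    unbiased in p(x), but almost no prior samples explain x in high dimensions,
    so its variance explodes -- hence VAEs do not optimize log p(x) directly.
    """
    z = np.random.randn(n_samples, latent_dim)        # z ~ p(z) = N(0, I)
    mu = np.stack([decoder_mean(zi) for zi in z])     # decoder output per sample
    d = x.shape[-1]
    # log p(x | z_i) for an isotropic Gaussian likelihood
    log_px_given_z = (
        -0.5 * np.sum((x - mu) ** 2, axis=-1) / obs_std**2
        - 0.5 * d * np.log(2 * np.pi * obs_std**2)
    )
    # log p(x) ~= log( (1/N) * sum_i p(x|z_i) ), computed stably in log space
    return np.logaddexp.reduce(log_px_given_z) - np.log(n_samples)

# Toy usage with a linear "decoder": x given z is N(W z, I)
W = np.random.randn(3, 2)
x_obs = np.array([0.5, -1.0, 2.0])
print(log_marginal_mc(x_obs, lambda z: W @ z))
```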
Furthermore, to perform tasks like encoding data into its latent representation, we often need access to the posterior distribution pθ(z∣x) = pθ(x∣z)p(z) / pθ(x). This is also intractable because computing its denominator, pθ(x), is intractable.
This is where variational inference comes into play. Instead of computing the true posterior pθ(z∣x), we introduce a simpler, tractable distribution qϕ(z∣x) to approximate it. This distribution qϕ(z∣x), often called the variational posterior or inference network (and in VAEs, the "encoder"), is typically parameterized by ϕ (e.g., parameters of another neural network). Our goal is to make qϕ(z∣x) as close as possible to the true posterior pθ(z∣x).
Let's start with the log-likelihood of a data point x, logpθ(x). We can rewrite it using our approximate posterior qϕ(z∣x):
$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x)\big]$$
This is valid because logpθ(x) does not depend on z, and ∫qϕ(z∣x)dz=1. Now, we use the definition of conditional probability, pθ(x)=pθ(x,z)/pθ(z∣x):
$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{p_\theta(z \mid x)}\right]$$
Next, we multiply and divide by qϕ(z∣x) inside the logarithm, a common trick:

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)\, q_\phi(z \mid x)}{q_\phi(z \mid x)\, p_\theta(z \mid x)}\right] = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] + \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right]$$

The second expectation is exactly the Kullback-Leibler (KL) divergence DKL(qϕ(z∣x)∣∣pθ(z∣x)). The KL divergence is always non-negative (DKL≥0), and it is zero if and only if qϕ(z∣x)=pθ(z∣x).
The first term is defined as the Evidence Lower Bound (ELBO), denoted L(θ,ϕ;x):
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$
So, we have the fundamental identity:
$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$$
Decomposition of the marginal log-likelihood into the ELBO and the KL divergence between the approximate and true posteriors. Maximizing the ELBO is our tractable objective.
Since DKL(qϕ(z∣x)∣∣pθ(z∣x))≥0, it follows that:
$$\log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x)$$
This is why L(θ,ϕ;x) is called the "Evidence Lower Bound": it provides a lower bound on the log-likelihood of the data (the "evidence"). By maximizing the ELBO with respect to both the generative model parameters θ and the variational parameters ϕ, we are effectively:
- Increasing the lower bound on logpθ(x), which in turn pushes up the log-likelihood we ultimately care about.
- Minimizing the KL divergence DKL(qϕ(z∣x)∣∣pθ(z∣x)), making our approximation qϕ(z∣x) closer to the true posterior pθ(z∣x). If qϕ(z∣x) becomes a perfect approximation, the KL divergence becomes zero and the ELBO equals the true log-likelihood. (A small numerical check of the bound follows below.)
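The bound and its tightness can be checked numerically on a toy model where everything is tractable. The sketch below uses a hand-picked example (not from the text): prior z ~ N(0,1) and likelihood x∣z ~ N(z, σ²), so log pθ(x) is available in closed form and can be compared against the ELBO for an arbitrary Gaussian q(z∣x) = N(m, s²). The function names and constants are illustrative.

```python
import numpy as np
from scipy.stats import norm

def elbo_toy(x, m, s, sigma=0.5):
    """ELBO for the toy model  z ~ N(0,1),  x|z ~ N(z, sigma^2),
    with a Gaussian approximate posterior q(z|x) = N(m, s^2)."""
    # E_q[log p(x|z)] for a Gaussian likelihood, in closed form
    recon = -0.5 * np.log(2 * np.pi * sigma**2) - ((x - m) ** 2 + s**2) / (2 * sigma**2)
    # KL( N(m, s^2) || N(0, 1) ) in closed form
    kl = 0.5 * (m**2 + s**2 - 1.0 - np.log(s**2))
    return recon - kl

x, sigma = 1.3, 0.5
log_px = norm(0.0, np.sqrt(1.0 + sigma**2)).logpdf(x)   # exact log p(x)

# Any choice of (m, s) stays below log p(x) ...
print(log_px, elbo_toy(x, m=0.0, s=1.0, sigma=sigma))

# ... and the bound is tight at the true posterior mean and std,
# where q(z|x) matches p(z|x) exactly and the KL gap vanishes.
m_star = x / (1.0 + sigma**2)
s_star = np.sqrt(sigma**2 / (1.0 + sigma**2))
print(log_px, elbo_toy(x, m=m_star, s=s_star, sigma=sigma))
```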
The ELBO can be rewritten in a form that is often more intuitive for VAEs. Starting from L(θ,ϕ;x)=Eqϕ(z∣x)[logpθ(x,z)−logqϕ(z∣x)] and using pθ(x,z)=pθ(x∣z)p(z):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

This decomposition has two terms:
- **Reconstruction term:** Eqϕ(z∣x)[logpθ(x∣z)]. This term measures how well the decoder pθ(x∣z) can reconstruct the input data x given latent samples z drawn from the encoder's approximation qϕ(z∣x). Maximizing it encourages the VAE to learn latent representations z that retain enough information to reconstruct x. For instance, if pθ(x∣z) is a Gaussian with fixed variance, maximizing this term corresponds (up to constants) to minimizing a squared-error loss; if pθ(x∣z) is a Bernoulli distribution (for binary data), it corresponds to minimizing a binary cross-entropy loss.
- **Regularization term:** DKL(qϕ(z∣x)∣∣p(z)). This term acts as a regularizer. It measures the divergence between the approximate posterior qϕ(z∣x) (output by the encoder for a given x) and the prior p(z) over the latent variables. The prior p(z) is typically chosen to be a simple distribution, such as a standard multivariate Gaussian N(0,I). The KL term encourages the encoder to distribute latent representations z similarly to the prior, which is essential for a well-structured latent space that can be sampled from to generate new data. Without it, the encoder might produce qϕ(z∣x) distributions that are far apart for different x, leading to a non-smooth or "gappy" latent space. (A code sketch of both terms follows this list.)
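As a concrete example of the two terms, here is a minimal sketch of the negative ELBO as a loss function, assuming the common choices above: a Bernoulli decoder for binary data, a diagonal Gaussian encoder, and a standard normal prior. For that choice the regularizer has the well-known closed form $D_{KL}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\big) = \tfrac{1}{2}\sum_j \big(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\big)$. The function and argument names (`negative_elbo`, `recon_logits`, `mu`, `logvar`) are placeholders, not anything defined in the text.

```python
import torch
import torch.nn.functional as F

def negative_elbo(recon_logits, x, mu, logvar):
    """Negative ELBO for binary data, assuming:
      - a Bernoulli decoder p(x|z) whose logits are `recon_logits`,
      - a diagonal Gaussian encoder q(z|x) = N(mu, diag(exp(logvar))),
      - a standard normal prior p(z) = N(0, I).
    Returns a scalar averaged over the batch; minimizing it maximizes the ELBO.
    """
    # Reconstruction term: -E_q[log p(x|z)], i.e. binary cross-entropy per example
    recon = F.binary_cross_entropy_with_logits(
        recon_logits, x, reduction="none"
    ).sum(dim=1)
    # Closed-form KL( q(z|x) || N(0, I) ) for a diagonal Gaussian
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=1)
    return (recon + kl).mean()
```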
The VAE, therefore, consists of:
- An **encoder network** qϕ(z∣x) that takes data x and outputs the parameters of the distribution over z (e.g., a mean and variance if qϕ(z∣x) is Gaussian). Its parameters are ϕ.
- A **decoder network** pθ(x∣z) that takes a latent sample z and outputs the parameters of the distribution over x. Its parameters are θ. (Both networks are sketched in code after this list.)
- A chosen **prior** p(z) over the latent variables, usually N(0,I).
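A minimal sketch of these two networks, under the common choices discussed above (a diagonal Gaussian encoder and a Bernoulli decoder); the class names, layer widths, and dimensions (784/400/20) are illustrative assumptions rather than anything prescribed by the derivation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_phi(z|x): maps x to the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)   # log-variance of q(z|x)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """p_theta(x|z): maps z to the logits of a Bernoulli distribution over x."""
    def __init__(self, z_dim=20, h_dim=400, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim)
        )

    def forward(self, z):
        # Returns logits; the prior p(z) = N(0, I) itself has no parameters.
        return self.net(z)
```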
The training process involves optimizing the ELBO L(θ,ϕ;x) with respect to both θ and ϕ using techniques like stochastic gradient ascent. This derivation provides the theoretical justification for the VAE objective function, balancing data reconstruction with regularization of the latent space. This balance is what allows VAEs to learn rich, structured representations and generate novel data samples. The term "variational" in Variational Autoencoder and Variational Inference refers to methods from the calculus of variations, where we optimize functionals (functions of functions), in this case, finding the best qϕ(z∣x) within a family of distributions parameterized by ϕ.
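Concretely, one stochastic-gradient step on the (negative) ELBO might look like the following sketch. It reuses the hypothetical `Encoder`, `Decoder`, and `negative_elbo` pieces above, and samples z as mu + sigma * eps, a preview of the reparameterization trick discussed in the next section; the optimizer and learning rate are arbitrary choices for illustration.

```python
import torch

# Assumes the hypothetical Encoder, Decoder, and negative_elbo sketched above.
encoder, decoder = Encoder(), Decoder()
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

def training_step(x):
    """One gradient step: ascent on the ELBO via descent on its negative."""
    mu, logvar = encoder(x)                      # q_phi(z|x)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps       # reparameterized sample z ~ q(z|x)
    recon_logits = decoder(z)                    # p_theta(x|z)
    loss = negative_elbo(recon_logits, x, mu, logvar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```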
In the subsequent sections, we will explore each component of this objective, the practicalities of optimizing it (like the reparameterization trick), and the implications of this formulation.