Variational Autoencoders (VAEs) are derived using the principles of variational inference. A common objective in generative modeling is to estimate the probability distribution of the observed data, p(x). For models with latent variables z, this involves calculating the marginal likelihood:
p(x)=∫p(x,z)dz=∫pθ(x∣z)p(z)dz
where p(z) is the prior distribution over the latent variables and pθ(x∣z) is the likelihood of the data given the latent variables, typically parameterized by a decoder network with parameters θ.
However, this integral is frequently intractable for complex models and high-dimensional latent spaces. This intractability extends to the true posterior distribution p(z∣x)=p(x∣z)p(z)/p(x), as its denominator p(x) is the very integral we cannot compute. Variational inference addresses this by introducing an approximation to the true posterior, denoted as qϕ(z∣x). This approximate posterior is typically parameterized by an encoder network with parameters ϕ.
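To make these roles concrete, here is a minimal sketch of how qϕ(z∣x) and pθ(x∣z) might be parameterized by neural networks. It uses PyTorch, a diagonal-Gaussian encoder, and a Bernoulli decoder; the layer widths, dimensions, and activation choices are illustrative assumptions rather than a prescribed architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Parameterizes q_phi(z|x) as a diagonal Gaussian (a common, assumed choice)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):  # hypothetical sizes
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)   # log-variance of q_phi(z|x)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Parameterizes p_theta(x|z), here as independent Bernoullis (an assumption)."""
    def __init__(self, z_dim=20, h_dim=256, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid(),  # Bernoulli means in (0, 1)
        )

    def forward(self, z):
        return self.net(z)
```

The encoder's parameters are the ϕ of qϕ(z∣x) and the decoder's parameters are the θ of pθ(x∣z); both are trained jointly, as developed below.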
The core idea is to make qϕ(z∣x) as close as possible to the true posterior p(z∣x). We measure this "closeness" using the Kullback-Leibler (KL) divergence, DKL(qϕ(z∣x)∣∣p(z∣x)). Our objective is to find parameters ϕ that minimize this KL divergence.
Let's begin with the log-likelihood of the data, logp(x), and see how qϕ(z∣x) and the Evidence Lower Bound (ELBO) emerge.
logp(x)=log∫p(x,z)dz
We can multiply and divide by qϕ(z∣x) inside the integral (assuming qϕ(z∣x)>0 where p(x,z)>0):
logp(x)=log∫qϕ(z∣x)[p(x,z)/qϕ(z∣x)]dz
This can be rewritten as the logarithm of an expectation with respect to qϕ(z∣x):
logp(x)=logEqϕ(z∣x)[p(x,z)/qϕ(z∣x)]
Since the logarithm is a concave function, we can apply Jensen's inequality (logE[Y]≥E[logY]) to move the logarithm inside the expectation:
logp(x)≥Eqϕ(z∣x)[log(p(x,z)/qϕ(z∣x))]
This lower bound is precisely the Evidence Lower Bound (ELBO), often denoted as LELBO or simply L(ϕ,θ;x):
LELBO(ϕ,θ;x)=Eqϕ(z∣x)[logp(x,z)−logqϕ(z∣x)]
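To see the bound in action, the sketch below sets up a toy one-dimensional model in which everything is tractable, p(z)=N(0,1) and pθ(x∣z)=N(x∣z,1), so that logp(x) can be computed exactly (p(x)=N(x∣0,2)). A Monte Carlo estimate of the ELBO under an arbitrarily chosen Gaussian q(z∣x) then sits below logp(x), with the gap equal to DKL(q(z∣x)∣∣p(z∣x)). The observed value and the parameters of q are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

x = 1.5                     # an observed data point (arbitrary)
q_mean, q_std = 0.3, 0.9    # an arbitrary approximate posterior q(z|x) = N(0.3, 0.9^2)

# Monte Carlo estimate of the ELBO: E_q[ log p(x, z) - log q(z|x) ]
z = rng.normal(q_mean, q_std, size=200_000)
log_joint = norm.logpdf(x, loc=z, scale=1.0) + norm.logpdf(z, loc=0.0, scale=1.0)
log_q = norm.logpdf(z, loc=q_mean, scale=q_std)
elbo = np.mean(log_joint - log_q)

# Exact log-evidence: p(x) = N(x | 0, 1 + 1) because prior and likelihood are Gaussian.
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))

print(f"ELBO estimate: {elbo:.4f}")
print(f"log p(x):      {log_px:.4f}")
print(f"gap (= KL(q || p(z|x))): {log_px - elbo:.4f}")  # always >= 0
```

In this toy model the true posterior is N(x/2, 1/2), so moving q(z∣x) toward it drives the gap to zero, which is exactly what maximizing the ELBO with respect to ϕ accomplishes.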
By expanding p(x,z)=pθ(x∣z)p(z), we get another common form:
LELBO(ϕ,θ;x)=Eqϕ(z∣x)[logpθ(x∣z)]−DKL(qϕ(z∣x)∣∣p(z))
The difference between the true log-likelihood logp(x) and the ELBO is exactly the KL divergence between the approximate posterior and the true posterior:
logp(x)−LELBO(ϕ,θ;x)=DKL(qϕ(z∣x)∣∣p(z∣x))
Since the KL divergence is always non-negative (DKL≥0), the ELBO is indeed a lower bound on the log-likelihood of the data. Maximizing the ELBO with respect to ϕ and θ serves two purposes:
It pushes the ELBO closer to the true log-likelihood, effectively minimizing the KL divergence DKL(qϕ(z∣x)∣∣p(z∣x)), thereby making our approximate posterior qϕ(z∣x) a better approximation of the true posterior p(z∣x).
It indirectly maximizes the log-likelihood logp(x), i.e., how well our model explains the observed data.
The log marginal likelihood logp(x) is decomposed into the Evidence Lower Bound (ELBO) and the KL divergence between the approximate posterior qϕ(z∣x) and the true posterior p(z∣x). Maximizing the ELBO effectively maximizes logp(x) while minimizing the approximation error.
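For completeness, this decomposition can also be obtained directly from the definition of the KL divergence, without invoking Jensen's inequality; the short LaTeX fragment below spells out the algebra.

```latex
% Decomposition of log p(x) directly from the definition of the KL divergence
% (uses p(z|x) = p(x,z)/p(x) and the fact that log p(x) is constant with respect to z).
\begin{align*}
D_{\mathrm{KL}}\bigl(q_\phi(z\mid x)\,\|\,p(z\mid x)\bigr)
  &= \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log q_\phi(z\mid x) - \log p(z\mid x)\right] \\
  &= \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log q_\phi(z\mid x) - \log p(x,z)\right] + \log p(x) \\
  &= \log p(x) - \mathcal{L}_{\mathrm{ELBO}}(\phi,\theta;x).
\end{align*}
% Rearranged: log p(x) = L_ELBO(phi, theta; x) + D_KL(q_phi(z|x) || p(z|x)) >= L_ELBO,
% since the KL divergence is non-negative.
```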
Deconstructing the ELBO
The ELBO can be rearranged into a more intuitive form that highlights the two primary objectives of a VAE:
Starting from LELBO=Eqϕ(z∣x)[logpθ(x∣z)+logp(z)−logqϕ(z∣x)], we can group terms:
Expected Reconstruction Log-Likelihood: Eqϕ(z∣x)[logpθ(x∣z)]
This term measures how well the decoder pθ(x∣z) can reconstruct the input data x when given a latent code z sampled from the encoder's approximate posterior qϕ(z∣x). It encourages the model to learn latent representations z that retain sufficient information to rebuild x. This is the "autoencoding" part of the VAE. The specific form of logpθ(x∣z) depends on the data type:
For binary data (e.g., black and white images), pθ(x∣z) is often modeled as a product of Bernoulli distributions. Maximizing logpθ(x∣z) then corresponds to minimizing the binary cross-entropy (BCE) loss between the input x and the reconstructed output x^=decoder(z).
For real-valued data (e.g., images with pixel intensities normalized to [0,1] or continuous signals), pθ(x∣z) is often modeled as a Gaussian distribution, N(x∣μθ(z),σ²I). If the variance σ² is fixed, maximizing this term is equivalent to minimizing the Mean Squared Error (MSE) between x and the decoder's mean output μθ(z). Both correspondences are verified numerically in the code sketch following this list.
KL Divergence Regularizer: −DKL(qϕ(z∣x)∣∣p(z))
This term acts as a regularizer on the latent space. It measures the dissimilarity between the approximate posterior distribution qϕ(z∣x) (produced by the encoder for a given input x) and the prior distribution p(z) over the latent variables. The prior p(z) is typically chosen to be a simple, fixed distribution, most commonly a standard multivariate Gaussian, N(0,I).
Because this term enters the ELBO with a negative sign, maximizing the ELBO minimizes DKL(qϕ(z∣x)∣∣p(z)). This encourages the encoder to produce latent distributions qϕ(z∣x) that are, on average, close to the prior p(z), which has several benefits:
Smoothness and Continuity: It helps to structure the latent space, making it more continuous and less prone to "holes" or disjoint regions. This is important for the generative capabilities of the VAE, as we want to be able to sample z∼p(z) and generate novel, coherent data.
Regularization: It prevents the encoder from learning overly complex or "cheating" posteriors that might simply memorize the input data in z.
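The correspondences between the decoder likelihood and the standard reconstruction losses mentioned above can be checked numerically. The sketch below uses PyTorch with made-up tensors (all shapes and values are arbitrary); it is a consistency check, not part of a training loop.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up reconstruction targets and decoder outputs (purely illustrative).
x_binary = torch.randint(0, 2, (4, 6)).float()               # binary data in {0, 1}
x_real = torch.rand(4, 6)                                     # real-valued data
probs = torch.rand(4, 6).clamp(1e-6, 1 - 1e-6)                # decoder Bernoulli means
mu = torch.rand(4, 6)                                         # decoder Gaussian means

# Bernoulli case: log p_theta(x|z) equals the negative binary cross-entropy.
bernoulli_ll = torch.distributions.Bernoulli(probs=probs).log_prob(x_binary).sum()
neg_bce = -F.binary_cross_entropy(probs, x_binary, reduction="sum")
print(torch.allclose(bernoulli_ll, neg_bce))                  # True

# Gaussian case with fixed variance sigma^2 = 1: log p_theta(x|z) is a scaled
# negative MSE plus a constant, so maximizing it is equivalent to minimizing MSE.
gaussian_ll = torch.distributions.Normal(mu, 1.0).log_prob(x_real).sum()
const = -0.5 * x_real.numel() * torch.log(torch.tensor(2 * torch.pi))
neg_half_sse = -0.5 * F.mse_loss(mu, x_real, reduction="sum")
print(torch.allclose(gaussian_ll, neg_half_sse + const))      # True
```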
The ELBO comprises two main terms. The first is the expected reconstruction log-likelihood, which pushes the model to accurately reconstruct data. The second is a KL divergence term that regularizes the latent space by encouraging the approximate posterior qϕ(z∣x) to be close to a predefined prior p(z).
The KL Divergence Term in Practice
A common choice for both the prior p(z) and the approximate posterior qϕ(z∣x) is a multivariate Gaussian distribution.
Let p(z)=N(z∣0,I), a standard Gaussian with zero mean and identity covariance matrix.
Let the approximate posterior qϕ(z∣x) also be a Gaussian, but with mean μϕ(x) and a diagonal covariance matrix diag(σϕ,1²(x),...,σϕ,J²(x)), where J is the dimensionality of the latent space. The encoder network will output the parameters μϕ(x) and log(σϕ²(x)) (or σϕ(x) directly) for each input x.
For these choices, the KL divergence DKL(qϕ(z∣x)∣∣p(z)) has a convenient analytical solution:
DKL(qϕ(z∣x)∣∣p(z)) = (1/2) Σj ( μϕ,j(x)² + σϕ,j(x)² − log σϕ,j(x)² − 1 ), where the sum runs over the J latent dimensions.
This closed-form expression can be directly incorporated into the VAE's loss function and optimized via gradient descent. The reparameterization trick, which we will discuss in the next section, is essential for backpropagating gradients through the sampling process involved in the expectation Eqϕ(z∣x).
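As a rough sketch, assuming the diagonal-Gaussian encoder and standard-normal prior described above, the closed-form KL term and a negative-ELBO loss with a Bernoulli decoder might be computed as follows. The function names are illustrative, and the closed form is cross-checked against torch.distributions.kl_divergence.

```python
import torch
import torch.nn.functional as F

def gaussian_kl_to_standard_normal(mu, logvar):
    """Closed-form D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1)

def negative_elbo(x, x_recon_probs, mu, logvar):
    """Negative ELBO for a Bernoulli decoder: reconstruction BCE plus the KL regularizer."""
    recon = F.binary_cross_entropy(x_recon_probs, x, reduction="none").sum(dim=-1)
    kl = gaussian_kl_to_standard_normal(mu, logvar)
    return (recon + kl).mean()  # average over the batch

# Cross-check the closed form against torch.distributions on random encoder outputs.
torch.manual_seed(0)
mu, logvar = torch.randn(8, 20), torch.randn(8, 20)
q = torch.distributions.Normal(mu, (0.5 * logvar).exp())                    # q_phi(z|x)
p = torch.distributions.Normal(torch.zeros_like(mu), torch.ones_like(mu))   # p(z)
kl_ref = torch.distributions.kl_divergence(q, p).sum(dim=-1)
print(torch.allclose(gaussian_kl_to_standard_normal(mu, logvar), kl_ref, atol=1e-5))  # True

# Example loss evaluation on made-up data and reconstructions.
x = torch.rand(8, 784)
x_recon = torch.rand(8, 784).clamp(1e-6, 1 - 1e-6)
print(negative_elbo(x, x_recon, mu, logvar))
```

In practice this negative ELBO is the quantity minimized by gradient descent, with the latent sample z obtained via the reparameterization trick discussed in the next section.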
In summary, the ELBO provides a tractable objective function for training VAEs. It elegantly balances the need for accurate data reconstruction with the need for a regularized, smooth latent space suitable for generation. By maximizing the ELBO, we are simultaneously improving our model of the data p(x) and refining our approximation qϕ(z∣x) to the true, intractable posterior p(z∣x). Understanding this formulation is foundational for comprehending how VAEs learn and for developing more advanced VAE architectures and techniques.