In the previous sections, we established that Variational Autoencoders (VAEs) approach generative modeling from a probabilistic perspective. Instead of mapping an input $x$ to a single point $z$ in the latent space, the VAE encoder, parameterized by $\phi$, outputs the parameters of a probability distribution $q_\phi(z \mid x)$ over the latent space. The decoder, parameterized by $\theta$, then defines a likelihood $p_\theta(x \mid z)$, representing the probability of generating $x$ from a given latent variable $z$.
Our ultimate goal in training a generative model is to maximize the probability (or likelihood) of observing the actual data $x$ under our model. This marginal likelihood is given by integrating over all possible latent variables $z$:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$

Here, $p(z)$ is a chosen prior distribution for the latent variables, often a standard multivariate Gaussian $\mathcal{N}(0, I)$. Unfortunately, computing this integral is generally intractable for complex models like the neural networks used as decoders, because it requires integrating over all possible values of $z$, which can be high-dimensional.
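To make this intractability concrete, the sketch below (a toy illustration, not part of a real VAE pipeline; the small decoder network and the Bernoulli likelihood are assumptions made just for this example) estimates $p_\theta(x)$ by brute-force Monte Carlo: sample many $z$ from the prior and average $p_\theta(x \mid z)$. For realistic decoders and high-dimensional latent spaces, almost no prior samples explain a given $x$, so this estimator has enormous variance, which is why a different strategy is needed.

```python
import math
import torch

# Toy decoder (hypothetical): maps a 2-D latent z to 784 Bernoulli pixel probabilities.
decoder = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 784), torch.nn.Sigmoid(),
)

def naive_log_marginal(x, num_samples=10_000, latent_dim=2):
    """Brute-force Monte Carlo estimate of log p(x) = log E_{z ~ p(z)}[p(x|z)]."""
    z = torch.randn(num_samples, latent_dim)       # z ~ N(0, I), the prior
    probs = decoder(z)                             # Bernoulli means for p(x|z)
    # Per-sample log p(x|z) under a Bernoulli likelihood, summed over pixels.
    log_px_given_z = (x * probs.log() + (1 - x) * (1 - probs).log()).sum(dim=1)
    # log of the sample mean, computed with logsumexp for numerical stability.
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(num_samples)

x = torch.randint(0, 2, (784,)).float()   # a toy binary "image"
print(naive_log_marginal(x))              # extremely high variance for real models
```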
Since we cannot directly optimize $p_\theta(x)$, VAEs use an approach derived from variational inference. The core idea is to introduce an approximate distribution $q_\phi(z \mid x)$ (our encoder) that aims to estimate the true but intractable posterior distribution $p_\theta(z \mid x) = p_\theta(x \mid z)\, p(z) / p_\theta(x)$. We want to make $q_\phi(z \mid x)$ as close as possible to $p_\theta(z \mid x)$.
A standard way to measure the "closeness" between two distributions is the Kullback-Leibler (KL) divergence. Let's examine the KL divergence between our approximate posterior $q_\phi(z \mid x)$ and the true posterior $p_\theta(z \mid x)$:
$$\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = \int q_\phi(z \mid x) \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\, dz$$

This represents the information lost when using $q_\phi$ to approximate $p_\theta$. We want to minimize this KL divergence. Let's first rewrite this divergence as an expectation under $q_\phi(z \mid x)$:
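As a quick numerical aside (the two Gaussians here are placeholders, not actual VAE posteriors), the snippet below approximates this $\int q \log(q/p)\, dz$ definition by averaging $\log q(z) - \log p(z)$ over samples $z \sim q$, and compares the result against PyTorch's closed-form value:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Two univariate Gaussians standing in for q_phi(z|x) and p_theta(z|x).
q = Normal(loc=0.5, scale=1.0)   # "approximate posterior"
p = Normal(loc=0.0, scale=1.0)   # "true posterior"

# Monte Carlo estimate of KL(q || p): average log q(z) - log p(z) over z ~ q.
z = q.sample((100_000,))
kl_mc = (q.log_prob(z) - p.log_prob(z)).mean()

# Closed-form value for comparison; both should be close to 0.125 here.
kl_exact = kl_divergence(q, p)
print(kl_mc.item(), kl_exact.item())
```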
$$\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(z \mid x)\big]$$

Using Bayes' theorem, $p_\theta(z \mid x) = \dfrac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}$, we can substitute $\log p_\theta(z \mid x) = \log p_\theta(x \mid z) + \log p(z) - \log p_\theta(x)$:
$$\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\Big[\log q_\phi(z \mid x) - \big(\log p_\theta(x \mid z) + \log p(z) - \log p_\theta(x)\big)\Big]$$

Rearranging the terms inside the expectation:
$$\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p(z) - \log p_\theta(x \mid z) + \log p_\theta(x)\big]$$

Since $\log p_\theta(x)$ does not depend on $z$, the expectation $\mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p_\theta(x)]$ is simply $\log p_\theta(x)$. We can separate the terms:
$$\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p(z)\big] - \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \log p_\theta(x)$$

The first term is the KL divergence between the approximate posterior $q_\phi(z \mid x)$ and the prior $p(z)$: $\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))$. So, we have:
$$\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) - \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \log p_\theta(x)$$

Now, let's rearrange this equation to isolate the intractable log-likelihood $\log p_\theta(x)$:
$$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) + \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$$

This equation is significant. It decomposes the log-likelihood of the data into three terms. Notice that the KL divergence $\mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))$ is always non-negative. Therefore, if we drop this term, the remaining expression forms a lower bound on the log-likelihood. This lower bound is called the Evidence Lower Bound (ELBO), often denoted $\mathcal{L}(\theta, \phi; x)$:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

Since $\mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) \ge 0$, we have:
$$\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x)$$

This relationship is fundamental to VAE training. Instead of maximizing the intractable $\log p_\theta(x)$ directly, we maximize its tractable lower bound, the ELBO $\mathcal{L}(\theta, \phi; x)$. Maximizing the ELBO serves two simultaneous objectives: it pushes the data log-likelihood $\log p_\theta(x)$ upward, and, because the gap between $\log p_\theta(x)$ and the ELBO is exactly $\mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))$, it also drives the approximate posterior closer to the true posterior.
Figure: Relationship between the total log likelihood, the ELBO, and the KL divergence gap. Maximizing the ELBO pushes it upward toward the total log likelihood, effectively minimizing the KL gap.
Let's analyze the two terms comprising the ELBO objective function:
Reconstruction Term: $\mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$. This term measures the expected log-likelihood of the original input data $x$ being generated by the decoder $p_\theta(x \mid z)$, where the latent code $z$ is sampled from the distribution proposed by the encoder $q_\phi(z \mid x)$. Maximizing this term encourages the decoder to learn to reconstruct the input accurately from its latent representation. In practice, for inputs like images, this often translates to minimizing a reconstruction loss like Mean Squared Error (MSE) or Binary Cross-Entropy (BCE), depending on the data distribution assumptions.
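In code, this term is usually estimated with a single sample $z \sim q_\phi(z \mid x)$ and computed as a standard reconstruction loss. The sketch below assumes a Bernoulli decoder that outputs logits for binary-valued pixels; the function name and shapes are illustrative, not from any particular library.

```python
import torch.nn.functional as F

def reconstruction_term(x, x_logits):
    """Single-sample estimate of -E_{z~q(z|x)}[log p(x|z)] for a Bernoulli decoder.

    x        : original inputs in [0, 1], shape (batch, 784)
    x_logits : decoder outputs (pre-sigmoid) for one z sampled from q(z|x)
    Returns the negative log-likelihood summed over pixels, i.e. the
    reconstruction loss that gradient descent minimizes.
    """
    # For a Bernoulli likelihood, BCE-with-logits is exactly -log p(x|z).
    return F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")

# For a Gaussian decoder with fixed variance, the analogous choice is MSE:
# F.mse_loss(x_recon, x, reduction="sum"), up to constant factors.
```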
Regularization Term (KL Divergence): $-\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))$. This term measures the KL divergence between the approximate posterior distribution $q_\phi(z \mid x)$ produced by the encoder for a given input $x$ and the prior distribution $p(z)$ (e.g., a standard Gaussian). Maximizing this term (or minimizing the positive KL divergence) encourages the encoder to produce distributions $q_\phi(z \mid x)$ that are close to the prior $p(z)$. This acts as a regularizer, forcing the latent space to adopt the structure of the prior (e.g., a smooth, centered Gaussian cloud). This regularization is what enables the VAE to generate novel samples by drawing $z$ directly from the prior $p(z)$ and feeding it to the decoder.
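For the common modeling choice where the encoder outputs a diagonal Gaussian, $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$, and the prior is $p(z) = \mathcal{N}(0, I)$, this KL term has a well-known closed form (stated here without derivation):

$$\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)$$

where $d$ is the dimensionality of the latent space. Because it is available in closed form, this term requires no sampling and can be computed directly from the encoder outputs $\mu$ and $\log \sigma^2$.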
The VAE is trained by maximizing the ELBO with respect to both the encoder parameters ϕ and the decoder parameters θ. This objective creates a balance: the reconstruction term pushes for accurate data representation, while the KL divergence term enforces a regular structure on the latent space suitable for generation.
In practice, we often minimize the negative ELBO:
$$\mathrm{Loss}_{\text{VAE}} = -\mathcal{L}(\theta, \phi; x) = -\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

Minimizing this loss function using gradient descent (enabled by the reparameterization trick, discussed next) trains the VAE to find parameters $\phi$ and $\theta$ that yield both good reconstructions and a well-structured latent space aligned with the prior $p(z)$. Understanding the ELBO and its components is central to understanding how VAEs learn to generate data.
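Putting the two pieces together, here is a minimal sketch of the negative ELBO as it is typically implemented, assuming a Bernoulli decoder that outputs logits and an encoder that outputs `mu` and `log_var` for a diagonal Gaussian $q_\phi(z \mid x)$; all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_logits, mu, log_var):
    """Negative ELBO = reconstruction loss + KL(q(z|x) || N(0, I)), averaged over the batch.

    x           : inputs in [0, 1], shape (batch, 784)
    x_logits    : decoder outputs for z sampled (via reparameterization) from q(z|x)
    mu, log_var : encoder outputs defining q(z|x) = N(mu, diag(exp(log_var)))
    """
    # -E_{z~q}[log p(x|z)], single-sample estimate with a Bernoulli likelihood.
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims and the batch.
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1.0)
    return (recon + kl) / x.size(0)
```

This scalar is what the optimizer minimizes; sampling $z$ inside the encoder in a differentiable way is handled by the reparameterization trick discussed in the next section.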