In the previous sections, we established that Variational Autoencoders (VAEs) approach generative modeling from a probabilistic perspective. Instead of mapping an input $x$ to a single point in the latent space, the VAE encoder, parameterized by $\phi$, outputs the parameters of a probability distribution $q_\phi(z|x)$ over the latent space. The decoder, parameterized by $\theta$, then defines a likelihood $p_\theta(x|z)$, representing the probability of generating $x$ from a given latent variable $z$.
Our ultimate goal in training a generative model is to maximize the probability (or likelihood) of observing the actual data under our model. This marginal likelihood is given by integrating over all possible latent variables $z$:

$$p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$$
Here, $p(z)$ is a chosen prior distribution for the latent variables, often a standard multivariate Gaussian ($p(z) = \mathcal{N}(0, I)$). Unfortunately, computing this integral is generally intractable for complex models like the neural networks used as decoders, because it requires integrating over all possible values of $z$, which can be high-dimensional.
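To make this concrete, here is a minimal NumPy sketch for a toy one-dimensional model; the decoder function `decoder_mean` and the noise level `sigma_x` are made-up stand-ins, not part of any real VAE. Even a naive Monte Carlo estimate of $p_\theta(x)$ requires averaging the decoder likelihood over many prior samples, an approach that breaks down for the high-dimensional latent spaces and flexible decoders used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "decoder": maps a latent z to the mean of a Gaussian over x.
# In a real VAE this would be a neural network; here it is a stand-in.
def decoder_mean(z):
    return np.tanh(2.0 * z)

sigma_x = 0.5          # assumed decoder noise level
x_observed = 0.3       # a single observed data point

# Naive Monte Carlo estimate of p(x) = E_{z ~ p(z)}[ p(x | z) ] with p(z) = N(0, 1).
n_samples = 100_000
z = rng.standard_normal(n_samples)               # samples from the prior
likelihoods = (1.0 / np.sqrt(2 * np.pi * sigma_x**2)
               * np.exp(-(x_observed - decoder_mean(z))**2 / (2 * sigma_x**2)))
p_x_estimate = likelihoods.mean()

print(f"Monte Carlo estimate of p(x): {p_x_estimate:.4f}")
```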
Since we cannot directly optimize $\log p_\theta(x)$, VAEs use an approach derived from variational inference. The core idea is to introduce an approximate distribution $q_\phi(z|x)$ (our encoder) that aims to estimate the true but intractable posterior distribution $p_\theta(z|x)$. We want to make $q_\phi(z|x)$ as close as possible to $p_\theta(z|x)$.
A standard way to measure the "closeness" between two distributions is the Kullback-Leibler (KL) divergence. Let's examine the KL divergence between our approximate posterior $q_\phi(z|x)$ and the true posterior $p_\theta(z|x)$:

$$D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]$$
This represents the information lost when using $q_\phi(z|x)$ to approximate $p_\theta(z|x)$. We want to minimize this KL divergence. Let's expand the expectation by writing the log of the ratio as a difference of logarithms:

$$D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log q_\phi(z|x) - \log p_\theta(z|x)\right]$$
Using Bayes' theorem, $p_\theta(z|x) = \frac{p_\theta(x|z)\,p(z)}{p_\theta(x)}$, we can substitute for the true posterior:

$$D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log q_\phi(z|x) - \log \frac{p_\theta(x|z)\,p(z)}{p_\theta(x)}\right]$$
Rearranging the terms inside the expectation:

$$= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log q_\phi(z|x) - \log p(z) - \log p_\theta(x|z) + \log p_\theta(x)\right]$$
Since $\log p_\theta(x)$ does not depend on $z$, its expectation under $q_\phi(z|x)$ is simply $\log p_\theta(x)$. We can separate the terms:

$$= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p(z)}\right] - \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \log p_\theta(x)$$
The first term is the KL divergence between the approximate posterior and the prior $p(z)$: $D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$. So, we have:

$$D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) = D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) - \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \log p_\theta(x)$$
Now, let's rearrange this equation to isolate the intractable log-likelihood $\log p_\theta(x)$:

$$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) + D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)$$
This equation is significant. It decomposes the log-likelihood of the data into three terms. Notice that the last term, the KL divergence between the approximate and true posteriors, is always non-negative ($D_{KL} \geq 0$). Therefore, if we drop this term, the remaining expression forms a lower bound on the log-likelihood. This lower bound is called the Evidence Lower Bound (ELBO), often denoted as $\mathcal{L}(\theta, \phi; x)$:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$$
Since $D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) \geq 0$, we have:

$$\log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x)$$
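The decomposition $\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)$ can be checked numerically in a toy model where every quantity has a closed form. The sketch below assumes a linear-Gaussian model ($z \sim \mathcal{N}(0,1)$, $x|z \sim \mathcal{N}(z, \sigma^2)$) and an arbitrary Gaussian approximate posterior; these choices are not from the derivation above, they are simply the easiest setting in which the identity can be verified exactly.

```python
import numpy as np

def gaussian_kl(mu_q, sd_q, mu_p, sd_p):
    """KL( N(mu_q, sd_q^2) || N(mu_p, sd_p^2) ) in closed form."""
    return (np.log(sd_p / sd_q)
            + (sd_q**2 + (mu_q - mu_p)**2) / (2 * sd_p**2)
            - 0.5)

# Toy linear-Gaussian model: z ~ N(0, 1), x | z ~ N(z, sigma**2).
sigma = 0.8
x = 1.5

# Exact quantities (available only because this toy model is conjugate).
log_px = -0.5 * np.log(2 * np.pi * (1 + sigma**2)) - x**2 / (2 * (1 + sigma**2))
post_mu = x / (1 + sigma**2)                  # true posterior mean
post_sd = np.sqrt(sigma**2 / (1 + sigma**2))  # true posterior std

# An arbitrary approximate posterior q(z | x) = N(mu_q, sd_q^2).
mu_q, sd_q = 0.4, 0.6

# ELBO = E_q[log p(x|z)] - KL(q || p(z)); both terms are closed-form here.
expected_log_lik = (-0.5 * np.log(2 * np.pi * sigma**2)
                    - ((x - mu_q)**2 + sd_q**2) / (2 * sigma**2))
elbo = expected_log_lik - gaussian_kl(mu_q, sd_q, 0.0, 1.0)

kl_gap = gaussian_kl(mu_q, sd_q, post_mu, post_sd)

print(f"log p(x)        : {log_px:.6f}")
print(f"ELBO + KL gap   : {elbo + kl_gap:.6f}")   # matches log p(x)
print(f"ELBO <= log p(x): {elbo <= log_px}")
```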
This relationship is fundamental to VAE training. Instead of maximizing the intractable $\log p_\theta(x)$ directly, we maximize its tractable lower bound, the ELBO $\mathcal{L}(\theta, \phi; x)$. Maximizing the ELBO serves two simultaneous objectives: it pushes up the log-likelihood $\log p_\theta(x)$ our model assigns to the data, and it shrinks the gap $D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)$, making the encoder a better approximation of the true posterior.
Relationship between the total log likelihood, the ELBO, and the KL divergence gap. Maximizing the ELBO pushes it upwards towards the total log likelihood, effectively minimizing the KL gap.
Let's analyze the two terms comprising the ELBO objective function:
Reconstruction Term: $\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right]$. This term measures the expected log-likelihood of the original input data $x$ being generated by the decoder $p_\theta(x|z)$, where the latent code $z$ is sampled from the distribution $q_\phi(z|x)$ proposed by the encoder. Maximizing this term encourages the decoder to learn to reconstruct the input accurately from its latent representation. In practice, for inputs like images, this often translates to minimizing a reconstruction loss like Mean Squared Error (MSE) or Binary Cross-Entropy (BCE), depending on the data distribution assumptions.
Regularization Term (KL Divergence): $D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$. This term measures the KL divergence between the approximate posterior distribution $q_\phi(z|x)$ produced by the encoder for a given input $x$, and the prior distribution $p(z)$ (e.g., a standard Gaussian). Because it enters the ELBO with a negative sign, maximizing the ELBO means minimizing this KL divergence, which encourages the encoder to produce distributions that are close to the prior $p(z)$. This acts as a regularizer, forcing the latent space to adopt the structure of the prior (e.g., a smooth, centered Gaussian cloud). This regularization is what enables the VAE to generate novel samples by drawing $z$ directly from the prior $p(z)$ and feeding it to the decoder (see the sketch after this list).
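When the encoder outputs a diagonal Gaussian $q_\phi(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$ and the prior is $\mathcal{N}(0, I)$, this regularization term has the well-known closed form $D_{KL} = -\tfrac{1}{2}\sum_j\big(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\big)$. The NumPy sketch below uses made-up encoder outputs to compare that closed form against a Monte Carlo estimate taken straight from the definition $\mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x) - \log p(z)]$.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_closed_form(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

def gaussian_logpdf(z, mu, var):
    """Elementwise log-density of N(mu, var) evaluated at z."""
    return -0.5 * (np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

# Hypothetical encoder outputs for one input x (latent dimension 4).
mu = np.array([0.5, -0.2, 0.0, 1.1])
logvar = np.array([-0.1, 0.3, 0.0, -0.5])
var = np.exp(logvar)

# Monte Carlo estimate of KL = E_q[ log q(z|x) - log p(z) ].
z = mu + np.sqrt(var) * rng.standard_normal((200_000, mu.size))
mc_kl = np.mean(np.sum(gaussian_logpdf(z, mu, var) - gaussian_logpdf(z, 0.0, 1.0), axis=1))

print("closed form :", kl_closed_form(mu, logvar))
print("Monte Carlo :", mc_kl)   # agrees up to sampling noise
```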
The VAE is trained by maximizing the ELBO $\mathcal{L}(\theta, \phi; x)$ with respect to both the encoder parameters $\phi$ and the decoder parameters $\theta$. This objective creates a balance: the reconstruction term pushes for accurate data representation, while the KL divergence term enforces a regular structure on the latent space suitable for generation.
In practice, we often minimize the negative ELBO:

$$\mathcal{L}_{\text{VAE}}(\theta, \phi; x) = -\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$$
Minimizing this loss function using gradient descent (enabled by the reparameterization trick, discussed next) trains the VAE to find parameters $\theta$ and $\phi$ that yield both good reconstructions and a well-structured latent space aligned with the prior $p(z)$. Understanding the ELBO and its components is central to understanding how VAEs learn to generate data.
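As a reference point, here is a minimal sketch of that negative-ELBO loss as it is often written in PyTorch, assuming a Bernoulli decoder (so the reconstruction term becomes binary cross-entropy) and a diagonal-Gaussian encoder that outputs `mu` and `logvar`; the function name and tensor shapes are illustrative, not taken from a particular codebase.

```python
import torch
import torch.nn.functional as F

def negative_elbo(recon_x, x, mu, logvar):
    """Negative ELBO for one batch, assuming a Bernoulli decoder
    (inputs scaled to [0, 1]) and a diagonal-Gaussian encoder.

    recon_x:     decoder output probabilities, shape (batch, features)
    x:           original inputs, shape (batch, features)
    mu, logvar:  encoder outputs, shape (batch, latent_dim)
    """
    # Reconstruction term: -E_q[ log p(x | z) ], summed over the batch.
    recon_loss = F.binary_cross_entropy(recon_x, x, reduction="sum")

    # Regularization term: closed-form KL( q(z|x) || N(0, I) ).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon_loss + kl
```

Swapping the binary cross-entropy for a mean-squared-error term corresponds (up to constants) to assuming a Gaussian decoder instead of a Bernoulli one.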