Variational Autoencoders (VAEs) are best understood through the framework of probabilistic Latent Variable Models (LVMs). This perspective shifts the focus from straightforward data compression, the primary function of standard autoencoders, to explicitly modeling the probability distribution underlying the data.
At their core, LVMs assume that the high-dimensional data we observe, let's call it $x$, is generated by some underlying, unobserved (latent) variables, denoted by $z$. These latent variables typically live in a lower-dimensional space and capture the essential factors of variation within the data.
The fundamental idea is to define a joint probability distribution over both observed and latent variables, $p(x, z)$. We can factorize this joint distribution in a generative manner:

$$p(x, z) = p(x \mid z)\, p(z)$$
Here:

- $p(z)$ is the prior distribution over the latent variables, typically chosen to be simple, such as a standard Gaussian $\mathcal{N}(0, I)$.
- $p(x \mid z)$ is the conditional likelihood of the data given the latent variables, describing how observations are generated from latent codes. In a VAE, this distribution is modeled by the decoder network.
The ultimate goal in generative modeling is often to model the distribution of the observed data, $p(x)$. Using the rules of probability, we can obtain this by marginalizing out the latent variables:

$$p(x) = \int p(x \mid z)\, p(z)\, dz$$
If we could perfectly model $p(x \mid z)$ and easily compute this integral, we would have a powerful generative model capable of evaluating the likelihood of new data points and generating samples by first sampling $z \sim p(z)$ and then sampling $x \sim p(x \mid z)$.
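To make this two-step generative process concrete, here is a minimal NumPy sketch. The linear `decode` function and unit-variance Gaussian likelihood are toy stand-ins for the neural network decoder used in an actual VAE, not part of the framework itself.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 5

# Hypothetical decoder: maps a latent code z to the mean of p(x | z).
# In a VAE this would be a neural network; here it is a fixed linear map.
W = rng.normal(size=(data_dim, latent_dim))

def decode(z):
    return W @ z  # mean of a Gaussian p(x | z) with unit variance

# Ancestral sampling: first z ~ p(z), then x ~ p(x | z)
z = rng.standard_normal(latent_dim)             # prior p(z) = N(0, I)
x = decode(z) + rng.standard_normal(data_dim)   # likelihood p(x | z) = N(decode(z), I)
print("latent sample:", z)
print("generated data point:", x)
```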
While the generative process (sampling $z \sim p(z)$, then $x \sim p(x \mid z)$) described above is straightforward, working with LVMs presents a significant challenge: inference. Given an observed data point $x$, how do we determine the corresponding latent representation $z$ that likely generated it? This involves computing the posterior distribution $p(z \mid x)$. Using Bayes' theorem:

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}$$
Here lies the problem: calculating the denominator, $p(x)$ (the marginal likelihood or evidence), requires integrating over all possible latent variables $z$. For complex models, such as those using deep neural networks for $p(x \mid z)$ and continuous latent variables $z$, this integral is usually intractable. Consequently, computing the true posterior $p(z \mid x)$ is also intractable.
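One way to feel the difficulty is to estimate the integral naively with Monte Carlo samples from the prior, as in the sketch below (using SciPy and the same toy linear decoder as before; this estimator is only for illustration, it is not how VAEs are trained). Prior samples of $z$ rarely explain a given $x$ well, so the estimate has very high variance once the latent space is even moderately high-dimensional, and the problem gets worse as the decoder becomes a deep network.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 5
W = rng.normal(size=(data_dim, latent_dim))  # same toy linear decoder as above

def log_likelihood(x, z):
    # log p(x | z) for a unit-variance Gaussian centered at the decoded mean
    return multivariate_normal.logpdf(x, mean=W @ z, cov=np.eye(data_dim))

# Naive Monte Carlo estimate: p(x) ~ (1/N) * sum_i p(x | z_i), with z_i ~ p(z)
x = rng.standard_normal(data_dim)                 # some observed data point
zs = rng.standard_normal((10_000, latent_dim))    # N samples from the prior
log_p = np.array([log_likelihood(x, z) for z in zs])
p_x_estimate = np.exp(log_p).mean()
print("Monte Carlo estimate of p(x):", p_x_estimate)
```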
This is where the "Variational" aspect of VAEs becomes essential. Since we cannot compute the true posterior $p(z \mid x)$, we instead approximate it using a simpler, tractable distribution $q(z \mid x)$. This approximate posterior is also modeled by a neural network, the encoder network in the VAE. The encoder takes a data point $x$ as input and outputs the parameters of the distribution (e.g., the mean and variance if $q(z \mid x)$ is chosen to be Gaussian). Let's denote the parameters of the encoder network by $\phi$, so we write $q_\phi(z \mid x)$. Similarly, let's denote the parameters of the decoder network (modeling $p(x \mid z)$) by $\theta$, writing $p_\theta(x \mid z)$.
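As a sketch of what such an encoder looks like in practice, the following assumes PyTorch and a small fully connected network; the class name `VAEEncoder`, the layer sizes, and the 784-dimensional input are illustrative choices, not fixed by the framework.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Maps x to the parameters (mean, log-variance) of a Gaussian q_phi(z | x)."""
    def __init__(self, data_dim, hidden_dim, latent_dim):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(data_dim, hidden_dim), nn.ReLU())
        self.mean = nn.Linear(hidden_dim, latent_dim)       # mu_phi(x)
        self.log_var = nn.Linear(hidden_dim, latent_dim)    # log sigma_phi(x)^2

    def forward(self, x):
        h = self.hidden(x)
        return self.mean(h), self.log_var(h)

# Example: encode a batch of 4 data points into 2-dimensional Gaussian parameters
encoder = VAEEncoder(data_dim=784, hidden_dim=256, latent_dim=2)
mu, log_var = encoder(torch.randn(4, 784))
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # one sample from q_phi(z | x)
```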
The goal now becomes twofold:

1. Learn the decoder parameters $\theta$ so that the generative model $p_\theta(x \mid z)\, p(z)$ assigns high likelihood to the observed data and produces realistic samples.
2. Learn the encoder parameters $\phi$ so that the approximate posterior $q_\phi(z \mid x)$ is as close as possible to the true (intractable) posterior $p_\theta(z \mid x)$.
We need an objective function that allows us to jointly optimize the parameters $\theta$ and $\phi$. Variational inference provides this objective by maximizing a quantity called the Evidence Lower Bound (ELBO). As we will explore in detail in the section "Deriving the Evidence Lower Bound (ELBO)", maximizing the ELBO is equivalent to minimizing the KL divergence between the approximate posterior $q_\phi(z \mid x)$ and the true posterior $p_\theta(z \mid x)$, while also maximizing a lower bound on the likelihood of the data.
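The standard form of the ELBO, derived in the next section, is a reconstruction term plus a KL term. Below is a minimal sketch of that loss for a Gaussian $q_\phi(z \mid x)$ and a standard normal prior, continuing the PyTorch assumption from the encoder sketch; the Bernoulli (binary cross-entropy) reconstruction term is one common modeling choice, not the only one.

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, log_var):
    """Negative ELBO for a Bernoulli decoder and a N(0, I) prior.

    x           : original data, with values in [0, 1]
    x_recon     : decoder output (probabilities) for a z sampled from q_phi(z | x)
    mu, log_var : parameters of the Gaussian q_phi(z | x) produced by the encoder
    """
    # Reconstruction term: -E_q[log p_theta(x | z)], estimated with one sample of z
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")

    # KL(q_phi(z | x) || p(z)), in closed form for a diagonal Gaussian vs. N(0, I)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    return recon + kl  # minimizing this quantity maximizes the ELBO
```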
In essence, the VAE framework elegantly combines:

- A latent variable generative model: a prior $p(z)$ over latent codes and a decoder $p_\theta(x \mid z)$ that maps them to data.
- Variational inference: an encoder $q_\phi(z \mid x)$ that approximates the intractable true posterior $p_\theta(z \mid x)$.
- A single objective, the ELBO, that allows both networks to be trained jointly by gradient-based optimization.
By framing the VAE as an LVM trained with variational inference, we understand why it learns a structured latent space suitable for generation. The objective function explicitly encourages the encoder to produce distributions that are close to the prior $p(z)$ on average (via a KL divergence term in the ELBO), while simultaneously ensuring that samples from these distributions can be decoded back into realistic data points (via a reconstruction term in the ELBO). This probabilistic approach is the foundation of the VAE's generative capabilities and distinguishes it significantly from standard autoencoders.
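Putting the pieces together, the sketch below (again assuming PyTorch; the tiny architecture, dimensions, and the name `TinyVAE` are illustrative) shows how the two ELBO terms described above appear in a single training step: a KL term pulling $q_\phi(z \mid x)$ toward the prior and a reconstruction term decoding samples back into data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE with a Gaussian q_phi(z | x) and a Bernoulli decoder."""
    def __init__(self, data_dim=784, hidden_dim=256, latent_dim=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(data_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.log_var = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, data_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # sample from q_phi(z | x)
        return self.dec(z), mu, log_var

vae = TinyVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)

x = torch.rand(32, 784)                      # stand-in batch of data with values in [0, 1]
x_recon, mu, log_var = vae(x)
recon = F.binary_cross_entropy(x_recon, x, reduction="sum")      # reconstruction term
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())   # KL term toward p(z)
loss = recon + kl                            # negative ELBO
opt.zero_grad()
loss.backward()
opt.step()
```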