To truly understand Variational Autoencoders (VAEs), it's helpful to view them through the lens of probabilistic Latent Variable Models (LVMs). This perspective shifts our focus from simple data compression, as seen in standard autoencoders, towards modeling the underlying probability distribution of the data itself.
At their core, LVMs assume that the high-dimensional data we observe, let's call it x, is generated by some underlying, unobserved (latent) variables, denoted by z. These latent variables typically live in a lower-dimensional space and capture the essential factors of variation within the data.
The fundamental idea is to define a joint probability distribution over both observed and latent variables, p(x,z). We can factorize this joint distribution in a generative manner:
p(x,z) = p(x∣z) p(z)

Here:
- p(z) is the prior distribution over the latent variables, typically a simple distribution such as a standard normal.
- p(x∣z) is the conditional distribution (the likelihood) describing how an observation x is generated from a latent code z; in a VAE this is modeled by the decoder network.
The ultimate goal in generative modeling is often to model the distribution of the observed data, p(x). Using the rules of probability, we can obtain this by marginalizing out the latent variables:
p(x) = ∫ p(x,z) dz = ∫ p(x∣z) p(z) dz

If we could perfectly model p(x∣z) and easily compute this integral, we would have a powerful generative model capable of evaluating the likelihood of new data points and generating samples by first sampling z∼p(z) and then sampling x∼p(x∣z).
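To make this generative process concrete, here is a minimal sketch of ancestral sampling in PyTorch, assuming a standard normal prior p(z) and a small, hypothetical decoder network standing in for p(x∣z); the dimensions, layer sizes, and names (LATENT_DIM, DATA_DIM, Decoder) are illustrative choices rather than anything prescribed by the theory.

```python
import torch
from torch import nn

# A minimal sketch of the generative direction z -> x. It assumes a standard
# normal prior p(z) and a small neural decoder standing in for p(x | z) with
# unit variance; LATENT_DIM, DATA_DIM, and the layer sizes are illustrative.
LATENT_DIM, DATA_DIM = 2, 784

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 128), nn.ReLU(),
            nn.Linear(128, DATA_DIM),
        )

    def forward(self, z):
        # Returns the mean of a Gaussian p(x | z) with fixed unit variance.
        return self.net(z)

decoder = Decoder()

# Ancestral sampling: first z ~ p(z), then x ~ p(x | z).
z = torch.randn(16, LATENT_DIM)                        # z ~ N(0, I)
x_mean = decoder(z)                                    # parameters of p(x | z)
x = torch.distributions.Normal(x_mean, 1.0).sample()   # x ~ p(x | z)
```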
While the generative process (z→x) described above is conceptually straightforward, working with LVMs presents a significant challenge: inference. Given an observed data point x, how do we determine the corresponding latent representation z that likely generated it? This involves computing the posterior distribution p(z∣x). Using Bayes' theorem:
p(z∣x) = p(x∣z) p(z) / p(x) = p(x∣z) p(z) / ∫ p(x∣z′) p(z′) dz′

Here lies the problem: calculating the denominator, p(x) (the marginal likelihood or evidence), requires integrating over all possible latent variables z′. For complex models, such as those using deep neural networks for p(x∣z) with continuous latent variables z, this integral is usually intractable. Consequently, computing the true posterior p(z∣x) is also intractable.
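To see the difficulty concretely, the sketch below (continuing from the hypothetical decoder above) estimates p(x) by naive Monte Carlo over the prior, p(x) ≈ (1/S) Σ_s p(x∣z_s) with z_s ∼ p(z). In realistic latent spaces this estimator has enormous variance because almost no prior samples explain a given x well, which is one practical way to see why evaluating p(x), and hence the posterior, is effectively intractable.

```python
import math
import torch

# Continuing the sketch above: a naive Monte Carlo estimate of
# p(x) = E_{z ~ p(z)}[p(x | z)], computed in log space for stability.
# This only illustrates the marginalization; in realistic models the
# estimator needs an impractical number of samples to be accurate.
def naive_log_marginal(x, decoder, num_samples=1000):
    z = torch.randn(num_samples, LATENT_DIM)                 # z_s ~ p(z)
    x_mean = decoder(z)                                      # mean of p(x | z_s)
    log_px_given_z = torch.distributions.Normal(x_mean, 1.0).log_prob(x).sum(dim=-1)
    # log p(x) ~= log( (1/S) * sum_s p(x | z_s) )
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(num_samples)

x_observed = torch.randn(DATA_DIM)      # a stand-in "observation"
print(naive_log_marginal(x_observed, decoder))
```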
This is where the "Variational" aspect of VAEs becomes essential. Since we cannot compute the true posterior p(z∣x), we instead approximate it using a simpler, tractable distribution q(z∣x). This approximate posterior, q(z∣x), is also modeled by a neural network – the encoder network in the VAE. The encoder takes a data point x as input and outputs the parameters of the distribution q(z∣x) (e.g., the mean and variance if q is chosen to be Gaussian). Let's denote the parameters of the encoder network by ϕ, so we write qϕ(z∣x). Similarly, let's denote the parameters of the decoder network (modeling p(x∣z)) by θ, writing pθ(x∣z).
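For concreteness, here is a minimal sketch of such a Gaussian encoder, reusing the illustrative dimensions from the earlier decoder sketch; the layer sizes and names are again assumptions for illustration only.

```python
import torch
from torch import nn

# A minimal sketch of a Gaussian encoder for q_phi(z | x), reusing the
# illustrative LATENT_DIM and DATA_DIM from above. The network maps x to the
# mean and log-variance that parameterize q_phi(z | x) = N(mu(x), diag(sigma(x)^2)).
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(DATA_DIM, 128), nn.ReLU())
        self.mu = nn.Linear(128, LATENT_DIM)       # mean of q_phi(z | x)
        self.log_var = nn.Linear(128, LATENT_DIM)  # log-variance of q_phi(z | x)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

encoder = Encoder()
x = torch.randn(16, DATA_DIM)      # a batch of stand-in observations
mu, log_var = encoder(x)           # parameters of q_phi(z | x) for each x
```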
The goal now becomes twofold: learn the decoder parameters θ so that the generative model, the prior p(z) together with pθ(x∣z), assigns high probability to the observed data, and learn the encoder parameters ϕ so that qϕ(z∣x) approximates the true, intractable posterior pθ(z∣x) as closely as possible.
We need an objective function that allows us to jointly optimize the parameters ϕ and θ. Variational inference provides this objective by maximizing a quantity called the Evidence Lower Bound (ELBO). As we will explore in detail in the section "Deriving the Evidence Lower Bound (ELBO)", maximizing the ELBO is equivalent to minimizing the KL divergence between the approximate posterior qϕ(z∣x) and the true posterior pθ(z∣x), while also maximizing the likelihood of the data.
In essence, the VAE framework elegantly combines a probabilistic latent variable model, in which the prior p(z) and the decoder pθ(x∣z) define the generative process, with variational inference, in which the encoder network qϕ(z∣x) approximates the intractable posterior. Both sets of parameters, θ and ϕ, are trained jointly by maximizing the ELBO.
By framing the VAE as an LVM trained using variational inference, we can see why it learns a structured latent space suitable for generation. The objective function explicitly encourages the encoder to produce distributions qϕ(z∣x) that stay close to the prior p(z) on average (via a KL divergence term in the ELBO), while simultaneously ensuring that samples from these distributions can be decoded back into realistic data points (via a reconstruction term in the ELBO). This probabilistic approach is the foundation of the VAE's generative capabilities and is what distinguishes it from standard autoencoders.
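Putting these two terms together, the sketch below computes a per-example ELBO for the common Gaussian case, assuming the illustrative Encoder and Decoder from the earlier sketches: a single sample from qϕ(z∣x) feeds the reconstruction term, while the KL divergence to a standard normal prior is evaluated in closed form. It is a sketch under those assumptions, not a full training loop.

```python
import torch

# A sketch of a per-example ELBO for the Gaussian case, assuming the
# illustrative Encoder and Decoder defined above (unit-variance Gaussian
# p_theta(x | z) and a standard normal prior p(z)). These are common but
# not mandatory modeling choices.
def elbo(x, encoder, decoder):
    mu, log_var = encoder(x)                      # parameters of q_phi(z | x)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)          # a sample z ~ q_phi(z | x)

    # Reconstruction term: E_q[log p_theta(x | z)], estimated with one sample.
    x_mean = decoder(z)
    recon = torch.distributions.Normal(x_mean, 1.0).log_prob(x).sum(dim=-1)

    # KL(q_phi(z | x) || p(z)) in closed form for diagonal Gaussian q and N(0, I) prior.
    kl = 0.5 * (torch.exp(log_var) + mu**2 - 1.0 - log_var).sum(dim=-1)

    return recon - kl                             # maximize this (or minimize its negative)

x = torch.randn(16, DATA_DIM)
loss = -elbo(x, encoder, decoder).mean()          # a loss one could backpropagate during training
```

Maximizing this quantity, averaged over a dataset, is what simultaneously trains the decoder to reconstruct well and keeps the encoder's distributions close to the prior.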