Variational Autoencoders (VAEs) sit at the intersection of deep learning and Bayesian inference, offering an approach to unsupervised generative modeling. Unlike standard autoencoders, which learn a deterministic mapping to and from a lower-dimensional space, VAEs take a probabilistic perspective rooted in the principles of latent variable modeling and variational inference.
Think of a VAE as consisting of two main components connected through a probabilistic latent space: a probabilistic encoder, $q_\phi(z|x)$, which maps an input $x$ to a distribution over latent variables, and a probabilistic decoder, $p_\theta(x|z)$, which maps a latent sample back to a distribution over the data space.
The underlying idea is that our observed data $x$ is generated from some unobserved, lower-dimensional latent variables $z$. We assume a prior distribution $p(z)$ over these latent variables, often chosen to be a simple distribution like a standard multivariate Gaussian, $p(z) = \mathcal{N}(0, I)$. The decoder $p_\theta(x|z)$ defines how these latent variables generate the complex data we see.
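The generative side alone is already useful: draw $z$ from the prior and pass it through the decoder to produce new data. The sketch below illustrates this in PyTorch; the decoder here is an untrained placeholder whose architecture and layer sizes are purely illustrative.

```python
import torch
import torch.nn as nn

latent_dim = 2

# Placeholder decoder p_theta(x|z): maps a latent vector to 784 pixel probabilities.
# A real VAE would use a trained decoder; this one is illustrative only.
decoder = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)

z = torch.randn(16, latent_dim)   # sample latent variables from the prior p(z) = N(0, I)
x_generated = decoder(z)          # decode each z into a point in data space
```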
The central challenge is computing the true posterior distribution $p_\theta(z|x) = \frac{p_\theta(x|z)\,p(z)}{p_\theta(x)}$. The denominator, $p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz$, is the marginal likelihood or evidence, and it's generally intractable to compute because it involves integrating over all possible latent variables $z$.
This is precisely where Variational Inference comes in. We introduce the encoder network $q_\phi(z|x)$ as an approximation to the true, intractable posterior $p_\theta(z|x)$. As we learned in Chapter 3, VI reframes inference as an optimization problem. We aim to make $q_\phi(z|x)$ as close as possible to $p_\theta(z|x)$ by minimizing the KL divergence $D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big)$. Minimizing this divergence is equivalent to maximizing the Evidence Lower Bound (ELBO), $\mathcal{L}(\theta, \phi; x)$:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$
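To see why the two are equivalent, note that the log evidence decomposes into the ELBO plus the KL divergence to the true posterior:

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big)$$

Because $\log p_\theta(x)$ does not depend on $\phi$ and the KL term is non-negative, shrinking that KL term is the same as raising the ELBO, and the ELBO is always a lower bound on the log evidence.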
Maximizing the ELBO simultaneously trains both the encoder (parameters $\phi$) and the decoder (parameters $\theta$). Let's examine the two terms in the ELBO:
Reconstruction Term: $\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]$. This term measures how well the decoder can reconstruct the original input $x$ after encoding it into the latent distribution and then sampling a $z$ from that distribution. For Gaussian or Bernoulli likelihoods $p_\theta(x|z)$, this term often simplifies to a mean squared error or a binary cross-entropy loss between the input $x$ and the decoder's output $\hat{x}$. It encourages the VAE to learn meaningful latent representations that capture the essential information needed for reconstruction.
Regularization Term (KL Divergence): $D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$. This term acts as a regularizer. It measures the divergence between the approximate posterior $q_\phi(z|x)$ produced by the encoder and the prior $p(z)$ over the latent variables. By encouraging the encoded distributions to stay close to the prior (e.g., a standard Gaussian), it ensures that the latent space is well structured and prevents the encoder from scattering different inputs into distant, isolated regions that leave gaps the decoder never sees during training. This regularity is important for the VAE's generative capabilities. The sketch below shows how both terms combine into a single training loss.
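A minimal loss sketch, assuming PyTorch, a Bernoulli decoder (so the reconstruction term becomes a binary cross-entropy), and a diagonal Gaussian encoder with a standard normal prior (so the KL term has a closed form). The names `x_hat`, `mu`, and `logvar` are illustrative choices for the decoder output and encoder parameters.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    """Negative ELBO for a Bernoulli decoder and a diagonal Gaussian encoder.

    x:          original input with values in [0, 1]
    x_hat:      decoder output (reconstruction probabilities)
    mu, logvar: parameters of q_phi(z|x), each of shape (batch, latent_dim)
    """
    # Reconstruction term, estimated with a single latent sample: for a
    # Bernoulli likelihood, -E_q[log p_theta(x|z)] is the binary cross-entropy.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")

    # Closed-form KL( q_phi(z|x) || N(0, I) ) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    # Minimizing (recon + kl) is equivalent to maximizing the ELBO.
    return recon + kl
```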
A significant challenge arises when trying to optimize the ELBO using gradient ascent: how do we backpropagate gradients through the sampling step involved in the expectation $\mathbb{E}_{q_\phi(z|x)}[\cdot]$? The sampling operation itself is stochastic and non-differentiable.
The VAE solves this using the reparameterization trick. Instead of sampling $z$ directly from $q_\phi(z|x)$, we express $z$ as a deterministic function of the parameters $\phi$, the input $x$, and an independent random variable $\epsilon$ with a fixed distribution (e.g., $\epsilon \sim \mathcal{N}(0, I)$).
For example, if our encoder outputs the mean $\mu(x)$ and standard deviation $\sigma(x)$ of a Gaussian distribution, we can sample $z$ as:

$$z = \mu(x) + \sigma(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Here, $\odot$ denotes element-wise multiplication. Now the stochasticity comes from $\epsilon$, which does not depend on the model parameters. The path from $\mu(x)$ and $\sigma(x)$ to the loss is deterministic, allowing gradients to flow back through $\mu$ and $\sigma$ to update the encoder parameters $\phi$, and through the decoder to update $\theta$.
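In code, the trick is only a few lines. A minimal sketch, assuming PyTorch and the common convention that the encoder outputs the log-variance rather than the standard deviation for numerical stability:

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping a
    differentiable path from (mu, logvar) to z."""
    std = torch.exp(0.5 * logvar)   # sigma(x), recovered from the log-variance
    eps = torch.randn_like(std)     # noise that is independent of the model parameters
    return mu + eps * std           # element-wise: z is a deterministic function of mu, sigma, eps
```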
Figure: Flow diagram of the VAE architecture. The input $x$ is processed by the encoder network to produce the parameters $\mu(x)$ and $\sigma(x)$ defining the approximate posterior $q_\phi(z|x)$. A latent sample $z$ is generated using the reparameterization trick with noise $\epsilon$. The KL divergence term compares $q_\phi(z|x)$ with the prior $p(z)$. The sample $z$ is passed through the decoder network to produce the reconstruction $\hat{x}$, and the reconstruction loss measures the difference between $x$ and $\hat{x}$. Both loss terms contribute to the ELBO objective.
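Putting the diagram together, a compact VAE for flattened inputs might look like the following sketch. PyTorch is assumed, and the layer sizes, `input_dim`, and `latent_dim` are illustrative choices rather than anything prescribed above.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: encoder -> (mu, logvar) -> reparameterize -> decoder -> x_hat."""

    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)       # mu(x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)   # log sigma^2(x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),  # Bernoulli probabilities
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        x_hat = self.decoder(z)
        return x_hat, mu, logvar
```

With the earlier `vae_loss` sketch, training a batch reduces to computing `vae_loss(x, *model(x))` and taking an optimizer step on the result.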
VAEs are a foundation of probabilistic deep learning. While they don't typically place priors directly on the weights of the neural networks like BNNs do, they perform Bayesian inference over the latent variables $z$. They learn a deep generative model using variational inference for efficient training.
Important connections to BDL include the use of variational inference and the ELBO as a training objective, the reparameterization trick for backpropagating through stochastic sampling, and the Bayesian treatment of latent variables rather than network weights.
In summary, VAEs provide a scalable and effective method for learning deep generative models with a solid probabilistic foundation based on variational inference. They bridge the gap between complex deep learning architectures and principled Bayesian reasoning about latent structure in data.