Variational Autoencoders (VAEs) represent a fascinating intersection of deep learning and Bayesian inference, offering a powerful approach to unsupervised generative modeling. Unlike standard autoencoders which learn a deterministic mapping to and from a lower-dimensional space, VAEs embrace a probabilistic perspective, rooted in the principles of latent variable modeling and variational inference that we've discussed.
Think of a VAE as consisting of two main components connected through a probabilistic latent space: a probabilistic encoder, which maps each input to a distribution over latent variables, and a probabilistic decoder, which maps latent samples back to distributions over the data.
The underlying idea is that our observed data x is generated from some unobserved, lower-dimensional latent variables z. We assume a prior distribution p(z) over these latent variables, often chosen to be a simple distribution like a standard multivariate Gaussian N(0,I). The decoder pθ(x∣z) defines how these latent variables generate the complex data we see.
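Written as a generative story, each observation is assumed to arise from two steps: first draw a latent code from the prior, then decode it into data space:

```latex
z \sim p(z) = \mathcal{N}(0, I), \qquad x \sim p_\theta(x \mid z)
```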
The central challenge is computing the true posterior distribution pθ(z∣x)=pθ(x∣z)p(z)/p(x). The denominator, p(x)=∫pθ(x∣z)p(z)dz, is the marginal likelihood or evidence, and it's generally intractable to compute because it involves integrating over all possible latent variables z.
This is precisely where Variational Inference comes in. We introduce the encoder network qϕ(z∣x) as an approximation to the true, intractable posterior pθ(z∣x). As we learned in Chapter 3, VI reframes inference as an optimization problem. We aim to make qϕ(z∣x) as close as possible to pθ(z∣x) by minimizing the KL divergence DKL(qϕ(z∣x)∣∣pθ(z∣x)). Minimizing this divergence is equivalent to maximizing the Evidence Lower Bound (ELBO), L(θ,ϕ;x):
L(θ,ϕ;x) = Eqϕ(z∣x)[log pθ(x∣z)] − DKL(qϕ(z∣x) ∣∣ p(z))

Maximizing the ELBO L(θ,ϕ;x) simultaneously trains both the encoder (parameters ϕ) and the decoder (parameters θ). Let's examine the two terms in the ELBO:
Reconstruction Term: Eqϕ(z∣x)[logpθ(x∣z)]. This term measures how well the decoder can reconstruct the original input x after encoding it into the latent distribution qϕ(z∣x) and then sampling a z from that distribution. For Gaussian or Bernoulli likelihoods pθ(x∣z), this term often simplifies to a mean squared error or a binary cross-entropy loss between the input x and the decoder's output x^. It encourages the VAE to learn meaningful latent representations that capture the essential information needed for reconstruction.
Regularization Term (KL Divergence): DKL(qϕ(z∣x)∣∣p(z)). This term acts as a regularizer. It measures the divergence between the approximate posterior distribution qϕ(z∣x) produced by the encoder and the prior distribution p(z) over the latent variables. By encouraging the encoded distributions to stay close to the prior (e.g., a standard Gaussian), it keeps the latent space well-structured and prevents the encoder from scattering different inputs into distinct, isolated regions with empty space between them. (Posterior collapse is a different failure mode, in which qϕ(z∣x) ignores the input and simply matches the prior; it can occur when this term dominates the objective.) This regularity is crucial for the VAE's generative capabilities.
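Before turning to optimization, it is worth making explicit why maximizing the ELBO minimizes the divergence DKL(qϕ(z∣x)∣∣pθ(z∣x)) we actually care about. The log evidence decomposes exactly into the ELBO plus that divergence, and since log pθ(x) does not depend on ϕ, raising the ELBO must lower the divergence:

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\mathcal{L}(\theta,\phi;\,x)}
  \;+\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\middle\|\,p_\theta(z \mid x)\right)
```

Because the KL term is non-negative, L(θ,ϕ;x) ≤ log pθ(x), which is what makes it a lower bound on the evidence.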
A significant challenge arises when trying to optimize the ELBO using gradient ascent: how do we backpropagate gradients through the sampling step involved in the expectation Eqϕ(z∣x)[⋅]? The sampling operation itself is stochastic and non-differentiable.
The VAE solves this using the reparameterization trick. Instead of sampling z directly from qϕ(z∣x), we express z as a deterministic function of the parameters ϕ, the input x, and an independent random variable ϵ with a fixed distribution (e.g., ϵ∼N(0,I)).
For example, if our encoder qϕ(z∣x) outputs the mean μϕ(x) and standard deviation σϕ(x) of a Gaussian distribution, we can sample z as:
z = μϕ(x) + σϕ(x) ⊙ ϵ, where ϵ ∼ N(0,I)

Here, ⊙ denotes element-wise multiplication. The stochasticity now comes from ϵ, which does not depend on the model parameters ϕ. The path from ϕ and θ to the loss L is deterministic, allowing gradients to flow back through μϕ(x) and σϕ(x) to update the encoder parameters ϕ, and through the decoder to update θ.
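Putting these pieces together, the sketch below shows a minimal PyTorch VAE with the reparameterization trick and a single-sample estimate of the negative ELBO, using a Bernoulli (binary cross-entropy) reconstruction term and the closed-form KL divergence between the diagonal Gaussian posterior and the standard normal prior. The architecture, layer sizes, and the name TinyVAE are illustrative assumptions, not a prescribed design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE sketch: diagonal Gaussian encoder, Bernoulli decoder."""

    def __init__(self, x_dim=784, h_dim=256, z_dim=20):  # sizes are illustrative
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mu_phi(x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log sigma_phi(x)^2
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid(),  # Bernoulli parameters in [0, 1]
        )

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)   # sigma_phi(x)
        eps = torch.randn_like(std)     # eps ~ N(0, I), independent of phi
        return mu + std * eps           # z = mu + sigma ⊙ eps

    def forward(self, x):
        # x: batch of inputs flattened to (batch, x_dim), values assumed in [0, 1]
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = self.reparameterize(mu, logvar)
        x_hat = self.dec(z)

        # Reconstruction term: one-sample estimate of -E_q[log p(x|z)] (BCE for Bernoulli)
        recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
        # Regularization term: closed-form KL( N(mu, sigma^2) || N(0, I) )
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        neg_elbo = recon + kl           # minimizing this maximizes the ELBO
        return x_hat, neg_elbo
```

A single optimization step would then look like `x_hat, loss = model(x)` followed by `loss.backward()` and an optimizer step; gradients flow through μϕ(x) and σϕ(x) precisely because the noise ϵ is drawn outside the computation that depends on ϕ.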
Flow diagram illustrating the VAE architecture. Input x is processed by the encoder network to produce parameters μϕ(x) and σϕ(x) defining the approximate posterior qϕ(z∣x). A latent sample z is generated using the reparameterization trick with noise ϵ. The KL divergence term compares qϕ(z∣x) with the prior p(z). The sample z is passed through the decoder network to produce the reconstruction x̂. The reconstruction loss measures the difference between x and x̂. Both loss terms contribute to the ELBO objective.
VAEs are a cornerstone of probabilistic deep learning. While they don't typically place priors directly on the weights of the neural networks like BNNs do, they perform Bayesian inference over the latent variables z. They learn a deep generative model pθ(x∣z)p(z) using variational inference for efficient training.
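Once trained, sampling new data follows the generative story directly: draw z from the prior and pass it through the decoder. A brief usage sketch, continuing the hypothetical TinyVAE from the snippet above:

```python
import torch

# Generate new samples from p_theta(x|z) p(z) with a (hypothetically trained) TinyVAE
model = TinyVAE()
model.eval()
with torch.no_grad():
    z = torch.randn(16, 20)     # z ~ p(z) = N(0, I); 20 matches z_dim above
    samples = model.dec(z)      # the decoder maps prior samples to data space
```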
Key connections to BDL include the shared reliance on variational inference to make posterior approximation tractable, the use of neural networks to parameterize probability distributions, and the probabilistic treatment of latent variables rather than of network weights.
In summary, VAEs provide a scalable and effective method for learning deep generative models with a solid probabilistic foundation based on variational inference. They bridge the gap between complex deep learning architectures and principled Bayesian reasoning about latent structure in data.