As we established in the previous section, the Evidence Lower Bound (ELBO) serves as the objective function we maximize when training a Variational Autoencoder. The ELBO consists of two main components: the reconstruction likelihood and a Kullback-Leibler (KL) divergence term. Let's focus on the latter:
$$\mathcal{L}_{\text{ELBO}}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

The term we are analyzing here is DKL(qϕ(z∣x)∣∣p(z)). This KL divergence measures how much the distribution produced by the encoder, qϕ(z∣x), deviates from a chosen prior distribution over the latent variables, p(z).
Think of the KL divergence term as a regularizer. Its primary function is to impose a structure on the latent space. Without it, the encoder qϕ(z∣x) might learn to place the encodings for different inputs x in arbitrary, non-overlapping regions of the latent space. While this might make reconstruction easier (as each x gets a unique, easily decodable code z), it would create significant problems for generation. If we were to sample a z from the prior p(z) to generate a new data point, it might fall into one of the "empty" regions between encoded clusters, leading the decoder pθ(x∣z) to produce nonsensical output.
The KL divergence term forces the distributions qϕ(z∣x) for all input data points x to stay "close" to the prior distribution p(z). Typically, the prior p(z) is chosen to be a simple, standard multivariate Gaussian distribution with zero mean and unit variance, often denoted as N(0,I).
By minimizing DKL(qϕ(z∣x)∣∣p(z)), we encourage the encoder to center its distributions near the origin, keep their variances close to one, and produce distributions for different inputs that overlap rather than occupying isolated pockets of the latent space.
This regularization makes the latent space more suitable for generation. When we sample z∼p(z) (i.e., sample from a standard Gaussian), the sampled z is likely to be in a region of the latent space that the decoder has seen during training (because the encoded qϕ(z∣x) distributions were pushed towards p(z)). Consequently, the decoder can generate more coherent and meaningful data points.
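As a concrete illustration, the sketch below samples latent vectors from the standard Gaussian prior and pushes them through a decoder. The decoder architecture, layer sizes, and latent dimensionality are placeholder assumptions; the point is only the sampling step z ∼ N(0, I).

```python
import torch
import torch.nn as nn

latent_dim = 16                       # assumed latent dimensionality
decoder = nn.Sequential(              # stand-in for a trained decoder p_theta(x|z)
    nn.Linear(latent_dim, 128),
    nn.ReLU(),
    nn.Linear(128, 784),
    nn.Sigmoid(),
)

# Generation: draw z from the standard Gaussian prior p(z) = N(0, I) and decode.
# Because training pushed q_phi(z|x) towards p(z), such samples tend to land in
# regions of the latent space the decoder was effectively trained on.
z = torch.randn(8, latent_dim)        # a batch of 8 prior samples
x_generated = decoder(z)              # shape: (8, 784)
```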
Imagine encoding two different input data points, x1 and x2, into latent distributions qϕ(z∣x1) and qϕ(z∣x2). Without KL regularization (or with very low weight), these distributions might be sharply peaked and located far from the origin and each other. With KL regularization, they are pulled towards the standard Gaussian prior p(z).
Figure: Conceptual diagram showing how KL divergence regularization pulls the encoded distributions (q(z∣x) for different inputs) closer to the prior distribution p(z), promoting overlap and smoothness in the latent space.
In practice, the encoder qϕ(z∣x) is typically designed to output the parameters of a diagonal Gaussian distribution: a mean vector μϕ(x) and a diagonal covariance matrix represented by a variance vector σϕ2(x). So, qϕ(z∣x)=N(z;μϕ(x),diag(σϕ2(x))).
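A minimal sketch of this parameterization is shown below. The layer sizes are arbitrary assumptions, and the encoder predicts logσϕ2(x) rather than σϕ2(x) directly, a common convention (not stated above) that keeps the variance positive and training numerically stable.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps an input x to the parameters (mu, log sigma^2) of a diagonal Gaussian q_phi(z|x)."""

    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, latent_dim)       # mu_phi(x)
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)   # log sigma^2_phi(x)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu_head(h), self.logvar_head(h)
```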
If the prior p(z) is a standard Gaussian N(z;0,I), the KL divergence between qϕ(z∣x) and p(z) can be calculated analytically. For a d-dimensional latent space, the formula is:
$$D_{KL}\left(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\right) = \frac{1}{2} \sum_{j=1}^{d} \left(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\right)$$

Here, μj and σj2 are the j-th components of the mean vector μϕ(x) and variance vector σϕ2(x) produced by the encoder for a given input x. This analytical formula is convenient because it allows us to compute this part of the loss directly from the encoder's output, without needing to perform Monte Carlo estimation for the KL term itself.
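Assuming the encoder returns the mean and log-variance as in the sketch above, the analytical formula translates into a few tensor operations: sum over the d latent dimensions, then average over the batch.

```python
import torch

def kl_divergence(mu, logvar):
    """Analytical KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian.

    mu, logvar: tensors of shape (batch_size, d) holding mu_phi(x) and
    log sigma^2_phi(x). Returns the KL divergence averaged over the batch.
    """
    # 0.5 * sum_j (sigma_j^2 + mu_j^2 - 1 - log sigma_j^2)
    kl_per_example = 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=1)
    return kl_per_example.mean()

# Sanity check: a posterior that already matches the prior has zero KL.
mu = torch.zeros(4, 16)
logvar = torch.zeros(4, 16)            # log(1) = 0, i.e. unit variance
print(kl_divergence(mu, logvar))       # tensor(0.)
```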
Maximizing the ELBO involves maximizing the reconstruction likelihood while minimizing the KL divergence (note the negative sign in front of the KL term in the ELBO formula presented earlier; equivalently, the objective is often written as minimizing the negative ELBO, where the KL term appears with a positive sign). These two goals are often in tension: the reconstruction term benefits from latent codes that are highly informative about x, which pushes the encoder toward sharply peaked, widely separated distributions, while the KL term pulls every qϕ(z∣x) toward the same prior N(0,I), discouraging exactly that separation.
The training process finds a balance between these two objectives. If the KL divergence term dominates the loss too strongly (e.g., if its weight is too high), it can lead to a phenomenon called "posterior collapse," where the encoder effectively ignores the input x and outputs a distribution that matches the prior p(z). In this case, qϕ(z∣x) becomes independent of x, the latent codes z contain little information about the input, and the decoder essentially learns an average output, resulting in poor reconstruction and generation quality.
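To see how this balance plays out in a loss function, here is a hedged sketch of the negative ELBO with a hypothetical weight beta on the KL term (this weight is not part of the standard ELBO; beta = 1 recovers it, and setting it much larger is one way to provoke posterior collapse). A Bernoulli reconstruction likelihood is assumed purely for illustration.

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, logvar, beta=1.0):
    """Negative ELBO = reconstruction loss + beta * KL, averaged over the batch.

    x: input batch in [0, 1]; x_recon: decoder output interpreted as Bernoulli
    means in [0, 1]; mu, logvar: encoder outputs; beta: assumed KL weight.
    """
    # Reconstruction term: -E_q[log p_theta(x|z)] under a Bernoulli decoder.
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum") / x.size(0)
    # KL term from the analytical formula above.
    kl = 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=1).mean()
    return recon + beta * kl
```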
Understanding the role and behavior of the KL divergence term is fundamental to comprehending how VAEs learn structured latent spaces suitable for generative tasks. It acts as the bridge between encoding data and generating new samples from a learned probabilistic model.