As we discussed in the chapter introduction, the effectiveness of a VAE often comes down to the quality of its approximate posterior, $q_\phi(z \mid x)$. A widely adopted simplification in constructing this approximate posterior is the mean-field approximation. This approach assumes that the latent variables $z_i$ in the latent vector $z = (z_1, z_2, \dots, z_D)$ are mutually independent given the input $x$. Mathematically, this means the approximate posterior factorizes into a product of individual distributions:
$$q_\phi(z \mid x) = \prod_{i=1}^{D} q_\phi(z_i \mid x)$$

Typically, each $q_\phi(z_i \mid x)$ is modeled as a univariate Gaussian whose mean $\mu_i(x)$ and variance $\sigma_i^2(x)$ are output by the encoder network. This assumption is computationally convenient: it simplifies the calculation of the KL divergence term $D_{KL}(q_\phi(z \mid x) \,\|\, p(z))$ in the ELBO, especially when the prior $p(z)$ is also a factorized Gaussian (e.g., the standard normal $\mathcal{N}(0, I)$). The KL divergence then decomposes into a sum of KL divergences between univariate Gaussians, each of which has a closed-form solution.
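To make that convenience concrete, here is a minimal sketch of the closed-form computation in NumPy, assuming the encoder outputs a mean and log-variance per latent dimension; the function name and example values are illustrative, not from a specific library:

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Because both distributions factorize, the KL decomposes into a sum
    of univariate Gaussian KL terms, one per latent dimension:
        0.5 * (sigma_i^2 + mu_i^2 - 1 - log sigma_i^2)
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# Hypothetical encoder outputs for a single input x
mu = np.array([0.5, -0.3, 0.1])
log_var = np.array([-0.2, 0.1, 0.0])
print(kl_diag_gaussian_to_standard_normal(mu, log_var))  # a small positive scalar
```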
However, this simplification comes at a cost. The primary limitation of the mean-field approximation is that the true posterior $p_\theta(z \mid x)$ is often far more complex and does not factorize in this manner. The underlying generative factors of the data, which the latent variables $z$ aim to capture, can be inherently correlated. For instance, in an image of a face, pose and lighting might be correlated factors. If $z_1$ represents pose and $z_2$ represents lighting, it is quite plausible that $p_\theta(z_1, z_2 \mid x)$ exhibits strong dependencies.
By enforcing conditional independence, the mean-field $q_\phi(z \mid x)$ is restricted to a family of distributions that cannot capture any correlations between the latent variables. This has several important consequences:
Poor Fit to the True Posterior: If the true posterior $p_\theta(z \mid x)$ has significant correlations (i.e., its covariance matrix has non-zero off-diagonal elements), a factorized $q_\phi(z \mid x)$ (which, if Gaussian, implies a diagonal covariance matrix for $z$) will be a poor approximation. This mismatch means the ELBO, which we maximize, might be a looser lower bound on the true log-likelihood $\log p_\theta(x)$ than what could be achieved with a more expressive $q_\phi$.
Underestimation of Variances and Inaccurate Uncertainty: The mean-field approximation tends to underestimate the variance of the true posterior or, more generally, to miss the shape of its probability density. Because ELBO maximization is equivalent to minimizing the reverse KL divergence $D_{KL}(q_\phi \,\|\, p_\theta)$, which heavily penalizes placing probability mass where the true posterior has little, the fitted approximation is often too compact or "overconfident" in certain regions of the latent space, particularly when the true posterior is elongated along directions not aligned with the coordinate axes (see the numerical sketch after this list).
Impact on Representation Quality: If the model is forced to represent correlated true factors with independent latent variables, the learned representations might be less meaningful or harder to interpret. The VAE might struggle to disentangle these correlated factors effectively if its inference mechanism cannot even model their joint posterior distribution accurately.
Suboptimal Generative Performance: While VAEs are trained to maximize the ELBO, the ultimate goal is often to generate high-quality samples. A poor posterior approximation can indirectly affect the decoder. If the encoder consistently provides a misleading or overly simplified posterior representation to the decoder during training, the decoder might not learn the true data manifold as effectively.
Contribution to Posterior Collapse: While not the sole cause, a very simple approximate posterior, like the mean-field Gaussian, can sometimes make it easier for the KL divergence term $D_{KL}(q_\phi(z \mid x) \,\|\, p(z))$ to be minimized by making $q_\phi(z \mid x)$ nearly identical to the prior $p(z)$. In such cases, the latent variables carry little information about the input $x$, an issue known as posterior collapse. A more flexible $q_\phi(z \mid x)$ might be better at encoding information from $x$ while still matching the prior to a reasonable degree.
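For Gaussians, the variance underestimation described above can be made exact: the factorized Gaussian that minimizes the reverse KL $D_{KL}(q \,\|\, p)$ against a correlated Gaussian $p$ sets each of its precisions to the corresponding diagonal entry $\Lambda_{ii}$ of $p$'s precision matrix, so its variances are $1/\Lambda_{ii}$, smaller than the true marginal variances. The following sketch, assuming a 2D posterior with correlation $\rho = 0.9$, verifies this numerically and computes the KL gap that no amount of ELBO optimization can close within the mean-field family:

```python
import numpy as np

def gaussian_kl(mu_q, cov_q, mu_p, cov_p):
    """KL( N(mu_q, cov_q) || N(mu_p, cov_p) ) for multivariate Gaussians."""
    k = mu_q.shape[0]
    cov_p_inv = np.linalg.inv(cov_p)
    diff = mu_p - mu_q
    return 0.5 * (np.trace(cov_p_inv @ cov_q) + diff @ cov_p_inv @ diff - k
                  + np.log(np.linalg.det(cov_p) / np.linalg.det(cov_q)))

rho = 0.9                                      # correlation in the "true" posterior
cov_true = np.array([[1.0, rho], [rho, 1.0]])
mu = np.zeros(2)

# Reverse-KL-optimal mean-field fit: variances = 1 / diag(precision) = 1 - rho^2
precision = np.linalg.inv(cov_true)
cov_mf = np.diag(1.0 / np.diag(precision))

print(np.diag(cov_mf))                         # [0.19 0.19], vs. true marginals of 1.0
print(gaussian_kl(mu, cov_mf, mu, cov_true))   # irreducible gap, ~0.83 nats
```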
To visualize the discrepancy, imagine a true 2D posterior where $z_1$ and $z_2$ are highly correlated, forming an elliptical distribution tilted with respect to the axes. A mean-field approximation would try to fit this with an axis-aligned ellipse (or a circle if the variances are assumed equal).
Figure: A hypothetical true posterior $p_\theta(z \mid x)$ with strong correlation between $z_1$ and $z_2$ (blue, a tilted elliptical cloud) compared with a mean-field approximation $q_\phi(z \mid x)$ (red, a more circular, axis-aligned cloud). The mean-field approximation fails to capture the covariance structure; the shaded regions loosely indicate high-density areas.
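A comparison like the one in the figure can be reproduced with a short sampling sketch; the correlation value, sample count, and mean-field variances (taken from the reverse-KL-optimal fit above) are illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
rho = 0.9
cov_true = np.array([[1.0, rho], [rho, 1.0]])    # tilted elliptical cloud
cov_mf = np.diag([1.0 - rho**2, 1.0 - rho**2])   # axis-aligned mean-field fit

z_true = rng.multivariate_normal(np.zeros(2), cov_true, size=2000)
z_mf = rng.multivariate_normal(np.zeros(2), cov_mf, size=2000)

plt.scatter(z_true[:, 0], z_true[:, 1], s=4, alpha=0.3, label="true posterior")
plt.scatter(z_mf[:, 0], z_mf[:, 1], s=4, alpha=0.3, label="mean-field approx.")
plt.xlabel("$z_1$")
plt.ylabel("$z_2$")
plt.axis("equal")
plt.legend()
plt.show()
```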
The term "mean-field" itself originates from physics, where complex interacting systems are simplified by assuming each component interacts only with an average effect of all others, ignoring specific pairwise interactions. In VAEs, this translates to assuming independence among the latent variables zi in the approximate posterior, overlooking their potential direct relationships.
Understanding these limitations is important because it motivates the development and use of more sophisticated inference techniques. The subsequent sections in this chapter introduce methods designed to go beyond the mean-field assumption, aiming for more expressive approximate posteriors that better capture the complexities of the true posterior $p_\theta(z \mid x)$, leading to tighter ELBOs and potentially improved VAE performance.