The mean-field approximation, where qϕ(z∣x)=∏iqϕ(zi∣x), simplifies VAE training considerably by assuming independence among the latent variables zi given the input x. However, this assumption is often too restrictive. The true posterior pθ(z∣x) may exhibit complex dependencies between latent dimensions, and forcing qϕ(z∣x) to be factorized prevents it from representing these correlations. This mismatch between the variational family and the true posterior, sometimes called the approximation gap (distinct from the amortization gap introduced by sharing one inference network across all data points), can limit the VAE's ability to learn rich representations and generate high-fidelity data. Structured variational inference offers a way to move beyond this limitation by explicitly modeling dependencies within the approximate posterior.
Structured variational inference aims to enrich the family of distributions qϕ(z∣x) by allowing for correlations among the latent variables z1,…,zD. Instead of a fully factorized form, qϕ(z∣x) is designed to capture some statistical structure. This allows qϕ(z∣x) to be a more accurate approximation of the true posterior pθ(z∣x), potentially leading to a tighter Evidence Lower Bound (ELBO) and improved model performance.
The core idea is to define qϕ(z∣x) using a model that can represent dependencies. Common approaches include autoregressive models and normalizing flows, both of which allow for flexible and expressive posterior distributions.
Comparison of mean-field and structured (autoregressive) approximate posteriors. In the mean-field case, latent variables zi are conditionally independent given x. In the structured autoregressive case, zi depends on preceding zj (for j<i) and x.
One powerful way to introduce structure is to model qϕ(z∣x) autoregressively. This means the distribution over the latent vector z=(z1,…,zD) is factorized as a product of conditional distributions:
$$
q_\phi(z \mid x) = q_\phi(z_1 \mid x) \prod_{j=2}^{D} q_\phi(z_j \mid z_{<j}, x)
$$

Here, z<j denotes (z1,…,zj−1). Each conditional distribution qϕ(zj∣z<j,x) can be parameterized by a neural network that takes x and the previously sampled latent variables z1,…,zj−1 as input. For instance, if each qϕ(zj∣z<j,x) is a Gaussian, its mean μj and standard deviation σj would be functions of x and z<j:
$$
\mu_j, \log \sigma_j = f_j(x, z_{<j}; \phi_j)
$$

This structure allows qϕ(z∣x) to capture arbitrary dependencies, provided the conditioning networks fj are sufficiently expressive. Sampling from such a model is sequential: first sample z1∼qϕ(z1∣x), then z2∼qϕ(z2∣z1,x), and so on. While evaluating the density qϕ(z∣x) is straightforward (a product of D terms), the sequential sampling can be slow if D is large.
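The sketch below illustrates this construction in PyTorch: per-dimension Gaussian conditionals whose parameters depend on x and the previously sampled latents. It is a minimal example rather than a recommended architecture; the class name, layer sizes, and the one-network-per-dimension design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutoregressiveGaussianPosterior(nn.Module):
    """Sketch of q_phi(z|x) = q(z_1|x) * prod_{j>1} q(z_j | z_<j, x) with Gaussian conditionals."""
    def __init__(self, x_dim, z_dim, hidden=128):
        super().__init__()
        # One small conditional network per dimension: f_j(x, z_<j) -> (mu_j, log_sigma_j)
        self.conditionals = nn.ModuleList([
            nn.Sequential(nn.Linear(x_dim + j, hidden), nn.Tanh(), nn.Linear(hidden, 2))
            for j in range(z_dim)
        ])

    def sample(self, x):
        """Sequential ancestral sampling: z_j depends on x and the previously drawn z_<j."""
        zs, log_q = [], 0.0
        for net in self.conditionals:
            h = torch.cat([x] + zs, dim=-1)                  # condition on x and z_<j
            mu, log_sigma = net(h).chunk(2, dim=-1)
            dist = torch.distributions.Normal(mu, log_sigma.exp())
            z_j = dist.rsample()                             # reparameterized, so gradients flow
            log_q = log_q + dist.log_prob(z_j).squeeze(-1)
            zs.append(z_j)
        return torch.cat(zs, dim=-1), log_q                  # z and log q_phi(z|x)
```

Sampling loops over the D dimensions, which is exactly the sequential cost noted above. Evaluating logqϕ(z∣x) for a given z reuses the same conditionals but involves no sequential dependence, since all of z is already available.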
Techniques like Inverse Autoregressive Flows (IAFs), which you may recall from Chapter 3, provide a way to implement such expressive autoregressive models where sampling can be parallelized, significantly speeding up the process. In IAFs, z is obtained by transforming a noise vector ϵ (where ϵj are independent) using an autoregressive transformation: zj=gj(ϵj;hj(x,ϵ<j)).
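Below is a compact, intentionally simplified sketch of a single IAF-style step. A strictly lower-triangular masked layer stands in for the autoregressive network hj, so every zj can be computed from ϵ in one parallel pass; real IAF implementations stack several MADE-style masked layers. The class names and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Linear):
    """Linear layer whose weight is multiplied by a fixed binary mask (autoregressive structure)."""
    def __init__(self, in_features, out_features, mask):
        super().__init__(in_features, out_features)
        self.register_buffer("mask", mask)

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

class SimpleIAFStep(nn.Module):
    """One IAF step: z_j = mu_j(eps_<j, x) + sigma_j(eps_<j, x) * eps_j, computed in parallel."""
    def __init__(self, z_dim, x_dim):
        super().__init__()
        # Strictly lower-triangular mask: output j only sees eps_1, ..., eps_{j-1}.
        mask = torch.tril(torch.ones(z_dim, z_dim), diagonal=-1)
        self.ar_mu = MaskedLinear(z_dim, z_dim, mask)
        self.ar_log_sigma = MaskedLinear(z_dim, z_dim, mask)
        self.ctx = nn.Linear(x_dim, 2 * z_dim)    # context from x; no masking needed for x

    def forward(self, eps, x):
        ctx_mu, ctx_log_sigma = self.ctx(x).chunk(2, dim=-1)
        mu = self.ar_mu(eps) + ctx_mu
        log_sigma = self.ar_log_sigma(eps) + ctx_log_sigma
        z = mu + log_sigma.exp() * eps            # all dimensions transformed at once
        log_det = log_sigma.sum(-1)               # log|det dz/d eps|, needed for the density
        return z, log_det
```

Given ϵ∼N(0,I), the density follows from the change of variables as logqϕ(z∣x)=logN(ϵ;0,I)−log_det, and because the full ϵ vector is available up front, all dimensions are transformed in a single pass.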
Normalizing Flows, also discussed in Chapter 3 (Section 3.5 "Normalizing Flows for Flexible Priors and Posteriors"), offer another general and powerful framework for constructing complex posterior distributions. A normalizing flow transforms a simple base distribution q0(u) (e.g., a standard multivariate Gaussian) through a sequence of invertible transformations f1,…,fK:
$$
z = f_K \circ \cdots \circ f_1(u), \qquad u \sim q_0(u)
$$

The density of z can be computed using the change of variables formula:
$$
q_\phi(z \mid x) = q_0(u) \left| \det \frac{\partial (f_K \circ \cdots \circ f_1)}{\partial u} \right|^{-1}
$$

The parameters of these transformations fk (and potentially the base distribution q0) are learned as part of ϕ and are typically conditioned on x. This allows qϕ(z∣x) to represent highly flexible distributions. The key requirement is that the transformations fk are designed so that their Jacobian determinants are computationally tractable. Examples include planar flows, radial flows, and more sophisticated flow architectures like RealNVP, MAF, and IAF.
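As an illustration, here is a sketch of one conditional planar flow layer whose parameters are predicted from x; it returns the transformed sample together with the log-determinant term required by the change of variables formula. The parameterization is deliberately simplified (the invertibility constraint on planar flows is omitted), and the class and attribute names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionalPlanarFlow(nn.Module):
    """One planar flow layer f(u) = u + v * tanh(w^T u + b), with v, w, b predicted from x."""
    def __init__(self, x_dim, z_dim, hidden=128):
        super().__init__()
        self.z_dim = z_dim
        # Amortized flow parameters: a small network maps x to (v, w, b).
        self.param_net = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * z_dim + 1))

    def forward(self, u, x):
        params = self.param_net(x)
        v = params[:, :self.z_dim]
        w = params[:, self.z_dim:2 * self.z_dim]
        b = params[:, -1:]
        pre = (u * w).sum(-1, keepdim=True) + b      # w^T u + b, shape (batch, 1)
        z = u + v * torch.tanh(pre)                  # planar transformation
        psi = (1.0 - torch.tanh(pre) ** 2) * w       # h'(w^T u + b) * w
        # log |det df/du| = log |1 + v^T psi(u)|
        log_det = torch.log(torch.abs(1.0 + (v * psi).sum(-1)) + 1e-8)
        return z, log_det
```

Stacking K such layers on a base sample u∼q0(u∣x) gives logqϕ(z∣x)=logq0(u∣x)−∑k log∣det∣k; a practical implementation would also constrain the parameters so that each layer stays invertible.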
Using normalizing flows for qϕ(z∣x) can significantly increase the expressiveness of the inference network, allowing it to better match the true posterior and thereby tighten the ELBO.
Adopting structured variational inference has several important consequences:
Improved ELBO and Model Quality: A more flexible qϕ(z∣x) can provide a tighter lower bound on the true log-likelihood logpθ(x). This often translates to better generative performance, such as sharper generated samples and higher likelihood scores on test data. The representations learned in the latent space may also become more meaningful as the inference network better captures the underlying data manifold.
Increased Computational Complexity: The primary trade-off is computational cost. Autoregressive posteriors require sequential sampling over the D latent dimensions, and flow-based posteriors add extra transformation layers and log-determinant computations to every forward pass.
Model Design Choices: You now have more choices to make regarding the architecture of qϕ(z∣x). For autoregressive models, this includes the ordering of latent variables and the architecture of the conditional networks. For normalizing flows, it involves selecting the type and number of flow layers. These choices can impact performance and computational load.
KL Divergence Term: The KL divergence DKL(qϕ(z∣x)∣∣p(z)) in the ELBO can also become harder to compute. If p(z) is a standard Gaussian and qϕ(z∣x) is a complex distribution (e.g., defined by a normalizing flow), the KL divergence generally has no analytical solution. In such cases it is estimated by Monte Carlo: sample z∼qϕ(z∣x) and average logqϕ(z∣x)−logp(z), as sketched below.
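A single-sample Monte Carlo estimate of this KL term might look like the following sketch, where z and log_qz come from the structured posterior (for example, the outputs of the flow layers above); the helper name is hypothetical.

```python
import torch

def monte_carlo_kl(z, log_qz):
    """Single-sample estimate of D_KL(q_phi(z|x) || p(z)) when no closed form exists.

    z      : latent sample drawn from q_phi(z|x) via reparameterization
    log_qz : log q_phi(z|x) at that sample (e.g. log q_0(u|x) minus the flow log-dets)
    """
    # Standard Gaussian prior p(z); its log density is evaluated analytically.
    prior = torch.distributions.Normal(torch.zeros_like(z), torch.ones_like(z))
    log_pz = prior.log_prob(z).sum(-1)
    # log q - log p averages (over samples or the batch) to the KL divergence.
    return log_qz - log_pz
```

Because z is produced by a reparameterized sampler, gradients flow through both terms, so this estimate can be plugged directly into the ELBO.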
Structured variational inference is particularly beneficial when the true posterior pθ(z∣x) exhibits strong dependencies among the latent dimensions, so that a factorized qϕ(z∣x) leaves a noticeable gap between the ELBO and the true log-likelihood logpθ(x).
While introducing structure adds complexity, the potential gains in model expressiveness and performance often justify the additional overhead, especially for challenging datasets or when aiming for state-of-the-art results. The techniques discussed here, such as autoregressive models and normalizing flows for qϕ(z∣x), are foundational for building more sophisticated and powerful VAEs. As we proceed, you'll see how these improved inference mechanisms can be combined with other advanced VAE components.