While the mean-field approximation, where we assume $q(z) = \prod_i q_i(z_i)$, simplifies the optimization considerably and makes Coordinate Ascent Variational Inference (CAVI) straightforward, it comes at a cost. This factorization enforces independence among the latent variables $z_i$ within the approximating distribution $q$. However, the true posterior $p(z \mid x)$ often exhibits strong correlations between variables.
Imagine a posterior distribution over two variables, $z_1$ and $z_2$, that are highly correlated, perhaps resembling a tilted ellipse in a 2D plot. A mean-field approximation, restricted to axis-aligned ellipses (since $q(z_1, z_2) = q_1(z_1)\,q_2(z_2)$ implies independence), will fundamentally fail to capture this correlation structure.
Figure: The mean-field approximation (yellow, axis-aligned) cannot capture the correlation present in the true posterior (blue, tilted).
This mismatch can lead to several issues: the approximation typically underestimates posterior variances (producing overconfident uncertainty estimates), it cannot represent the joint structure needed to answer queries that depend on correlated variables, and the ELBO remains loose because no fully factorized distribution lies close to the true posterior. The small numerical sketch below illustrates the variance underestimation.
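To make the first issue concrete, here is a minimal NumPy sketch. The 2-D Gaussian "posterior" and its 0.9 correlation are illustrative choices, not from any particular model; the sketch uses the standard result that, for a Gaussian target, the reverse-KL-optimal fully factorized Gaussian has marginal precisions equal to the diagonal entries of the target's precision matrix:

```python
import numpy as np

# True posterior: a correlated 2-D Gaussian (the 0.9 correlation is illustrative).
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Lambda = np.linalg.inv(Sigma)            # precision matrix

# For a Gaussian target, the reverse-KL-optimal fully factorized Gaussian
# has marginal precisions equal to the diagonal entries of the precision matrix.
mean_field_vars = 1.0 / np.diag(Lambda)

print("true marginal variances:", np.diag(Sigma))      # [1.0, 1.0]
print("mean-field variances:   ", mean_field_vars)     # [0.19, 0.19]
```

With a correlation of 0.9, the mean-field marginal variances collapse to 0.19 while the true marginal variances are 1.0, which is exactly the overconfidence described above.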
To address these limitations, researchers have developed more expressive variational families that relax the strict independence assumption of mean-field VI.
A relatively simple extension is Structured Mean-Field (also known as Block Mean-Field; it is closely related to algorithms such as Variational Message Passing). Instead of assuming full factorization, we partition the latent variables $z$ into disjoint blocks $z_1, \dots, z_M$ and assume factorization between these blocks, while allowing dependencies within each block:
$$q(z) = \prod_{k=1}^{M} q_k(z_k)$$
This allows the approximation $q_k(z_k)$ to capture correlations among the variables within the $k$-th block. The choice of partitioning is crucial and is often guided by the structure of the probabilistic model itself (e.g., grouping variables that appear together in factors of the joint distribution $p(x, z)$).
While more flexible than standard mean-field, optimizing structured mean-field approximations can be more complex. The CAVI updates involve expectations with respect to the joint distributions $q_k(z_k)$ within each block, which might not have simple closed-form solutions unless the block structures are chosen carefully.
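As a concrete sketch of what a block-factorized $q$ looks like (the blocks and all parameter values here are hypothetical, chosen only for illustration), the following defines one full-covariance Gaussian block over a correlated pair $(z_1, z_2)$ and an independent factor for $z_3$:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)

# Block 1: full-covariance Gaussian over the correlated pair (z1, z2).
mu_block = np.zeros(2)
Sigma_block = np.array([[1.0, 0.7],
                        [0.7, 1.0]])

# Block 2: an independent Gaussian factor for z3.
mu_3, sigma_3 = 0.0, 1.0

def sample_q():
    """Draw one sample from q(z) = q_1(z1, z2) * q_2(z3)."""
    z_12 = rng.multivariate_normal(mu_block, Sigma_block)
    z_3 = rng.normal(mu_3, sigma_3)
    return np.concatenate([z_12, [z_3]])

def log_q(z):
    """Factorization across blocks, full dependence within the first block."""
    return (multivariate_normal.logpdf(z[:2], mu_block, Sigma_block)
            + norm.logpdf(z[2], loc=mu_3, scale=sigma_3))

z = sample_q()
print(z, log_q(z))
```

The within-block covariance lets $q$ capture the $(z_1, z_2)$ correlation that a fully factorized approximation would discard.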
A powerful and popular approach for constructing highly flexible variational distributions is using Normalizing Flows. The core idea is to start with a simple base distribution $q_0(z_0)$ (e.g., a standard multivariate Gaussian) for which we can easily compute densities and draw samples. Then, we transform this simple distribution through a sequence of invertible functions $f_1, \dots, f_K$:
$$z_0 \sim q_0(z_0), \quad z_1 = f_1(z_0), \quad z_2 = f_2(z_1), \quad \dots, \quad z_K = f_K(z_{K-1})$$
The final variable $z = z_K$ follows a potentially much more complex distribution $q_K(z)$. Because each transformation $f_k$ is invertible, we can recover the initial noise $z_0$ from the final sample $z$: $z_0 = f_1^{-1}(f_2^{-1}(\dots f_K^{-1}(z)))$.
Crucially, if the transformations $f_k$ are chosen such that the determinant of their Jacobian matrix, $\det J_{f_k}$, is computationally tractable, we can compute the density of the final distribution $q_K(z)$ using the change of variables formula from probability theory:
$$\log q_K(z) = \log q_0(z_0) - \sum_{k=1}^{K} \log \left| \det J_{f_k}(z_{k-1}) \right|$$
Here, $z_0$ and $z_{k-1}$ are obtained by inverting the transformations starting from $z$. In practice, when optimizing the ELBO we usually sample $z_0$ from the base distribution and apply the flow forward, accumulating the log-determinant terms along the way, so no explicit inversion is needed during training. The functions $f_k$ are typically parameterized (e.g., using neural networks), and these parameters are optimized to maximize the ELBO.
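As a minimal sketch of this formula in code, here is a single planar flow layer $f(z) = z + u\,\tanh(w^\top z + b)$ in NumPy; the parameter values are arbitrary stand-ins for quantities that would normally be trained:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2

# Parameters of a single planar layer f(z) = z + u * tanh(w @ z + b).
# These values are arbitrary; in practice they are trained to maximize the ELBO.
u = np.array([0.5, -0.3])
w = np.array([1.0, 0.8])
b = 0.1

def planar_forward(z0):
    """Push a base sample through the layer, returning (z1, log|det J_f(z0)|)."""
    a = np.tanh(w @ z0 + b)
    z1 = z0 + u * a
    psi = (1.0 - a ** 2) * w                 # derivative of tanh times w
    log_det = np.log(np.abs(1.0 + u @ psi))  # planar-flow Jacobian determinant
    return z1, log_det

# Base distribution q0 = N(0, I): sample, push forward, apply change of variables.
z0 = rng.standard_normal(D)
log_q0 = -0.5 * (z0 @ z0 + D * np.log(2.0 * np.pi))
z1, log_det = planar_forward(z0)
log_q1 = log_q0 - log_det
print(z1, log_q1)
```

Stacking several such layers simply adds one log-determinant term per layer, exactly as in the sum above.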
Examples of flow transformations include planar and radial flows, affine coupling layers (as used in RealNVP), and autoregressive flows such as the Inverse Autoregressive Flow (IAF) and Masked Autoregressive Flow (MAF); a coupling-layer sketch follows below.
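To illustrate why coupling layers have tractable Jacobians, here is a minimal RealNVP-style affine coupling step; the toy conditioner standing in for a neural network is purely hypothetical:

```python
import numpy as np

def affine_coupling_forward(z, conditioner):
    """One RealNVP-style affine coupling step: the first half of z passes through
    unchanged, the second half is scaled and shifted as a function of the first."""
    d = z.shape[-1] // 2
    z_a, z_b = z[..., :d], z[..., d:]
    shift, log_scale = conditioner(z_a)        # any function of z_a is allowed
    y_b = z_b * np.exp(log_scale) + shift
    log_det = np.sum(log_scale, axis=-1)       # triangular Jacobian: sum of log-scales
    return np.concatenate([z_a, y_b], axis=-1), log_det

def toy_conditioner(z_a):
    """Hypothetical stand-in for a neural network producing (shift, log_scale)."""
    return 0.5 * z_a, 0.1 * z_a

z = np.random.default_rng(1).standard_normal(4)
y, log_det = affine_coupling_forward(z, toy_conditioner)
print(y, log_det)
```

Because the second half is transformed elementwise given the first half, the Jacobian is triangular and its log-determinant is just the sum of the log-scales, no matter how complex the conditioner is.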
Normalizing flows allow $q(z)$ to approximate arbitrarily complex posterior distributions, provided the flow is sufficiently deep and expressive. They significantly improve flexibility over mean-field approximations, often resulting in tighter ELBOs and better capture of posterior dependencies.
Another class of advanced families involves Implicit Distributions. These are distributions $q(z)$ from which it is easy to sample, but difficult or impossible to evaluate the probability density function $q(z)$ itself. Often, they are defined through a sampling process, $z = g(\epsilon, \phi)$, where $\epsilon$ is drawn from a simple noise distribution (e.g., Gaussian) and $g$ is a complex, non-invertible function (such as a deep neural network) parameterized by $\phi$.
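A minimal sketch of such a sampler, with a small fixed two-layer network standing in for $g$ (the architecture, dimensions, and parameters are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

def g(eps, phi):
    """Non-invertible sampler z = g(eps, phi): a stand-in for a deep network.
    Drawing samples is cheap, but q(z) has no tractable closed-form density."""
    W1, b1, W2, b2 = phi
    h = np.maximum(0.0, eps @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                   # maps 4-D noise to a 2-D latent z

phi = (rng.standard_normal((4, 16)), np.zeros(16),
       rng.standard_normal((16, 2)), np.zeros(2))

eps = rng.standard_normal((1000, 4))     # simple base noise
z_samples = g(eps, phi)                  # easy to sample, hard to evaluate q(z)
```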
The challenge with implicit distributions is the entropy term $-\mathbb{E}_{q(z)}[\log q(z)]$ in the ELBO, which requires evaluating the density $q(z)$. Various techniques exist to work around this, including density ratio estimation with an auxiliary classifier (as in adversarial formulations of variational inference), kernel- or spectral-based estimators of the score or entropy, and auxiliary-variable constructions that bound the intractable entropy term.
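As a rough sketch of the density ratio idea (a deliberately simple classifier and toy sampler, not a method prescribed above), a classifier trained to distinguish samples from $q$ against samples from a tractable reference $r$ has logits that approximate $\log q(z) - \log r(z)$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Samples from an implicit q(z) (a toy nonlinear pushforward of Gaussian noise)
# and from a tractable reference r(z) = N(0, I).
z_q = np.tanh(rng.standard_normal((5000, 2)) @ np.array([[1.5, 0.5],
                                                         [0.0, 1.0]]))
z_r = rng.standard_normal((5000, 2))

# Train a classifier to tell the two sample sets apart (equal sizes, so the
# optimal classifier's logit equals log q(z) - log r(z)). A linear model keeps
# the sketch short; in practice a flexible (neural) classifier is needed.
X = np.vstack([z_q, z_r])
y = np.concatenate([np.ones(len(z_q)), np.zeros(len(z_r))])
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Approximate log q(z) = estimated log-ratio + known log r(z).
log_ratio = clf.decision_function(z_q)
log_r = -0.5 * (np.sum(z_q ** 2, axis=1) + 2 * np.log(2.0 * np.pi))
approx_log_q = log_ratio + log_r
```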
Implicit distributions are particularly useful when the primary goal is to generate samples that resemble the posterior, even if density evaluation is not strictly necessary for the downstream task.
Related to some types of normalizing flows, Autoregressive Models explicitly factorize the variational distribution using the chain rule of probability, without necessarily requiring each step to be easily invertible:
$$q(z) = \prod_{i=1}^{D} q(z_i \mid z_1, \dots, z_{i-1})$$
Each conditional distribution $q(z_i \mid z_{<i})$ can be modeled using a flexible function, often a neural network that takes the preceding variables $z_{<i}$ as input and outputs the parameters of the distribution for $z_i$ (e.g., mean and variance if $q$ is Gaussian). Examples include models like MADE (Masked Autoencoder for Distribution Estimation) or PixelCNN/RNN applied to latent variables.
These models can capture arbitrary dependencies, as they impose no conditional independence assumptions beyond the chosen ordering in the chain rule. Sampling requires sequential generation, while density evaluation is straightforward: evaluate each conditional term and sum the log-probabilities. The sketch below illustrates both.
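Here is a minimal sketch with hand-written Gaussian conditionals standing in for neural networks (the specific functional forms and dimension $D=3$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
D = 3

def conditional_params(z_prev):
    """Hypothetical conditional q(z_i | z_<i): a hand-written stand-in for a
    neural network mapping the preceding variables to (mean, log-variance)."""
    mu = 0.8 * np.sum(z_prev) if len(z_prev) else 0.0
    log_var = -0.5 * len(z_prev)
    return mu, log_var

def sample_and_log_q():
    """Sample z sequentially and accumulate log q(z) = sum_i log q(z_i | z_<i)."""
    z, log_q = [], 0.0
    for _ in range(D):
        mu, log_var = conditional_params(z)
        z_i = mu + np.exp(0.5 * log_var) * rng.standard_normal()
        log_q += -0.5 * (np.log(2.0 * np.pi) + log_var
                         + (z_i - mu) ** 2 / np.exp(log_var))
        z.append(z_i)                    # sampling is inherently sequential
    return np.array(z), log_q

z, log_q = sample_and_log_q()
print(z, log_q)
```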
Moving beyond mean-field VI introduces a trade-off between approximation accuracy and computational cost/complexity: more expressive families can capture posterior dependencies and tighten the ELBO, but each gradient step becomes more expensive, closed-form coordinate updates are usually replaced by stochastic gradient optimization, implementation and tuning effort grows, and in the implicit case even evaluating $\log q(z)$ becomes a problem in its own right.
The choice of variational family is an important modeling decision. While mean-field VI provides a scalable and often effective baseline, understanding and utilizing these more advanced families is essential when posterior dependencies are strong or when higher fidelity posterior approximations are required for accurate uncertainty quantification and model performance. Model checking and comparing the tightness of the ELBO across different families can help guide this choice.