Complex dependencies often exist between variables in the true posterior distribution, making direct inference challenging. A common approach to address this complexity is the mean-field approximation. This method assumes a factorized form for the approximating distribution, $q(\mathbf{z}) = \prod_{i} q_i(z_i)$, which considerably simplifies optimization and makes algorithms like Coordinate Ascent Variational Inference (CAVI) straightforward to implement. However, this factorization enforces independence among the latent variables within the approximating distribution $q(\mathbf{z})$. This assumption is often a strong simplification, as the true posterior $p(\mathbf{z} \mid \mathbf{x})$ frequently exhibits strong correlations between its variables.
Imagine a posterior distribution over two variables, $z_1$ and $z_2$, that are highly correlated, perhaps resembling a tilted ellipse in a 2D plot. A mean-field approximation, restricted to axis-aligned ellipses (since $q(z_1, z_2) = q_1(z_1)\,q_2(z_2)$ implies independence), will fundamentally fail to capture this correlation structure.
The mean-field approximation (yellow, axis-aligned) cannot capture the correlation present in the true posterior (blue, tilted).
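To make this concrete, here is a small NumPy sketch (the correlation value of 0.9 is illustrative) comparing the true marginal variances of a correlated 2D Gaussian with the variances of the optimal mean-field Gaussian approximation. Under the reverse KL used in VI, each factor's precision equals the corresponding diagonal entry of the posterior precision matrix, so the marginals come out too narrow.

```python
import numpy as np

# True posterior: a correlated 2D Gaussian (illustrative numbers).
rho = 0.9
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)          # precision matrix

# Optimal mean-field Gaussian q(z1)q(z2) under KL(q || p):
# each factor's precision equals the corresponding diagonal
# entry of the posterior precision matrix.
mf_vars = 1.0 / np.diag(Lambda)

print("true marginal variances:", np.diag(Sigma))   # [1.0, 1.0]
print("mean-field variances:   ", mf_vars)          # [0.19, 0.19]
```

With a correlation of 0.9, the factorized approximation reports marginal variances of 0.19 instead of 1.0, a direct illustration of the variance underestimation discussed below.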
This mismatch can lead to several issues:
- Underestimated variances: minimizing $\mathrm{KL}(q \,\|\, p)$ with a factorized $q$ tends to produce marginals that are narrower than the true ones, so uncertainty estimates become overconfident.
- Lost correlation structure: any downstream quantity that depends on how the latent variables co-vary cannot be recovered from the factorized approximation.
- A looser ELBO: because the family cannot reach the true posterior, the bound on the log evidence is slacker, which can also distort model comparison based on the ELBO.
To address these limitations, researchers have developed more expressive variational families that relax the strict independence assumption of mean-field VI.
A relatively simple extension is Structured Mean-Field (also known as Block Mean-Field, and closely related to Variational Message Passing in some contexts). Instead of assuming full factorization, we partition the latent variables into $K$ disjoint sets $\mathbf{z}_1, \dots, \mathbf{z}_K$ and assume factorization between these sets, but allow dependencies within each set:
$$q(\mathbf{z}) = \prod_{k=1}^{K} q_k(\mathbf{z}_k).$$
This allows the approximation to capture correlations among the variables within the $k$-th group. The choice of partitioning is important and often guided by the structure of the probabilistic model itself (e.g., grouping variables that appear together in factors of the joint distribution $p(\mathbf{x}, \mathbf{z})$).
While more flexible than standard mean-field, optimizing structured mean-field approximations can be more complex. The CAVI updates involve expectations with respect to the joint distributions within each block, which might not have simple closed-form solutions unless the block structures are chosen carefully.
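As a minimal sketch (the block assignment and the Gaussian form of each factor are illustrative assumptions, not a full CAVI implementation), a structured mean-field family can be represented as a product of full-covariance Gaussians, one per block: sampling and log-density evaluation factor across blocks while correlations inside each block are preserved.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative partition of a 5-dimensional latent vector into two blocks.
blocks = [np.array([0, 1]), np.array([2, 3, 4])]

# One full-covariance Gaussian per block (parameters chosen arbitrarily here;
# in practice they would be optimized to maximize the ELBO).
rng = np.random.default_rng(0)
block_dists = []
for idx in blocks:
    d = len(idx)
    A = rng.normal(size=(d, d))
    cov = A @ A.T + d * np.eye(d)      # random symmetric positive-definite covariance
    block_dists.append(multivariate_normal(mean=np.zeros(d), cov=cov))

def sample_q():
    """Draw z ~ q(z) = prod_k q_k(z_k); blocks are independent of each other."""
    z = np.empty(5)
    for idx, dist in zip(blocks, block_dists):
        z[idx] = dist.rvs()
    return z

def log_q(z):
    """log q(z) is a sum of per-block log densities."""
    return sum(dist.logpdf(z[idx]) for idx, dist in zip(blocks, block_dists))

z = sample_q()
print("sample:", z, " log q(z):", log_q(z))
```

Within each block, arbitrary correlations are represented by the full covariance; across blocks, the approximation still assumes independence.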
A powerful and popular approach for constructing highly flexible variational distributions is using Normalizing Flows. The core idea is to start with a simple base distribution $q_0(\mathbf{z}_0)$ (e.g., a standard multivariate Gaussian) for which we can easily compute densities and draw samples. Then, we transform this simple distribution through a sequence of invertible functions $f_1, \dots, f_K$:
$$\mathbf{z}_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z}_0).$$
The final variable $\mathbf{z}_K$ follows a potentially much more complex distribution $q_K(\mathbf{z}_K)$. Because each transformation is invertible, we can recover the initial noise $\mathbf{z}_0$ from the final sample $\mathbf{z}_K$: $\mathbf{z}_0 = f_1^{-1} \circ \cdots \circ f_K^{-1}(\mathbf{z}_K)$.
Crucially, if the transformations are chosen such that the determinant of their Jacobian matrix, $\det\!\left(\partial f_k / \partial \mathbf{z}_{k-1}\right)$, is computationally tractable, we can compute the density of the final distribution using the change of variables formula from probability theory:
$$\log q_K(\mathbf{z}_K) = \log q_0(\mathbf{z}_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right|.$$
Here, $\mathbf{z}_0$ and the intermediate variables $\mathbf{z}_1, \dots, \mathbf{z}_{K-1}$ are obtained by inverting the transformations starting from $\mathbf{z}_K$. The functions $f_k$ are typically parameterized (e.g., using neural networks), and these parameters are optimized to maximize the ELBO.
Examples of flow transformations include:
- Planar and radial flows, which apply simple, analytically tractable perturbations to the base sample.
- Affine coupling layers (as in RealNVP), which transform one subset of dimensions conditioned on the other, yielding a triangular Jacobian.
- Autoregressive flows such as IAF and MAF, where each dimension is transformed conditioned on the previous ones.
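As a concrete illustration (a minimal NumPy sketch with arbitrary parameter values, not a library implementation), a single planar flow $f(\mathbf{z}) = \mathbf{z} + \mathbf{u}\,\tanh(\mathbf{w}^\top \mathbf{z} + b)$ has a closed-form log-det-Jacobian, so the change of variables formula above can be evaluated cheaply for samples pushed forward from the base distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 2

# Parameters of one planar flow (arbitrary values for illustration; in VI
# they would be trained to maximize the ELBO).
# Note: for guaranteed invertibility, planar flows require w.u >= -1;
# this constraint is ignored here for brevity.
u = rng.normal(size=D)
w = rng.normal(size=D)
b = 0.5

def planar_forward(z0):
    """f(z) = z + u * tanh(w.z + b) and its log |det Jacobian|."""
    a = w @ z0 + b
    z1 = z0 + u * np.tanh(a)
    psi = (1.0 - np.tanh(a) ** 2) * w            # h'(a) * w
    log_det = np.log(np.abs(1.0 + u @ psi))
    return z1, log_det

# Base distribution: standard Gaussian, log q0(z0) is available in closed form.
z0 = rng.normal(size=D)
log_q0 = -0.5 * (D * np.log(2 * np.pi) + z0 @ z0)

z1, log_det = planar_forward(z0)
log_q1 = log_q0 - log_det                        # change of variables formula
print("z1:", z1, " log q1(z1):", log_q1)
```

Stacking several such transformations (summing their log-det terms) produces progressively richer distributions while keeping the density computation exact.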
Normalizing flows make it possible to approximate arbitrarily complex posterior distributions, provided the flow is sufficiently deep and expressive. They significantly improve flexibility over mean-field approximations, often yielding tighter ELBOs and a better capture of posterior dependencies.
Another class of advanced families involves Implicit Distributions. These are distributions from which it is easy to sample, but difficult or impossible to evaluate the probability density function itself. Often, these are defined through a sampling process:
$$\mathbf{z} = g_\phi(\boldsymbol{\epsilon}), \qquad \boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon}),$$
where $\boldsymbol{\epsilon}$ is drawn from a simple noise distribution (e.g., Gaussian) and $g_\phi$ is a complex, non-invertible function (like a deep neural network) parameterized by $\phi$.
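A minimal sketch of such a sampler (the two-layer network with random weights is purely illustrative) shows why sampling is trivial while density evaluation is not: the mapping below is non-invertible, so there is no change-of-variables formula to fall back on.

```python
import numpy as np

rng = np.random.default_rng(2)
noise_dim, latent_dim, hidden = 4, 2, 16

# Random weights of a small non-invertible network g_phi (illustrative;
# in practice these are the variational parameters being optimized).
W1 = rng.normal(size=(hidden, noise_dim))
b1 = np.zeros(hidden)
W2 = rng.normal(size=(latent_dim, hidden))
b2 = np.zeros(latent_dim)

def sample_implicit(n):
    """Draw z = g_phi(eps) with eps ~ N(0, I): sampling is a forward pass."""
    eps = rng.normal(size=(n, noise_dim))
    h = np.maximum(0.0, eps @ W1.T + b1)   # ReLU hidden layer
    return h @ W2.T + b2

z = sample_implicit(1000)                  # easy: 1000 posterior-like samples
print("sample mean:", z.mean(axis=0))
# log q_phi(z) has no closed form here: g_phi maps R^4 -> R^2 and is not
# invertible, so the density of z cannot be evaluated directly.
```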
The challenge with implicit distributions is the entropy term in the ELBO, which requires evaluating the density $q_\phi(\mathbf{z})$. Various techniques exist to work around this:
- Density ratio estimation, where an auxiliary classifier is trained to estimate the ratio between $q_\phi(\mathbf{z})$ and a tractable reference distribution (the idea behind adversarial formulations of VI).
- Sample-based estimators of the entropy or of its gradient (e.g., kernel or nearest-neighbor estimators), which trade bias for tractability; a small sketch of this idea follows below.
- Reformulating the objective or the model (e.g., with auxiliary variables) so that the explicit density is never needed.
Implicit distributions are particularly useful when the primary goal is to generate samples that resemble the posterior and density evaluation is not strictly necessary for the downstream task.
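To illustrate the second workaround in the list above, here is a small self-contained sketch. The "implicit" samples are generated by a simple squashed Gaussian stand-in, and a Gaussian kernel density estimate is just one simple, biased choice of sample-based estimator; it is meant only to show the shape of the idea.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)

# Stand-in for samples from an implicit q_phi (purely illustrative;
# any set of samples from an implicit sampler would work the same way).
eps = rng.normal(size=(2000, 2))
z = np.tanh(eps @ np.array([[1.0, 0.5], [0.0, 1.0]]))

# Fit a kernel density estimate to the samples and use it to approximate
# the entropy  H[q] = -E_q[log q(z)]  by a Monte Carlo average.
kde = gaussian_kde(z.T)                     # scipy expects shape (dim, n)
entropy_estimate = -np.mean(np.log(kde(z.T)))
print("approximate entropy of q_phi:", entropy_estimate)

# This estimator is biased and degrades in high dimensions, but it shows how
# an intractable entropy term can be replaced by a sample-based surrogate.
```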
Related to some types of normalizing flows, Autoregressive Models explicitly factorize the variational distribution using the chain rule of probability, without necessarily requiring each step to be easily invertible:
$$q(\mathbf{z}) = \prod_{i=1}^{D} q(z_i \mid z_1, \dots, z_{i-1}).$$
Each conditional distribution $q(z_i \mid z_1, \dots, z_{i-1})$ can be modeled using a flexible function, often a neural network that takes the preceding variables as input and outputs the parameters of the distribution for $z_i$ (e.g., mean and variance if the conditional is Gaussian). Examples include models like MADE (Masked Autoencoder for Distribution Estimation) or PixelCNN/RNN applied to latent variables.
These models can capture arbitrary dependencies: the chain-rule factorization is exact, so no conditional independence assumptions are imposed (the chosen ordering only affects how easy each conditional is to model). Sampling requires sequential generation, one variable at a time, while density evaluation is straightforward, since it only requires computing and summing each conditional log term.
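A minimal sketch of a Gaussian autoregressive variational distribution (the linear conditionals below are a deliberately simple stand-in for the neural networks mentioned above) illustrates both sequential sampling and straightforward log-density evaluation:

```python
import numpy as np

rng = np.random.default_rng(4)
D = 3

# Each conditional q(z_i | z_<i) = N(mu_i(z_<i), sigma_i^2), with mu_i a
# linear function of the preceding variables (illustrative parameters; a
# neural network would normally play this role).
weights = [rng.normal(size=i) for i in range(D)]   # weights[i] has length i
biases = rng.normal(size=D)
log_sigmas = np.full(D, -0.5)

def sample_and_logq():
    """Sample z ~ q(z) sequentially and accumulate log q(z) on the way."""
    z = np.empty(D)
    log_q = 0.0
    for i in range(D):
        mu = weights[i] @ z[:i] + biases[i] if i > 0 else biases[i]
        sigma = np.exp(log_sigmas[i])
        z[i] = mu + sigma * rng.normal()           # sequential generation
        log_q += (-0.5 * np.log(2 * np.pi) - log_sigmas[i]
                  - 0.5 * ((z[i] - mu) / sigma) ** 2)
    return z, log_q

z, log_q = sample_and_logq()
print("z:", z, " log q(z):", log_q)
```

Evaluating log q at a given z follows the same loop without the sampling step, which is why density evaluation remains cheap even though sampling is inherently sequential.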
Moving beyond mean-field VI introduces a trade-off between approximation accuracy and computational cost and complexity: structured mean-field, normalizing flows, implicit distributions, and autoregressive models all capture more posterior structure than a fully factorized family, but their updates, gradient estimates, or sampling procedures are correspondingly more expensive and more involved to implement and tune.
The choice of variational family is an important modeling decision. While mean-field VI provides a scalable and often effective baseline, understanding and utilizing these more advanced families is essential when posterior dependencies are strong or when higher fidelity posterior approximations are required for accurate uncertainty quantification and model performance. Model checking and comparing the tightness of the ELBO across different families can help guide this choice.