Variational Inference (VI) transforms the challenge of computing the posterior $p(\mathbf{z} \mid \mathbf{x})$ into an optimization problem. We aim to find a distribution $q(\mathbf{z})$ within a chosen family $\mathcal{Q}$ that best approximates the true posterior, typically by maximizing the Evidence Lower Bound (ELBO):

$$\mathcal{L}(q) = \mathbb{E}_{q(\mathbf{z})}\big[\log p(\mathbf{x}, \mathbf{z})\big] - \mathbb{E}_{q(\mathbf{z})}\big[\log q(\mathbf{z})\big]$$
The main step is selecting the family $\mathcal{Q}$. If we allowed $\mathcal{Q}$ to contain all possible distributions, the optimal $q(\mathbf{z})$ would be the true posterior itself, bringing us back to the original intractable problem. Therefore, the essence of practical VI lies in choosing a restricted, tractable family for $q(\mathbf{z})$.
One of the most widely used strategies is the mean-field variational family. This approach introduces a significant simplification by assuming that the latent variables in the approximating distribution are mutually independent. Mathematically, we enforce a fully factorized structure:

$$q(\mathbf{z}) = \prod_{i=1}^{m} q_i(z_i)$$
Here, the joint variational distribution $q(\mathbf{z})$ is broken down into a product of independent factors, where each factor $q_i(z_i)$ governs only a single latent variable $z_i$ (or sometimes a block of variables, though full factorization is common). Each factor is itself a probability distribution, often belonging to a simple parametric family (like Gaussian or Dirichlet) and parameterized by its own set of variational parameters $\phi_i$. The goal of VI then becomes finding the optimal parameters $\{\phi_i\}$ that maximize the ELBO.
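To make these objects concrete, here is a minimal sketch of the setup. The model (`log_joint`), the two-variable toy likelihood, and the function names are illustrative assumptions, not part of any standard API: the variational parameters are the per-factor means and log standard deviations of a fully factorized Gaussian $q$, and the ELBO is estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_joint(z, x):
    """Hypothetical toy model: z ~ N(0, I) prior, x ~ N(z_1 + z_2, 1) likelihood."""
    log_prior = -0.5 * np.sum(z**2, axis=-1) - 0.5 * z.shape[-1] * np.log(2 * np.pi)
    log_lik = -0.5 * (x - np.sum(z, axis=-1)) ** 2 - 0.5 * np.log(2 * np.pi)
    return log_prior + log_lik

def elbo_estimate(m, log_s, x, num_samples=2000):
    """Monte Carlo ELBO for the factorized q(z) = prod_i N(z_i | m_i, exp(log_s_i)^2)."""
    s = np.exp(log_s)
    z = m + s * rng.standard_normal((num_samples, m.shape[0]))  # samples from q
    log_q = np.sum(-0.5 * ((z - m) / s) ** 2 - log_s - 0.5 * np.log(2 * np.pi), axis=-1)
    return np.mean(log_joint(z, x) - log_q)  # E_q[log p(x, z)] plus the entropy of q

# Two latent variables, one observation x = 1.5; phi_i = (m_i, log_s_i) per factor.
m, log_s = np.zeros(2), np.zeros(2)
print(elbo_estimate(m, log_s, x=1.5))
```

Optimizing the ELBO then amounts to adjusting `m` and `log_s`, whether by gradient ascent or by the coordinate updates derived below.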
This factorization assumption has a direct impact on the ELBO terms. The entropy term, $\mathbb{E}_{q(\mathbf{z})}[-\log q(\mathbf{z})]$, becomes particularly manageable:

$$\mathbb{E}_{q(\mathbf{z})}\big[-\log q(\mathbf{z})\big] = \mathbb{E}_{q(\mathbf{z})}\Big[-\sum_{i=1}^{m} \log q_i(z_i)\Big] = \sum_{i=1}^{m} \mathbb{E}_{q_i(z_i)}\big[-\log q_i(z_i)\big] = \sum_{i=1}^{m} \mathcal{H}\big[q_i\big]$$
The expectation of the sum becomes the sum of expectations, and because each $\log q_i(z_i)$ depends only on $z_i$, the expectation simplifies to $\mathbb{E}_{q_i(z_i)}[-\log q_i(z_i)]$ for the $i$-th term. This means the total entropy is just the sum of the individual entropies $\mathcal{H}[q_i]$ of the factors. If the factors belong to standard exponential families, their entropies often have closed-form expressions, making this term easy to compute.
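For example, assuming two Gaussian factors (an illustrative choice), each entropy has the closed form $\mathcal{H}[\mathcal{N}(\mu, \sigma^2)] = \tfrac{1}{2}\log(2\pi e \sigma^2)$, and the total entropy of $q$ is simply their sum. A quick numerical check:

```python
import numpy as np
from scipy.stats import norm

# q(z) = N(z_1 | 0, 1.0^2) * N(z_2 | 2, 0.5^2): total entropy = sum of factor entropies.
sigmas = np.array([1.0, 0.5])
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigmas**2)
print(closed_form.sum())                                   # analytic total entropy
print(sum(norm(loc=mu, scale=s).entropy()                  # matches scipy's per-factor values
          for mu, s in zip([0.0, 2.0], sigmas)))
```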
The first term, $\mathbb{E}_{q(\mathbf{z})}[\log p(\mathbf{x}, \mathbf{z})]$, also simplifies. The expectation is taken over the factorized distribution $q(\mathbf{z}) = \prod_i q_i(z_i)$, meaning we integrate or sum over each $z_i$ according to its factor $q_i(z_i)$. The specific calculation depends heavily on the structure of the model's joint probability $p(\mathbf{x}, \mathbf{z})$.
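A useful consequence of the factorization (illustrated here for generic functions $f$ and $g$, not tied to a particular model) is that expectations of products of terms involving different latent variables split into products of per-factor expectations:

$$\mathbb{E}_{q(\mathbf{z})}\big[f(z_i)\,g(z_k)\big] = \mathbb{E}_{q_i}\big[f(z_i)\big]\,\mathbb{E}_{q_k}\big[g(z_k)\big], \qquad i \neq k$$

So when $\log p(\mathbf{x}, \mathbf{z})$ decomposes into terms that each involve only a few latent variables, $\mathbb{E}_{q}[\log p(\mathbf{x}, \mathbf{z})]$ reduces to a sum of low-dimensional expectations.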
How do we find the optimal form for each factor $q_j(z_j)$? We can optimize the ELBO with respect to one factor $q_j$ while holding the others ($q_i$ for $i \neq j$) fixed. This is the core idea behind algorithms like Coordinate Ascent Variational Inference (CAVI), which we'll discuss in the next section.
Let's isolate the terms in the ELBO that depend on a specific factor $q_j(z_j)$. We can rewrite the ELBO as:

$$\mathcal{L} = \int q_j(z_j) \left( \int \log p(\mathbf{x}, \mathbf{z}) \prod_{i \neq j} q_i(z_i) \, d\mathbf{z}_{-j} \right) dz_j - \int q_j(z_j) \log q_j(z_j) \, dz_j + \text{const}$$
where $d\mathbf{z}_{-j}$ denotes integration over all variables except $z_j$, and the constant collects the terms that do not involve $q_j$ (such as the entropies of the other factors). The term inside the parentheses in the first integral is an expectation of the log joint probability taken with respect to all factors except $q_j$:

$$\mathbb{E}_{-j}\big[\log p(\mathbf{x}, \mathbf{z})\big] = \int \log p(\mathbf{x}, \mathbf{z}) \prod_{i \neq j} q_i(z_i) \, d\mathbf{z}_{-j}$$
This expectation yields a function that depends on $z_j$ (and on $\mathbf{x}$ and the parameters of the fixed $q_i$'s). Let's denote it $\log \tilde{p}(\mathbf{x}, z_j) = \mathbb{E}_{-j}[\log p(\mathbf{x}, \mathbf{z})]$. Then the terms in the ELBO depending on $q_j$ are:

$$\mathcal{L}_j = \int q_j(z_j) \log \tilde{p}(\mathbf{x}, z_j) \, dz_j - \int q_j(z_j) \log q_j(z_j) \, dz_j$$
This expression looks familiar. It is equal to the negative Kullback-Leibler (KL) divergence between $q_j(z_j)$ and an unnormalized distribution proportional to $\tilde{p}(\mathbf{x}, z_j)$, plus a constant (the log normalizer of $\tilde{p}$):

$$\mathcal{L}_j = -\mathrm{KL}\Big(q_j(z_j) \,\Big\|\, \tfrac{1}{Z_j}\tilde{p}(\mathbf{x}, z_j)\Big) + \log Z_j, \qquad Z_j = \int \tilde{p}(\mathbf{x}, z_j)\, dz_j$$
Maximizing $\mathcal{L}_j$ (and thus the full ELBO with respect to $q_j$, holding the others fixed) is equivalent to minimizing the KL divergence $\mathrm{KL}\big(q_j \,\|\, \tilde{p}(\mathbf{x}, z_j)/Z_j\big)$. The minimum KL divergence value of zero is achieved when $q_j(z_j)$ is exactly equal to the normalized distribution $\tilde{p}(\mathbf{x}, z_j)/Z_j$. Therefore, the optimal solution for the factor $q_j$ satisfies:

$$\log q_j^*(z_j) = \mathbb{E}_{-j}\big[\log p(\mathbf{x}, \mathbf{z})\big] + \text{const}$$
Or, equivalently:

$$q_j^*(z_j) \propto \exp\Big(\mathbb{E}_{-j}\big[\log p(\mathbf{x}, \mathbf{z})\big]\Big)$$
This important result provides a recipe for finding the optimal form of each factor $q_j(z_j)$, assuming all other factors ($q_i$, $i \neq j$) are fixed. It states that the optimal log-density for $z_j$ is obtained by taking the log of the model's joint probability, $\log p(\mathbf{x}, \mathbf{z})$, and then averaging over all other variables ($z_i$, $i \neq j$) according to their current variational distributions $q_i(z_i)$. This forms the basis for iterative update schemes like CAVI, where we cycle through the variables, updating each factor based on the current estimates of the others.
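To see the recipe in action, consider a standard conjugate example (a toy model assumed here for illustration, not taken from the text above): $N$ observations $x_n \sim \mathcal{N}(\mu, \tau^{-1})$ with priors $\mu \mid \tau \sim \mathcal{N}(\mu_0, (\lambda_0 \tau)^{-1})$ and $\tau \sim \mathrm{Gamma}(a_0, b_0)$, approximated with the mean-field family $q(\mu, \tau) = q(\mu)\,q(\tau)$. Applying the formula to $q(\mu)$ and keeping only terms that involve $\mu$:

$$\log q^*(\mu) = \mathbb{E}_{q(\tau)}\big[\log p(\mathbf{x}, \mu, \tau)\big] + \text{const} = -\frac{\mathbb{E}_{q(\tau)}[\tau]}{2}\Big(\lambda_0(\mu - \mu_0)^2 + \sum_{n=1}^{N}(x_n - \mu)^2\Big) + \text{const}$$

This is quadratic in $\mu$, so $q^*(\mu)$ is Gaussian, with precision $(\lambda_0 + N)\,\mathbb{E}_{q(\tau)}[\tau]$ and mean $(\lambda_0 \mu_0 + N\bar{x})/(\lambda_0 + N)$. Applying the same formula to $q(\tau)$ yields a Gamma distribution, and each update depends on expectations under the other factor, which is exactly the coupling that the iterative CAVI scheme exploits.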
The primary strength of the mean-field approximation is computational tractability. By breaking the dependency structure among latent variables within the variational approximation $q(\mathbf{z})$, we convert a potentially complex optimization problem over a high-dimensional distribution into a series of simpler optimizations over the lower-dimensional factors $q_i(z_i)$. If the model structure and choice of families are compatible (e.g., using conjugate priors in the model often leads to recognizable forms for $q_j^*$), these updates can sometimes be derived analytically.
However, this simplification comes at a cost. The central assumption is that the variational posterior factors are independent: $q(\mathbf{z}) = \prod_i q_i(z_i)$. If the true posterior $p(\mathbf{z} \mid \mathbf{x})$ exhibits significant correlations between the latent variables $z_i$, the mean-field approximation will fail to capture these dependencies by definition.
Consider a simple 2D Gaussian example. If the true posterior shows strong negative correlation between $z_1$ and $z_2$, the best mean-field approximation $q(z_1)\,q(z_2)$ will be a Gaussian with zero correlation (its contours will be aligned with the axes), even if it manages to center itself correctly; as discussed below, its marginal variances will also tend to be too small.
Illustration comparing a true posterior with correlation (dashed orange contours) and its best mean-field approximation (solid blue contours). The approximation centers correctly but enforces independence, thus missing the correlation structure present in the true posterior.
This inability to capture posterior correlations is a well-known characteristic of mean-field VI. Because the KL divergence $\mathrm{KL}(q \,\|\, p)$ penalizes placing probability mass with $q(\mathbf{z})$ where $p(\mathbf{z} \mid \mathbf{x})$ has none, the resulting $q(\mathbf{z})$ tends to be more "compact" or concentrated around the mode than the true posterior. This often leads to an underestimation of posterior variances and potentially overly confident uncertainty estimates.
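For a Gaussian target this effect can be computed exactly. If the true posterior is $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with precision matrix $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$, the optimal-factor formula gives Gaussian factors with variances $1/\Lambda_{jj}$, which never exceed the true marginal variances $\Sigma_{jj}$. The sketch below (a toy target with an assumed correlation of $-0.9$) runs the coordinate mean updates and compares the variances:

```python
import numpy as np

# Assumed toy target: 2D Gaussian posterior with strong negative correlation.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, -0.9],
                  [-0.9, 1.0]])
Lambda = np.linalg.inv(Sigma)           # precision matrix

# For a Gaussian target, the optimal mean-field factor q_j is Gaussian with
# precision Lambda[j, j]; its mean follows the coordinate update
#   m_j <- mu_j - (1 / Lambda[j, j]) * sum_{i != j} Lambda[j, i] * (m_i - mu_i)
m = np.array([1.0, -1.0])               # arbitrary initialization of the factor means
for _ in range(200):
    for j in range(2):
        i = 1 - j
        m[j] = mu[j] - Lambda[j, i] * (m[i] - mu[i]) / Lambda[j, j]

print("mean-field means:", m)                           # converge to the true mean [0, 0]
print("mean-field variances:", 1.0 / np.diag(Lambda))   # 1 - 0.9^2 = 0.19 each
print("true marginal variances:", np.diag(Sigma))       # 1.0 each
```

The factor variances (0.19) are far smaller than the true marginal variances (1.0), illustrating the underestimation described above.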
Despite these limitations, the mean-field assumption is foundational in variational inference. Its computational advantages are significant, making Bayesian inference feasible for many complex models and large datasets where MCMC methods might struggle with convergence time or memory requirements. Understanding the mean-field approximation, its derivation via the optimal factor updates, and its inherent limitations regarding posterior dependencies and variance estimation is essential for applying VI effectively and interpreting its results critically. We will now explore the Coordinate Ascent Variational Inference (CAVI) algorithm, which directly implements the iterative updates derived from the mean-field assumption.