As we established in the chapter introduction, directly computing the posterior distribution p(z∣x) in Bayesian models is often hindered by the intractable evidence term p(x)=∫p(x,z)dz. While MCMC methods provide a powerful simulation-based approach to approximate the posterior by generating samples, they can demand significant computational resources and time, particularly when dealing with large datasets or high-dimensional parameter spaces.
Variational Inference (VI) offers a fundamentally different approach. Instead of simulating samples from the posterior, VI transforms the inference problem into an optimization problem. The central idea is to select a family of probability distributions, Q, over the latent variables z, where each distribution q(z)∈Q is designed to be tractable (easy to compute expectations, densities, etc.). We then seek the specific distribution q∗(z) within this family that is "closest" to the true, but intractable, posterior p(z∣x).
How do we quantify the "closeness" between our approximation q(z) and the true posterior p(z∣x)? The standard measure in VI is the Kullback-Leibler (KL) divergence, denoted KL(q∣∣p). For our specific case, it's defined as:
$$\mathrm{KL}(q(z)\,\|\,p(z \mid x)) = \int q(z) \log \frac{q(z)}{p(z \mid x)}\, dz$$

The KL divergence is always non-negative (KL(q∣∣p) ≥ 0), and it equals zero if and only if q(z) and p(z∣x) are identical almost everywhere. Minimizing this KL divergence means finding the distribution q(z) within our chosen family Q that best matches the true posterior.
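To make the definition concrete, here is a small numerical sketch (an illustrative construction, not from the text) that estimates a KL divergence by Monte Carlo as E_q[log q(z) − log p(z)] for two one-dimensional Gaussians, and compares it against the known closed form for Gaussian pairs. The particular means and standard deviations are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 1-D Gaussians (illustrative choice): q = N(0, 1), p = N(1, 2^2)
mq, sq = 0.0, 1.0
mp, sp = 1.0, 2.0

def log_normal(v, m, s):
    """Log density of N(m, s^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi * s**2) - (v - m)**2 / (2 * s**2)

# Monte Carlo estimate of KL(q || p) = E_q[log q(z) - log p(z)],
# using samples drawn from q
z = rng.normal(mq, sq, size=200_000)
kl_mc = np.mean(log_normal(z, mq, sq) - log_normal(z, mp, sp))

# Closed-form KL between two univariate Gaussians, for comparison
kl_exact = np.log(sp / sq) + (sq**2 + (mq - mp)**2) / (2 * sp**2) - 0.5

print(kl_mc, kl_exact)  # the two estimates should be close
```

Note the estimator samples from q, matching the direction KL(q∣∣p); reversing the roles of q and p gives a different (also non-negative) quantity.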
At first glance, minimizing KL(q(z)∣∣p(z∣x)) doesn't seem to solve our original problem. Calculating the KL divergence directly still requires computing the true posterior p(z∣x), which involves the problematic evidence term p(x).
However, we can algebraically rearrange the definition of the KL divergence. Let's expand the term inside the logarithm using the definition of conditional probability, p(z∣x) = p(x,z)/p(x):
$$\mathrm{KL}(q(z)\,\|\,p(z \mid x)) = \int q(z) \log \frac{q(z)\, p(x)}{p(x,z)}\, dz$$

We can separate the terms in the logarithm:
$$\mathrm{KL}(q(z)\,\|\,p(z \mid x)) = \int q(z) \log q(z)\, dz - \int q(z) \log p(x,z)\, dz + \int q(z) \log p(x)\, dz$$

Recognizing the integrals as expectations with respect to q(z), and noting that log p(x) is a constant with respect to z (so ∫ q(z) log p(x) dz = log p(x) ∫ q(z) dz = log p(x)), we get:
$$\mathrm{KL}(q(z)\,\|\,p(z \mid x)) = \mathbb{E}_{q(z)}[\log q(z)] - \mathbb{E}_{q(z)}[\log p(x,z)] + \log p(x)$$

Rearranging this equation gives us a significant relationship:
$$\log p(x) = \underbrace{\mathbb{E}_{q(z)}[\log p(x,z)] - \mathbb{E}_{q(z)}[\log q(z)]}_{\mathcal{L}(q)} + \mathrm{KL}(q(z)\,\|\,p(z \mid x))$$

The term denoted L(q) is known as the Evidence Lower Bound (ELBO). Since the KL divergence is always non-negative (KL(q∣∣p) ≥ 0), this equation tells us that the ELBO is always less than or equal to the log model evidence:
$$\log p(x) \ge \mathcal{L}(q)$$

This relationship is the cornerstone of variational inference.
Consider the equation log p(x) = L(q) + KL(q(z)∣∣p(z∣x)). The left side, log p(x), is the log evidence of our model given the data. For a fixed model and dataset, this value is constant, regardless of our choice of q(z). Therefore, maximizing the ELBO, L(q), with respect to q(z) must be equivalent to minimizing the KL divergence, KL(q(z)∣∣p(z∣x)).

Crucially, the ELBO, L(q) = E_{q(z)}[log p(x,z)] − E_{q(z)}[log q(z)], does not depend directly on the intractable evidence p(x) or the true posterior p(z∣x). It depends only on the joint distribution p(x,z) (which is typically defined by our model specification) and our approximating distribution q(z).
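As a concrete check (an illustrative sketch, not from the text), the decomposition can be verified numerically on a toy conjugate model, z ∼ N(0, 1) and x∣z ∼ N(z, σ²), where the evidence and posterior happen to be available in closed form. A Monte Carlo ELBO computed from the joint alone, added to the closed-form KL between q and the true posterior, recovers log p(x):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (illustrative): z ~ N(0, 1), x | z ~ N(z, sigma^2).
# Evidence and posterior are known, so we can check
# log p(x) = ELBO + KL(q || posterior) numerically.
sigma, x = 0.5, 2.0

def log_normal(v, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (v - m)**2 / (2 * s**2)

# Exact quantities for this conjugate model
log_evidence = log_normal(x, 0.0, np.sqrt(1 + sigma**2))  # log p(x)
post_m = x / (1 + sigma**2)                               # posterior mean
post_s = np.sqrt(sigma**2 / (1 + sigma**2))               # posterior std

# An arbitrary (deliberately poor) Gaussian approximation q(z)
mq, sq = 0.5, 0.8
z = rng.normal(mq, sq, size=400_000)

# ELBO = E_q[log p(x, z)] - E_q[log q(z)]: needs only the joint and q
log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, sigma)
elbo = np.mean(log_joint - log_normal(z, mq, sq))

# Closed-form KL(q || posterior) between two Gaussians
kl = np.log(post_s / sq) + (sq**2 + (mq - post_m)**2) / (2 * post_s**2) - 0.5

print(elbo + kl, log_evidence)  # the two should agree
```

Notice that computing the ELBO itself never touches p(x) or p(z∣x); those exact quantities appear here only to verify the identity.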
This transforms the inference problem into an optimization problem: find the distribution q∗(z) within the chosen family Q that maximizes the ELBO. The resulting q∗(z) serves as our approximation to the true posterior p(z∣x).
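To illustrate the optimization view, here is a minimal sketch (an illustrative construction, not an algorithm from the text) that maximizes the ELBO by plain gradient ascent for a toy conjugate model, z ∼ N(0, 1) and x∣z ∼ N(z, σ²), with a Gaussian variational family q(z) = N(m, s²). For this model the ELBO gradients have closed forms, and conjugacy lets us check the optimum against the exact posterior:

```python
import numpy as np

# Toy conjugate model (illustrative): z ~ N(0, 1), x | z ~ N(z, sigma^2);
# variational family q(z) = N(m, s^2)
sigma, x = 0.5, 2.0

def elbo_grad(m, s):
    # Closed-form ELBO gradients for this Gaussian model:
    # d/dm = (x - m)/sigma^2 - m,  d/ds = -s/sigma^2 - s + 1/s
    dm = (x - m) / sigma**2 - m
    ds = -s / sigma**2 - s + 1.0 / s
    return dm, ds

m, s = 0.0, 1.0        # initial variational parameters
lr = 0.01              # step size (ad hoc choice)
for _ in range(5000):  # plain gradient ascent on the ELBO
    dm, ds = elbo_grad(m, s)
    m += lr * dm
    s += lr * ds

# Exact posterior for this conjugate model, for comparison
post_m = x / (1 + sigma**2)
post_s = np.sqrt(sigma**2 / (1 + sigma**2))
print(m, post_m)  # optimized mean vs. posterior mean
print(s, post_s)  # optimized std vs. posterior std
```

Since the true posterior is Gaussian here, it lies inside the family Q and the optimizer recovers it exactly; in general Q will not contain p(z∣x), and q∗(z) is only the closest member of the family.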
Variational Inference finds the distribution q∗(z) within a tractable family Q that maximizes the Evidence Lower Bound (ELBO). This maximization is equivalent to minimizing the KL divergence between q∗(z) and the true posterior p(z∣x).
The practical challenge now lies in two choices: selecting a family Q that is tractable yet expressive enough to approximate the posterior well, and devising an efficient algorithm for maximizing the ELBO over that family.
The following sections will examine common choices for Q, such as the mean-field approximation, and explore algorithms like Coordinate Ascent Variational Inference (CAVI) and Stochastic Variational Inference (SVI) designed to efficiently maximize the ELBO.
© 2025 ApX Machine Learning