Directly computing the posterior distribution p(z∣x) in Bayesian models is often hindered by the intractable evidence term p(x)=∫p(x,z)dz. While MCMC methods provide a powerful simulation-based approach to approximate the posterior by generating samples, they can demand significant computational resources and time, particularly when dealing with large datasets or high-dimensional parameter spaces.
Variational Inference (VI) offers a fundamentally different approach. Instead of simulating samples from the posterior, VI transforms the inference problem into an optimization problem. The central idea is to select a family of probability distributions, Q, over the latent variables z, where each distribution q(z)∈Q is designed to be tractable (easy to compute expectations, densities, etc.). We then seek the specific distribution q∗(z) within this family that is "closest" to the true, but intractable, posterior p(z∣x).
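To make "tractable" concrete, here is a minimal sketch in Python of one possible family Q: diagonal Gaussians over z, from which we can draw samples and whose log-density we can evaluate exactly. The class name DiagonalGaussianQ and its interface are purely illustrative, not taken from any particular library.

```python
import numpy as np

# A minimal, illustrative member of a tractable family Q: a diagonal Gaussian
# over z. "Tractable" here means we can sample z ~ q(z) and evaluate log q(z).
# The class name and its interface are hypothetical, not from a library.
class DiagonalGaussianQ:
    def __init__(self, mu, log_sigma):
        self.mu = np.asarray(mu, dtype=float)
        self.log_sigma = np.asarray(log_sigma, dtype=float)

    def sample(self, n, rng):
        # Draw n samples from q(z) = N(mu, diag(sigma^2)).
        sigma = np.exp(self.log_sigma)
        return self.mu + sigma * rng.standard_normal((n, self.mu.size))

    def log_prob(self, z):
        # Exact log-density log q(z), summed over the dimensions of z.
        sigma = np.exp(self.log_sigma)
        return np.sum(
            -0.5 * np.log(2 * np.pi) - self.log_sigma
            - 0.5 * ((z - self.mu) / sigma) ** 2,
            axis=-1,
        )

rng = np.random.default_rng(0)
q = DiagonalGaussianQ(mu=[0.0, 0.0], log_sigma=[0.0, 0.0])
z = q.sample(5, rng)
print(q.log_prob(z))  # densities are cheap to evaluate, by construction
```

Each setting of mu and log_sigma picks out one member q(z)∈Q; variational inference will search over such parameters for the member closest to the posterior.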
Measuring Closeness: The KL Divergence
How do we quantify the "closeness" between our approximation q(z) and the true posterior p(z∣x)? The standard measure in VI is the Kullback-Leibler (KL) divergence, denoted KL(q∣∣p). For our specific case, it's defined as:
KL(q(z)∣∣p(z∣x))=∫q(z)log(q(z)/p(z∣x))dz
The KL divergence is always non-negative (KL(q∣∣p)≥0), and it equals zero if and only if q(z) and p(z∣x) are identical almost everywhere. Minimizing this KL divergence means finding the distribution q(z) within our chosen family Q that best matches the true posterior.
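As a quick numerical illustration (a sketch, not part of the derivation), the snippet below compares the closed-form KL divergence between two univariate Gaussians with a Monte Carlo estimate of Eq[logq(z)−logp(z)], and checks that the divergence vanishes when the two distributions coincide. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mu, sigma):
    # log density of N(mu, sigma^2) evaluated at x
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    # Closed-form KL(q || p) for univariate Gaussians q = N(mu_q, sigma_q^2), p = N(mu_p, sigma_p^2)
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2) - 0.5)

# Monte Carlo estimate of KL(q || p) = E_q[log q(z) - log p(z)], with q = N(0, 1), p = N(1, 1.5^2)
z = rng.normal(0.0, 1.0, size=100_000)
kl_mc = np.mean(log_normal_pdf(z, 0.0, 1.0) - log_normal_pdf(z, 1.0, 1.5))

print(kl_gauss(0.0, 1.0, 1.0, 1.5))   # exact value (about 0.35), strictly positive
print(kl_mc)                          # Monte Carlo estimate, close to the exact value
print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # identical distributions -> 0.0
```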
The Challenge and the Solution: Introducing the ELBO
At first glance, minimizing KL(q(z)∣∣p(z∣x)) doesn't seem to solve our original problem. Calculating the KL divergence directly still requires computing the true posterior p(z∣x), which involves the problematic evidence term p(x).
However, we can algebraically rearrange the definition of the KL divergence. Expanding the term inside the logarithm using the definition of conditional probability, p(z∣x)=p(x,z)/p(x), gives:
KL(q(z)∣∣p(z∣x))=∫q(z)log(q(z)/p(x,z))dz+∫q(z)logp(x)dz
Recognizing the integrals as expectations with respect to q(z), and noting that logp(x) is a constant with respect to z (so ∫q(z)logp(x)dz=logp(x)∫q(z)dz=logp(x)), we get:
logp(x)=L(q)+KL(q(z)∣∣p(z∣x)), where L(q)=Eq(z)[logp(x,z)]−Eq(z)[logq(z)]
The term denoted L(q) is known as the Evidence Lower Bound (ELBO). Since the KL divergence is always non-negative (KL(q∣∣p)≥0), this equation tells us that the ELBO is always less than or equal to the log model evidence:
logp(x)≥L(q)
This relationship is the foundation of variational inference.
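The bound can be checked numerically on a toy conjugate model where the evidence is available in closed form. The sketch below assumes a prior z∼N(0,1), a likelihood x∣z∼N(z,1), and a single observation; all names are illustrative. It estimates L(q) by Monte Carlo for several choices of q and confirms that each estimate stays at or below logp(x).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1), one observation x_obs.
# The evidence is then p(x) = N(x; 0, 2), so log p(x) is known exactly.
x_obs = 2.0
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x_obs**2 / (2 * 2.0)

def log_normal(v, mu, sigma):
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((v - mu) / sigma) ** 2

def elbo(mu_q, sigma_q, n=200_000):
    # Monte Carlo estimate of L(q) = E_q[log p(x, z)] - E_q[log q(z)] for q = N(mu_q, sigma_q^2)
    z = rng.normal(mu_q, sigma_q, size=n)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x_obs, z, 1.0)
    return np.mean(log_joint - log_normal(z, mu_q, sigma_q))

# Every choice of q stays at or below log p(x); the gap is exactly KL(q || p(z | x)).
# The exact posterior here is N(1, 0.5), so q = N(1, 0.7071^2) nearly attains the bound.
for mu_q, sigma_q in [(0.0, 1.0), (0.5, 1.0), (1.0, 0.7071)]:
    print(f"ELBO for q = N({mu_q}, {sigma_q}^2): {elbo(mu_q, sigma_q):.4f}  <=  log p(x) = {log_evidence:.4f}")
```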
Maximizing the ELBO as an Inference Strategy
Consider the equation logp(x)=L(q)+KL(q(z)∣∣p(z∣x)). The left side, logp(x), is the log evidence of our model given the data. For a fixed model and dataset, this value is constant, regardless of our choice of q(z). Therefore, maximizing the ELBO, L(q), with respect to q(z) must be equivalent to minimizing the KL divergence, KL(q(z)∣∣p(z∣x)).
Crucially, the ELBO, L(q)=Eq(z)[logp(x,z)]−Eq(z)[logq(z)], does not depend directly on the intractable evidence p(x) or the true posterior p(z∣x). It depends only on the joint distribution p(x,z) (which is typically defined by our model specification) and our approximating distribution q(z).
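This observation translates directly into a generic estimator: given samples from q(z) and the ability to evaluate logp(x,z) and logq(z), the ELBO can be approximated without ever touching p(x) or p(z∣x). The function below is a minimal sketch; the argument names log_joint, q_sample, and q_log_prob are placeholders, not a fixed API.

```python
import numpy as np

def elbo_estimate(log_joint, q_sample, q_log_prob, n=10_000):
    # Monte Carlo estimate of L(q) = E_q[log p(x, z)] - E_q[log q(z)].
    # Only the joint log p(x, z) and the approximation q are needed:
    # neither p(x) nor p(z | x) appears anywhere.
    z = q_sample(n)                       # samples z ~ q(z)
    return np.mean(log_joint(z) - q_log_prob(z))
```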
This transforms the inference problem into an optimization problem: find the distribution q∗(z) within the chosen family Q that maximizes the ELBO. The resulting q∗(z) serves as our approximation to the true posterior p(z∣x).
Variational Inference finds the distribution q∗(z) within a tractable family Q that maximizes the Evidence Lower Bound (ELBO). This maximization is equivalent to minimizing the KL divergence between q∗(z) and the true posterior p(z∣x).
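As a concrete, if deliberately simple, instance of this optimization view, the sketch below fits q(z)=N(μ,σ²) to the toy posterior from the earlier example by maximizing a Monte Carlo estimate of the ELBO with an off-the-shelf optimizer. Fixing the base noise via the reparameterization z=μ+σε makes the objective deterministic. This is only an illustration under those assumptions, not one of the algorithms discussed in the sections that follow.

```python
import numpy as np
from scipy.optimize import minimize

# Same toy model as above: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1), x_obs = 2.0.
# We search over q(z) = N(mu, sigma^2) by maximizing a Monte Carlo ELBO estimate.
rng = np.random.default_rng(0)
eps = rng.standard_normal(50_000)          # fixed base noise for z = mu + sigma * eps
x_obs = 2.0

def log_normal(v, mu, sigma):
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((v - mu) / sigma) ** 2

def negative_elbo(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    z = mu + sigma * eps                   # samples z ~ q(z)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x_obs, z, 1.0)
    return -np.mean(log_joint - log_normal(z, mu, sigma))

result = minimize(negative_elbo, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_opt, sigma_opt = result.x[0], np.exp(result.x[1])
print(mu_opt, sigma_opt)   # approaches the exact posterior N(1, 0.5), i.e. (1.0, ~0.707)
```

Because the true posterior here is N(1, 0.5) and lies inside the chosen family, the optimum recovers it essentially exactly; with a more restrictive family, q∗(z) would only be the closest member in KL divergence.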
The practical challenge now lies in:
Choosing an appropriate family of distributions Q that balances tractability and flexibility.
Developing algorithms to perform the optimization required to maximize L(q).
The following sections will examine common choices for Q, such as the mean-field approximation, and explore algorithms like Coordinate Ascent Variational Inference (CAVI) and Stochastic Variational Inference (SVI) designed to efficiently maximize the ELBO.