As introduced, Variational Inference (VI) seeks an approximation q(z) to the true posterior p(z∣x) by turning the inference problem into an optimization problem. The central quantity we optimize is the Evidence Lower Bound (ELBO). Let's derive this important objective function.
Our goal is to make q(z) as close as possible to the true posterior p(z∣x). A standard way to measure the dissimilarity between two probability distributions is the Kullback-Leibler (KL) divergence. We want to minimize the KL divergence from our approximation q(z) to the true posterior p(z∣x):
KL(q(z)∣∣p(z∣x))=∫q(z)log[q(z)/p(z∣x)]dz
Minimizing this KL divergence directly is difficult because it involves the true posterior p(z∣x), which itself contains the intractable evidence term p(x) since p(z∣x)=p(x,z)/p(x).
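To keep the quantities concrete, here is a minimal sketch on a toy model with a discrete latent variable z taking three values; the joint probabilities are hypothetical numbers chosen only for illustration. In such a tiny model the evidence p(x) is a cheap sum, but with a high-dimensional or continuous z this marginalization is exactly what becomes intractable.

```python
import numpy as np

# Hypothetical joint p(x, z) for one fixed observation x and z in {0, 1, 2}.
# These numbers are arbitrary illustrative values, not from any real model.
p_xz = np.array([0.10, 0.25, 0.05])   # p(x, z=k) for k = 0, 1, 2

# Evidence: p(x) = sum over z of p(x, z). Trivial here, intractable in general.
p_x = p_xz.sum()

# True posterior: p(z | x) = p(x, z) / p(x).
p_z_given_x = p_xz / p_x

print("p(x)     =", p_x)
print("p(z | x) =", p_z_given_x)   # sums to 1
```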
Let's expand the definition of the posterior within the KL divergence formula. Substituting p(z∣x)=p(x,z)/p(x) and splitting the logarithm gives:
KL(q(z)∣∣p(z∣x))=∫q(z)logq(z)dz−∫q(z)logp(x,z)dz+∫q(z)logp(x)dz
Let's examine each of the three terms:
∫q(z)logq(z)dz: This is the definition of the negative entropy of the distribution q(z), often written as −H(q). In expectation notation, it's Eq(z)[logq(z)].
∫q(z)logp(x,z)dz: This is the expectation of the log joint probability p(x,z) under the distribution q(z), written as Eq(z)[logp(x,z)].
∫q(z)logp(x)dz: Since logp(x) is a constant with respect to the integration variable z, we can pull it out of the integral. The remaining integral ∫q(z)dz is simply 1, because q(z) is a probability distribution. So, this term simplifies to logp(x).
Substituting these back into the equation for the KL divergence:
KL(q(z)∣∣p(z∣x))=Eq(z)[logq(z)]−Eq(z)[logp(x,z)]+logp(x)
Rearranging to isolate the log evidence:
logp(x)=Eq(z)[logp(x,z)]−Eq(z)[logq(z)]+KL(q(z)∣∣p(z∣x))
The first two terms on the right-hand side are denoted L(q) and called the Evidence Lower Bound (ELBO):
L(q)=Eq(z)[logp(x,z)]−Eq(z)[logq(z)]
This can also be written using the entropy H(q)=−Eq(z)[logq(z)] as:
L(q)=Eq(z)[logp(x,z)]+H(q)
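As a sanity check, the ELBO can be evaluated directly for a toy discrete model like the one sketched earlier; the probabilities below are again hypothetical. Both forms, the joint expectation minus Eq(z)[logq(z)] and the joint expectation plus the entropy, give the same number.

```python
import numpy as np

# Hypothetical joint p(x, z) for a fixed x, and a chosen variational distribution q(z).
p_xz = np.array([0.10, 0.25, 0.05])   # p(x, z=k)
q_z  = np.array([0.30, 0.50, 0.20])   # some q(z); sums to 1

# Form 1: L(q) = E_q[log p(x, z)] - E_q[log q(z)]
elbo_form1 = np.sum(q_z * np.log(p_xz)) - np.sum(q_z * np.log(q_z))

# Form 2: L(q) = E_q[log p(x, z)] + H(q), with H(q) = -E_q[log q(z)]
entropy_q  = -np.sum(q_z * np.log(q_z))
elbo_form2 = np.sum(q_z * np.log(p_xz)) + entropy_q

print(elbo_form1, elbo_form2)   # identical values
```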
Why is it a Lower Bound?
The KL divergence has the property that KL(A∣∣B)≥0 for any distributions A and B, with equality holding if and only if A=B. Therefore, KL(q(z)∣∣p(z∣x))≥0.
Looking back at the rearranged equation:
logp(x)=L(q)+KL(q(z)∣∣p(z∣x))
Since the KL term is non-negative, it immediately follows that:
logp(x)≥L(q)
This confirms that L(q) is indeed a lower bound on the logarithm of the model evidence.
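Continuing the same style of toy example (again with hypothetical numbers), we can verify numerically that logp(x) equals the ELBO plus the KL divergence, and therefore that the ELBO never exceeds logp(x).

```python
import numpy as np

p_xz = np.array([0.10, 0.25, 0.05])   # hypothetical joint p(x, z=k)
q_z  = np.array([0.30, 0.50, 0.20])   # some variational q(z)

log_p_x     = np.log(p_xz.sum())      # log evidence
p_z_given_x = p_xz / p_xz.sum()       # exact posterior (tractable in this toy case)

elbo = np.sum(q_z * (np.log(p_xz) - np.log(q_z)))
kl   = np.sum(q_z * (np.log(q_z) - np.log(p_z_given_x)))

print(log_p_x, elbo + kl)   # equal up to floating-point error
print(log_p_x >= elbo)      # True: the ELBO is a lower bound
```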
The Optimization Objective
The identity logp(x)=L(q)+KL(q(z)∣∣p(z∣x)) provides the key insight for variational inference. We want to find the distribution q(z) within our chosen family that minimizes the KL divergence KL(q(z)∣∣p(z∣x)).
Notice that the log evidence logp(x) is a constant with respect to the parameters defining q(z). Therefore, maximizing the ELBO L(q) is equivalent to minimizing the KL divergence KL(q(z)∣∣p(z∣x)).
Think of it like this: the total log evidence is fixed. It's composed of the ELBO and the KL divergence. If we increase the ELBO, the KL divergence must decrease by the same amount to keep the sum constant.
This is extremely useful because the ELBO, L(q)=Eq(z)[logp(x,z)]−Eq(z)[logq(z)], depends only on the joint distribution p(x,z) (which is typically computable) and our chosen variational distribution q(z). It crucially avoids the intractable evidence p(x).
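The fixed-sum picture can be seen by sweeping through a family of q distributions: the ELBO (computed from p(x,z) and q alone, never touching p(x)) and the KL divergence always add up to the same logp(x). The one-parameter family below is a hypothetical choice made only to illustrate the trade-off.

```python
import numpy as np

p_xz    = np.array([0.10, 0.25, 0.05])   # hypothetical joint p(x, z=k)
log_p_x = np.log(p_xz.sum())
p_post  = p_xz / p_xz.sum()

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

for phi in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    q    = softmax(phi * np.array([1.0, 2.0, 0.0]))   # a one-parameter q family
    elbo = np.sum(q * (np.log(p_xz) - np.log(q)))     # uses only p(x,z) and q
    kl   = np.sum(q * (np.log(q) - np.log(p_post)))
    print(f"phi={phi:+.1f}  ELBO={elbo:.4f}  KL={kl:.4f}  sum={elbo + kl:.4f}")
    # 'sum' equals log p(x) for every phi: raising the ELBO lowers the KL
```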
Variational inference thus proceeds by:
Choosing a family of distributions for q(z), often parameterized by some variational parameters ϕ. Let's write this as qϕ(z).
Finding the parameters ϕ that maximize the ELBO:
ϕ∗=argmaxϕ L(qϕ)
The resulting distribution qϕ∗(z) serves as our approximation to the true posterior p(z∣x). The value of the ELBO at the optimum, L(qϕ∗), also provides a lower bound on the log marginal likelihood, which can be useful for model comparison.
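Here is a minimal end-to-end sketch of this procedure on the same hypothetical toy joint: the variational family is qϕ(z)=softmax(ϕ) over the three latent values, and the ELBO is maximized by simple gradient ascent (with numerical gradients for clarity). Because this family can represent any distribution over three values, the optimum recovers the exact posterior and the ELBO reaches logp(x); with a restricted family it would stop strictly below.

```python
import numpy as np

p_xz = np.array([0.10, 0.25, 0.05])   # hypothetical joint p(x, z=k)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def elbo(phi):
    q = softmax(phi)
    return np.sum(q * (np.log(p_xz) - np.log(q)))   # E_q[log p(x,z)] - E_q[log q(z)]

def num_grad(f, phi, eps=1e-5):
    # Central-difference gradient, adequate for this tiny illustration.
    g = np.zeros_like(phi)
    for i in range(len(phi)):
        d = np.zeros_like(phi)
        d[i] = eps
        g[i] = (f(phi + d) - f(phi - d)) / (2 * eps)
    return g

phi = np.zeros(3)                      # variational parameters
for step in range(2000):               # gradient ascent on the ELBO
    phi += 0.1 * num_grad(elbo, phi)

q_opt = softmax(phi)
print("q*(z)           =", q_opt)
print("true posterior  =", p_xz / p_xz.sum())
print("ELBO at optimum =", elbo(phi), " log p(x) =", np.log(p_xz.sum()))
```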
Understanding the ELBO and its relationship to the KL divergence and the model evidence is fundamental to grasping how variational inference methods operate. The next sections will explore specific families for q(z) and algorithms for optimizing the ELBO.