While exact inference algorithms like the Junction Tree provide precise answers for probabilistic queries in PGMs, they often hit a computational wall. For graphs with high treewidth (densely connected structures) or models involving many continuous variables without convenient conjugate relationships, the computational cost of exact inference becomes prohibitive. Calculating marginal or conditional probabilities like P(Xi∣E) or P(Xi,Xj∣E) can require summing or integrating over an exponentially large state space. This is particularly true for the large, complex graphical models frequently encountered in modern machine learning applications.
When exact inference is intractable, we turn to approximate inference strategies. These methods trade theoretical exactness for computational feasibility, aiming to provide sufficiently accurate estimates of the desired posterior probabilities or expectations. The two dominant families of approximate inference techniques, which you've encountered in earlier chapters, are Markov Chain Monte Carlo (MCMC) methods and Variational Inference (VI). Let's see how they are adapted for large PGMs.
MCMC methods approximate the target posterior distribution P(X∣E) (where X represents the set of all unobserved variables and E is the observed evidence) by constructing a Markov chain whose stationary distribution is the target posterior. After a sufficient "burn-in" period, samples drawn from the chain can be treated as (correlated) samples from P(X∣E), allowing us to estimate posterior marginals, expectations, or other quantities of interest.
Gibbs Sampling in PGMs
Gibbs sampling is particularly well-suited for PGMs because it leverages the conditional independence relationships encoded in the graph structure. Recall that Gibbs sampling iteratively samples each variable Xi from its conditional distribution given the current state of all other variables, P(Xi∣X−i,E).
A significant simplification arises from the PGM structure: a variable Xi is conditionally independent of all other variables given its Markov Blanket, denoted MB(Xi). The Markov Blanket consists of its parents, its children, and its children's other parents (sometimes called "spouses"). Therefore, the sampling step simplifies to drawing from:
P(Xi∣X−i, E) = P(Xi∣MB(Xi), E)

This local computation makes Gibbs sampling efficient when sampling from P(Xi∣MB(Xi), E) is easy, which is often the case for discrete PGMs or models with conjugate priors.
The Markov Blanket of node Xi (red) includes its parents (P1,P2), its children (C1,C2), and its children's other parents (S1,S2), all highlighted in yellow. For Gibbs sampling, we only need the values of these nodes to sample a new state for Xi.
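To make the Markov-blanket update concrete, here is a minimal sketch of Gibbs sampling on a toy three-node chain A → B → C with binary variables, where C = 1 is observed. The probability values and the chain length are illustrative choices, not part of the text. Note how each update only touches the variable's Markov blanket: B is resampled from P(B∣A, C) ∝ P(B∣A)P(C∣B), and A from P(A∣B) ∝ P(A)P(B∣A).

```python
import random

# Toy chain A -> B -> C with binary variables; C = 1 is the observed evidence.
# CPDs (illustrative values): P(A=1), P(B=1|A), P(C=1|B)
p_a = 0.3
p_b_given_a = {1: 0.8, 0: 0.2}
p_c_given_b = {1: 0.9, 0: 0.1}

def bernoulli(p, rng):
    return 1 if rng.random() < p else 0

def gibbs(n_iters=30000, burn_in=2000, seed=0):
    rng = random.Random(seed)
    a, b = 1, 1          # arbitrary initial state; C stays clamped to 1
    sum_a = sum_b = kept = 0
    for t in range(n_iters):
        # Sample B from P(B | MB(B)) = P(B | A, C) ∝ P(B|A) P(C=1|B)
        w1 = p_b_given_a[a] * p_c_given_b[1]
        w0 = (1 - p_b_given_a[a]) * p_c_given_b[0]
        b = bernoulli(w1 / (w1 + w0), rng)
        # Sample A from P(A | MB(A)) = P(A | B) ∝ P(A) P(B|A)
        w1 = p_a * (p_b_given_a[1] if b else 1 - p_b_given_a[1])
        w0 = (1 - p_a) * (p_b_given_a[0] if b else 1 - p_b_given_a[0])
        a = bernoulli(w1 / (w1 + w0), rng)
        if t >= burn_in:   # discard burn-in, then accumulate marginal counts
            sum_a += a
            sum_b += b
            kept += 1
    return sum_a / kept, sum_b / kept

p_a_post, p_b_post = gibbs()
```

For this tiny model the exact answers are available by enumeration (P(A=1∣C=1) ≈ 0.550, P(B=1∣C=1) ≈ 0.846), so the sampled estimates can be checked directly; in a realistic PGM, enumeration is exactly what Gibbs sampling lets us avoid.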
Practical Considerations for MCMC in PGMs:

- Convergence can be hard to diagnose; run multiple chains and discard a burn-in period before collecting samples.
- Successive samples are correlated, so the effective sample size is smaller than the raw number of draws.
- Strongly coupled variables can mix poorly under single-site updates; blocking correlated variables into joint updates often helps.
Variational Inference reframes inference as an optimization problem. Instead of sampling, VI seeks the distribution Q(X) within a predefined family Q that best approximates the true posterior P(X∣E). "Best" is typically measured by minimizing the Kullback-Leibler (KL) divergence, KL(Q(X)∣∣P(X∣E)), which is equivalent to maximizing the Evidence Lower Bound (ELBO), as detailed in Chapter 3.
Mean-Field Variational Inference
The most common choice for the approximating family Q is the mean-field family, where the distribution Q(X) factorizes completely over the individual variables (or sometimes groups of variables):
Q(X) = ∏_{i=1}^{N} Q_i(X_i)

This factorization assumes independence among the variables in the approximating distribution, even though they may be dependent in the true posterior. While this is a strong assumption and a source of approximation error, it leads to tractable optimization algorithms such as Coordinate Ascent Variational Inference (CAVI).
Under the mean-field assumption, the optimal update for a single factor Q*_j(X_j) is given by:

log Q*_j(X_j) ∝ E_{Q_{-j}}[log P(X, E)]

Here, E_{Q_{-j}}[·] denotes the expectation taken with respect to all other factors Q_i(X_i) for i ≠ j. The term log P(X, E) is the log of the joint probability of all variables (hidden X and observed E), which can be readily obtained from the PGM definition (a product of conditional probability distributions or factors).
Crucially, because of the PGM structure, the expectation EQ−j[logP(X,E)] often simplifies. The terms in logP(X,E) that involve Xj typically only depend on Xj and its neighbours in the graph. The expectation only needs to be computed over the factors Qk(Xk) corresponding to these neighbouring variables. CAVI iteratively updates each Qj∗(Xj) using this rule until the ELBO converges.
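The CAVI loop can be illustrated on a classic conjugate toy model: data drawn from a Gaussian with unknown mean μ and precision τ, with prior τ ~ Gamma(a0, b0) and μ∣τ ~ N(μ0, (λ0 τ)^{-1}), and mean-field approximation Q(μ, τ) = Q(μ)Q(τ). The hyperparameter values and data below are illustrative assumptions; the closed-form updates themselves follow from the general rule above, since log P(X, E) is a sum of local terms.

```python
import random

# Synthetic data from N(2, 1); the sketch recovers the mean and precision.
rng = random.Random(1)
x = [rng.gauss(2.0, 1.0) for _ in range(200)]
N = len(x)
sx = sum(x)                      # sufficient statistics of the data
sx2 = sum(v * v for v in x)

# Prior hyperparameters (assumed for this sketch)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Variational factors: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n)
e_tau = a0 / b0                              # initial E[tau]
mu_n = (lam0 * mu0 + sx) / (lam0 + N)        # optimal mean; independent of tau
a_n = a0 + (N + 1) / 2                       # shape update is also fixed

for _ in range(50):                          # CAVI: alternate the two updates
    lam_n = (lam0 + N) * e_tau               # update q(mu) given E[tau]
    e_mu = mu_n                              # moments of q(mu)
    e_mu2 = mu_n ** 2 + 1.0 / lam_n
    # Update q(tau) given moments of q(mu):
    # b_n = b0 + 0.5 * E_mu[ lam0 (mu - mu0)^2 + sum_i (x_i - mu)^2 ]
    b_n = b0 + 0.5 * (lam0 * (e_mu2 - 2 * mu0 * e_mu + mu0 ** 2)
                      + sx2 - 2 * e_mu * sx + N * e_mu2)
    e_tau = a_n / b_n                        # refreshed E[tau] for next sweep
```

After convergence, E[μ] = mu_n lands near the true mean 2.0 and E[τ] = a_n/b_n near the true precision 1.0. Each update only needs expectations under the other factor, mirroring how, in a larger PGM, a factor's update touches only the Q_k of its neighbours.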
Practical Considerations for VI in PGMs:

- The mean-field family cannot represent posterior correlations between variables, and VI typically underestimates posterior variance.
- The ELBO is generally non-convex, so CAVI converges only to a local optimum; initialization matters.
- For large datasets, Stochastic Variational Inference (SVI) replaces full-data updates with minibatch gradient estimates of the ELBO, and Black-Box VI (BBVI) drops the requirement of conjugate, closed-form updates.
The choice between MCMC and VI depends heavily on the specific problem, the structure of the PGM, the size of the data, and the requirements for accuracy versus speed.
In many practical scenarios involving large PGMs, VI (especially SVI or BBVI) provides a valuable tool for obtaining reasonably good inference results within acceptable timeframes, even if it doesn't achieve the theoretical guarantees of MCMC. Understanding the trade-offs between these powerful approximate inference techniques allows you to select the most appropriate method for your specific modeling task.
© 2025 ApX Machine Learning