While Collapsed Gibbs Sampling provides a way to perform posterior inference in Latent Dirichlet Allocation (LDA), it resamples one token assignment at a time, which can be computationally intensive for large corpora. Variational Bayes (VB) offers a deterministic alternative that often scales more effectively: instead of sampling, VB frames inference as an optimization problem.
The core challenge in LDA inference is computing the posterior distribution p(θ,z,β∣w), where θ represents the document-topic distributions, z the topic assignments for each word, β the topic-word distributions, and w the observed words. This posterior is intractable due to the complex dependencies between the variables.
Variational Bayes tackles this by introducing a simpler, approximate distribution, q(θ,z,β), chosen from a family of distributions Q. The goal is to find the distribution q∈Q that is "closest" to the true posterior p. Closeness is typically measured using the Kullback-Leibler (KL) divergence, KL(q∣∣p). Minimizing this KL divergence is equivalent to maximizing a lower bound on the log marginal likelihood of the data, known as the Evidence Lower Bound (ELBO):
$$\mathcal{L}(q) = \mathbb{E}_q[\log p(\theta, z, \beta, w)] - \mathbb{E}_q[\log q(\theta, z, \beta)] \le \log p(w)$$

Since $\log p(w) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$ and $\log p(w)$ does not depend on $q$, raising the ELBO necessarily lowers the KL divergence.

For LDA, a common choice for the variational family $Q$ is the mean-field family, which assumes the latent variables are mutually independent in the approximate posterior:
$$q(\theta, z, \beta \mid \gamma, \phi, \lambda) = \prod_{k=1}^{K} q(\beta_k \mid \lambda_k) \prod_{d=1}^{M} \left( q(\theta_d \mid \gamma_d) \prod_{n=1}^{N_d} q(z_{dn} \mid \phi_{dn}) \right)$$

Here:

- $q(\theta_d \mid \gamma_d)$ is a Dirichlet distribution over the topic proportions of document $d$, with variational parameter $\gamma_d$.
- $q(\beta_k \mid \lambda_k)$ is a Dirichlet distribution over the word distribution of topic $k$, with variational parameter $\lambda_k$.
- $q(z_{dn} \mid \phi_{dn})$ is a multinomial (categorical) distribution over the topic assignment of token $n$ in document $d$, with variational parameter $\phi_{dn}$.
The parameters γ, ϕ, and λ are the variational parameters we need to optimize to maximize the ELBO.
The mean-field approximation assumes independence between the latent variables (θ, β, z) under the variational distribution q, removing the dependencies that make the true posterior p intractable.
The ELBO can be maximized using an iterative coordinate ascent approach. We cycle through the variational parameters (ϕ, γ, λ), updating each one while holding the others fixed. The update equations are derived by taking the derivative of the ELBO with respect to each parameter, setting it to zero, and solving. The resulting updates often have an intuitive form, resembling the updates in the Expectation-Maximization (EM) algorithm.
Update $\phi_{dnk}$: This update estimates the probability that word $w_{dn}$ belongs to topic $k$. It depends on the current estimates of the expected log probabilities of the topic proportions for document $d$ (related to $\gamma_d$) and the expected log probabilities of word $w_{dn}$ under topic $k$ (related to $\lambda_k$).

$$\phi_{dnk} \propto \exp\left( \mathbb{E}_{q(\theta_d \mid \gamma_d)}[\log \theta_{dk}] + \mathbb{E}_{q(\beta_k \mid \lambda_k)}[\log \beta_{k, w_{dn}}] \right)$$

The expectations involve the digamma function $\Psi$, as $\mathbb{E}[\log \theta_{dk}] = \Psi(\gamma_{dk}) - \Psi\!\left(\sum_{j=1}^{K} \gamma_{dj}\right)$. After calculating the right-hand side for all $k$, $\phi_{dn}$ is normalized so that $\sum_k \phi_{dnk} = 1$.
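As a concrete sketch of this update, the following NumPy/SciPy snippet computes $\phi$ for a single document. The function name `update_phi`, the array names, and the shapes are illustrative assumptions, not part of any particular library.

```python
import numpy as np
from scipy.special import digamma

def update_phi(gamma_d, lam, word_ids):
    """Compute phi for one document (illustrative sketch).

    gamma_d  : (K,)   variational Dirichlet parameters for this document
    lam      : (K, V) variational Dirichlet parameters for all topics
    word_ids : (N_d,) vocabulary index of each token in the document
    Returns phi with shape (N_d, K), each row summing to 1.
    """
    # E_q[log theta_dk] = Psi(gamma_dk) - Psi(sum_j gamma_dj)
    e_log_theta = digamma(gamma_d) - digamma(gamma_d.sum())                            # (K,)
    # E_q[log beta_k,w] = Psi(lambda_kw) - Psi(sum_v lambda_kv), evaluated per token
    e_log_beta = digamma(lam[:, word_ids]) - digamma(lam.sum(axis=1, keepdims=True))   # (K, N_d)

    # log phi_dnk is proportional to E[log theta_dk] + E[log beta_k, w_dn]
    log_phi = e_log_theta[:, None] + e_log_beta                                        # (K, N_d)
    # Normalize in log space for numerical stability, then exponentiate
    log_phi -= log_phi.max(axis=0, keepdims=True)
    phi = np.exp(log_phi)
    phi /= phi.sum(axis=0, keepdims=True)
    return phi.T                                                                       # (N_d, K)
```

Subtracting the per-token maximum before exponentiating keeps the normalization numerically stable when the log terms are strongly negative.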
Update $\gamma_{dk}$: This update aggregates the responsibilities ($\phi_{dnk}$) assigned to topic $k$ by all words within document $d$, adding the original Dirichlet prior parameter $\alpha_k$. It represents the expected number of words in document $d$ assigned to topic $k$.

$$\gamma_{dk} = \alpha_k + \sum_{n=1}^{N_d} \phi_{dnk}$$

Update $\lambda_{kv}$: This update aggregates the responsibilities ($\phi_{dnk}$) assigned to topic $k$ for all instances of word $v$ (where $w_{dn} = v$) across all documents, adding the original Dirichlet prior parameter $\eta_v$. It represents the expected number of times word $v$ was generated by topic $k$.

$$\lambda_{kv} = \eta_v + \sum_{d=1}^{M} \sum_{n:\, w_{dn} = v} \phi_{dnk}$$

These updates are iterated until the ELBO converges, meaning the variational parameters stabilize.
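Putting the three updates together, here is a minimal sketch of batch coordinate ascent over a corpus. It reuses the hypothetical `update_phi` helper from above, represents each document as an array of vocabulary indices, and, for simplicity, checks convergence on the change in $\gamma$ rather than evaluating the ELBO directly; the initialization and iteration counts are illustrative choices.

```python
import numpy as np

def batch_vb(docs, K, V, alpha=0.1, eta=0.01, max_iters=100, tol=1e-3):
    """Batch variational Bayes for LDA (illustrative sketch).

    docs : list of 1-D integer arrays, each holding the vocabulary
           indices of one document's tokens
    K, V : number of topics and vocabulary size
    Returns (gamma, lam) with shapes (M, K) and (K, V).
    """
    rng = np.random.default_rng(0)
    M = len(docs)
    gamma = rng.gamma(100.0, 0.01, size=(M, K))   # document-topic parameters
    lam = rng.gamma(100.0, 0.01, size=(K, V))     # topic-word parameters

    for _ in range(max_iters):
        gamma_old = gamma.copy()
        lam_new = np.full((K, V), eta)            # start from the prior eta_v

        for d, word_ids in enumerate(docs):
            # Local step: alternate the phi and gamma updates for document d
            for _ in range(20):
                phi = update_phi(gamma[d], lam, word_ids)   # (N_d, K)
                gamma[d] = alpha + phi.sum(axis=0)          # gamma_dk update
            # Accumulate each token's responsibilities into lambda
            for k in range(K):
                np.add.at(lam_new[k], word_ids, phi[:, k])

        lam = lam_new                                        # lambda_kv update
        if np.mean(np.abs(gamma - gamma_old)) < tol:
            break
    return gamma, lam
```

In practice the inner loop over $\phi$ and $\gamma$ is usually stopped early once $\gamma_d$ stops changing, and the ELBO is tracked to monitor overall convergence.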
Once converged, the optimized variational parameters provide approximations to the posterior distributions:

- $\gamma_d$ parameterizes a Dirichlet approximating the posterior over document $d$'s topic proportions $\theta_d$, with mean $\mathbb{E}_q[\theta_{dk}] = \gamma_{dk} / \sum_{j} \gamma_{dj}$.
- $\lambda_k$ parameterizes a Dirichlet approximating the posterior over topic $k$'s word distribution $\beta_k$, with mean $\mathbb{E}_q[\beta_{kv}] = \lambda_{kv} / \sum_{v'} \lambda_{kv'}$.
- $\phi_{dn}$ gives the approximate posterior probability of each topic for token $w_{dn}$.
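For example, point estimates of the document-topic and topic-word distributions follow from normalizing the Dirichlet parameters; this tiny sketch assumes the `gamma` and `lam` arrays produced by the batch sketch above:

```python
# Posterior-mean point estimates under the variational Dirichlets
theta_hat = gamma / gamma.sum(axis=1, keepdims=True)   # (M, K) document-topic proportions
beta_hat = lam / lam.sum(axis=1, keepdims=True)        # (K, V) topic-word distributions
```

The table below contrasts this optimization-based approach with Collapsed Gibbs Sampling.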
| Feature | Variational Bayes (VB) | Collapsed Gibbs Sampling |
|---|---|---|
| Approach | Optimization (maximize the ELBO) | Sampling |
| Nature | Deterministic | Stochastic |
| Speed | Generally faster per iteration | Can be slower per iteration |
| Scalability | Better for large datasets | Can struggle with large datasets |
| Convergence | Converges to a local optimum of the ELBO | Converges to the true posterior (in the limit) |
| Accuracy | Provides an approximation (biased) | Asymptotically exact |
| Implementation | Update equations can be complex | Conceptually simpler sampling |
| Parallelism | More amenable to certain parallelization strategies | Can be parallelized (e.g., per document) |
VB often converges much faster than Gibbs sampling, especially on large datasets. However, because it optimizes a lower bound over a restricted distributional family (the mean-field assumption), it may converge to a solution that approximates the true posterior less faithfully than a well-run Gibbs sampler. In particular, the mean-field assumption tends to underestimate the variance of the posterior distribution.
For truly massive datasets, even standard batch VB can be too slow as it requires processing the entire corpus in each update step for λ. Stochastic Variational Inference (SVI) addresses this by using stochastic gradient ascent on the ELBO. It processes mini-batches of documents at each step, allowing for online learning and significant speedups, making VB applicable to web-scale datasets.
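The sketch below shows what one SVI step might look like under the same illustrative setup, again using the hypothetical `update_phi` helper. The mini-batch is scaled up as if the whole corpus looked like it, and the global parameters are blended with a standard Robbins-Monro step size $\rho_t = (\tau_0 + t)^{-\kappa}$; the names and default values are assumptions for the example.

```python
import numpy as np

def svi_step(docs_batch, gamma_batch, lam, t, M, alpha=0.1, eta=0.01,
             tau0=1.0, kappa=0.7):
    """One stochastic variational inference step on a mini-batch (sketch).

    docs_batch  : list of 1-D integer arrays (token ids) for the mini-batch
    gamma_batch : (B, K) local parameters for these documents
    lam         : (K, V) current global topic-word parameters
    t           : step counter;  M : total number of documents in the corpus
    Returns the updated (gamma_batch, lam).
    """
    B = len(docs_batch)
    K, V = lam.shape
    lam_hat = np.full((K, V), eta)

    for b, word_ids in enumerate(docs_batch):
        # Local step: fit phi and gamma for each mini-batch document
        for _ in range(20):
            phi = update_phi(gamma_batch[b], lam, word_ids)
            gamma_batch[b] = alpha + phi.sum(axis=0)
        for k in range(K):
            np.add.at(lam_hat[k], word_ids, phi[:, k])

    # Scale the mini-batch statistics as if the whole corpus looked like it
    lam_hat = eta + (M / B) * (lam_hat - eta)

    # Global step: blend old and new estimates with a decaying step size
    rho = (tau0 + t) ** (-kappa)                 # Robbins-Monro schedule
    lam = (1.0 - rho) * lam + rho * lam_hat
    return gamma_batch, lam
```

Because each step touches only a mini-batch, the global $\lambda$ can begin improving long before the corpus has been seen even once.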
In summary, Variational Bayes provides a powerful and scalable inference framework for LDA. By framing inference as optimization and using a factorized approximation, it allows us to efficiently estimate topic models from large text collections, complementing the sampling-based approaches like Collapsed Gibbs Sampling.