Having established the Bayesian formulation of Latent Dirichlet Allocation (LDA) in the previous section, we now turn to the challenge of performing posterior inference. Our goal is to compute the posterior distribution of the latent variables, primarily the topic assignments z, given the observed words w and the hyperparameters α and β. That is, we want p(z,θ,ϕ∣w,α,β).
While standard Gibbs sampling could be applied by iteratively sampling each latent variable (z, θ, ϕ) from its conditional distribution, the high dimensionality of the continuous parameters θ (document-topic distributions) and ϕ (topic-word distributions) makes this inefficient. Furthermore, we are often most interested in the topic assignments z themselves, or the expected values of θ and ϕ.
Collapsed Gibbs sampling offers an elegant and often effective alternative for LDA. The core idea is to integrate out, or "collapse," the continuous parameters θ and ϕ analytically, leveraging the conjugacy between the Dirichlet priors and the Multinomial likelihoods inherent in the LDA model. This leaves us with a Gibbs sampler that only needs to iterate through the discrete topic assignments $z_{d,n}$ for each word n in each document d.
The heart of the collapsed Gibbs sampler for LDA is the conditional probability of assigning a specific word token $(d,n)$ (the n-th word in document d) to a particular topic k, given all other topic assignments $z_{\neg(d,n)}$, the observed words w, and the hyperparameters α and β. We denote the vocabulary word corresponding to $w_{d,n}$ as v.
Using Bayes' theorem and exploiting the conditional independencies and Dirichlet-Multinomial conjugacy, we can derive this conditional probability. We integrate out $\theta_d$ (the topic proportions for document d) and $\phi_k$ (the word distribution for topic k):
$$p(z_{d,n}=k \mid z_{\neg(d,n)}, w, \alpha, \beta) \;\propto\; p(w_{d,n}=v \mid z_{d,n}=k, z_{\neg(d,n)}, \beta) \times p(z_{d,n}=k \mid z_{d,\neg n}, \alpha)$$
Let's break down the two terms on the right-hand side:
Document-Topic Term $p(z_{d,n}=k \mid z_{d,\neg n}, \alpha)$: This term reflects how likely topic k is in document d, considering the assignments of other words in the same document ($z_{d,\neg n}$) and the document-topic prior α. Integrating out $\theta_d$ yields a probability proportional to the count of words already assigned to topic k in document d (excluding the current word n), plus the prior parameter $\alpha_k$. Let $N_{d,k}^{\neg n}$ be the count of words in document d (excluding word n) assigned to topic k. Assuming a symmetric prior α for simplicity (i.e., $\alpha_k = \alpha$ for all k):
$$p(z_{d,n}=k \mid z_{d,\neg n}, \alpha) \;\propto\; N_{d,k}^{\neg n} + \alpha$$
This term favors assigning the word to topics already prevalent in the document.
Topic-Word Term $p(w_{d,n}=v \mid z_{d,n}=k, z_{\neg(d,n)}, \beta)$: This term reflects how likely the specific word v is under topic k, considering all other word assignments across the corpus ($z_{\neg(d,n)}$) and the topic-word prior β. Integrating out ϕ yields a probability proportional to the count of times word v has been assigned to topic k elsewhere in the corpus, plus the prior parameter $\beta_v$. Let $N_{k,v}^{\neg(d,n)}$ be the count of word v assigned to topic k across all documents, excluding the current instance $(d,n)$, and let $N_k^{\neg(d,n)}$ be the total count of words assigned to topic k, again excluding the current instance. Assuming a symmetric prior β (i.e., $\beta_v = \beta$ for all v), where V is the vocabulary size:
$$p(w_{d,n}=v \mid z_{d,n}=k, z_{\neg(d,n)}, \beta) \;\propto\; \frac{N_{k,v}^{\neg(d,n)} + \beta}{N_k^{\neg(d,n)} + V\beta}$$
This term favors assigning the word to topics that frequently generate this specific word type v.
Combining these, the full conditional probability for sampling the topic assignment $z_{d,n}$ is:
$$p(z_{d,n}=k \mid z_{\neg(d,n)}, w_{d,n}=v, \alpha, \beta) \;\propto\; \left(N_{d,k}^{\neg n} + \alpha\right) \times \frac{N_{k,v}^{\neg(d,n)} + \beta}{N_k^{\neg(d,n)} + V\beta}$$
This formula provides the unnormalized probability of assigning the current word token $w_{d,n}$ to each topic k. We compute this value for all K topics and then normalize to form a valid probability distribution, from which we sample the new topic assignment.
Figure: Dependencies in the collapsed Gibbs sampling update for the topic assignment $z_{d,n}=k$. The probability depends on counts related to the document ($N_{d,k}$) and counts related to the topic and word type ($N_{k,v}$, $N_k$), modulated by the priors α and β.
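To make the update concrete, here is a minimal sketch in Python/NumPy of how this conditional can be evaluated for a single word token. The function name `conditional_topic_probs` and the count arrays `N_dk`, `N_kv`, and `N_k` are illustrative choices (not from the text above); the counts are assumed to already exclude the current token, and the priors are assumed symmetric.

```python
import numpy as np

def conditional_topic_probs(d, v, N_dk, N_kv, N_k, alpha, beta):
    """Normalized p(z_{d,n} = k | ...) over all K topics for word v in document d.

    Assumes symmetric priors and that the counts already exclude the
    current token (d, n):
      N_dk : (D, K) document-topic counts
      N_kv : (K, V) topic-word counts
      N_k  : (K,)   total tokens assigned to each topic
    """
    V = N_kv.shape[1]
    doc_term = N_dk[d] + alpha                           # N_{d,k}^{¬n} + α
    word_term = (N_kv[:, v] + beta) / (N_k + V * beta)   # (N_{k,v}^{¬(d,n)} + β) / (N_k^{¬(d,n)} + Vβ)
    probs = doc_term * word_term                         # unnormalized conditional for each topic k
    return probs / probs.sum()                           # normalize over the K topics
```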
The algorithm proceeds as follows:
Initialization: Assign each word token $z_{d,n}$ a topic chosen uniformly at random from $\{1, \dots, K\}$, then build the count statistics $N_{d,k}$, $N_{k,v}$, and $N_k$ from these initial assignments.
Iteration (MCMC Sampling): In each sweep, visit every word token $(d,n)$ in turn: remove its current assignment from the counts, compute the conditional probability above for each of the K topics, sample a new topic from the normalized distribution, and add the token back into the counts under that topic. Repeat for many sweeps, discarding an initial burn-in period (see the sketch after this list).
Output: The final topic assignments z (or several post-burn-in samples) together with the associated count matrices, from which θ and ϕ can be estimated as described below.
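As a rough sketch of these steps, assuming a corpus represented as a list of documents, each a list of vocabulary indices, and reusing the hypothetical `conditional_topic_probs` function from the earlier sketch:

```python
def initialize(docs, K, V, rng):
    """Random topic assignments plus the count matrices they induce."""
    D = len(docs)
    N_dk = np.zeros((D, K))
    N_kv = np.zeros((K, V))
    N_k = np.zeros(K)
    z = []
    for d, doc in enumerate(docs):
        z_d = rng.integers(K, size=len(doc))   # random initial topics for document d
        for v, k in zip(doc, z_d):
            N_dk[d, k] += 1
            N_kv[k, v] += 1
            N_k[k] += 1
        z.append(z_d)
    return z, N_dk, N_kv, N_k

def gibbs_sweep(docs, z, N_dk, N_kv, N_k, alpha, beta, rng):
    """One full pass of collapsed Gibbs sampling over every word token."""
    K = N_kv.shape[0]
    for d, doc in enumerate(docs):
        for n, v in enumerate(doc):
            k_old = z[d][n]
            # Remove the current token from the counts (the ¬(d,n) statistics).
            N_dk[d, k_old] -= 1
            N_kv[k_old, v] -= 1
            N_k[k_old] -= 1
            # Sample a new topic from the full conditional derived above.
            probs = conditional_topic_probs(d, v, N_dk, N_kv, N_k, alpha, beta)
            k_new = rng.choice(K, p=probs)
            # Add the token back under its new assignment.
            z[d][n] = k_new
            N_dk[d, k_new] += 1
            N_kv[k_new, v] += 1
            N_k[k_new] += 1
```

Running the sampler then amounts to calling `gibbs_sweep` repeatedly, discarding an initial burn-in before collecting samples.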
Although θ and ϕ were integrated out during sampling, we can estimate their posterior expectations using the final counts from the sampler (typically from the last iteration after sufficient burn-in, or averaged over several post-burn-in samples).
The expected document-topic distribution for document d is:
$$\hat{\theta}_{d,k} = \frac{N_{d,k} + \alpha_k}{\sum_{k'=1}^{K} \left(N_{d,k'} + \alpha_{k'}\right)}$$
The expected topic-word distribution for topic k is:
$$\hat{\phi}_{k,v} = \frac{N_{k,v} + \beta_v}{\sum_{v'=1}^{V} \left(N_{k,v'} + \beta_{v'}\right)}$$
These estimated distributions $\hat{\theta}$ and $\hat{\phi}$ represent the learned topic mixtures for each document and the word probabilities defining each topic, respectively.
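With symmetric priors, both estimates reduce to row-wise normalizations of the count matrices. A minimal sketch, continuing the hypothetical NumPy setup from the earlier sketches:

```python
def estimate_theta_phi(N_dk, N_kv, alpha, beta):
    """Posterior-mean estimates of θ and ϕ from the count matrices,
    assuming symmetric priors α and β."""
    theta = (N_dk + alpha) / (N_dk + alpha).sum(axis=1, keepdims=True)  # θ̂_{d,k}
    phi = (N_kv + beta) / (N_kv + beta).sum(axis=1, keepdims=True)      # ϕ̂_{k,v}
    return theta, phi
```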
Collapsed Gibbs sampling is a widely used inference technique for LDA due to its relative simplicity and the fact that it leverages the model's conjugacy properties effectively. By avoiding direct sampling of the continuous parameters, it can sometimes explore the posterior distribution of topic assignments z more efficiently than a standard Gibbs sampler.
However, it is still an MCMC method. Convergence needs to be assessed (e.g., by monitoring the log-likelihood of the data or topic coherence metrics), and a suitable burn-in period is required. The sampler processes words sequentially, which can lead to slow mixing, especially on large datasets or with a high number of topics. Correlations between assignments can mean that many iterations are needed to obtain independent samples from the posterior.
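One simple, illustrative way to monitor progress is to track the log-likelihood of the training corpus under the current estimates $\hat{\theta}$ and $\hat{\phi}$ after each sweep; this is just one of several reasonable diagnostics, and the function below is a sketch using the hypothetical names from the previous examples.

```python
def corpus_log_likelihood(docs, theta, phi):
    """Log-likelihood of the corpus under the current estimates θ̂ and ϕ̂."""
    ll = 0.0
    for d, doc in enumerate(docs):
        for v in doc:
            # p(w_{d,n} = v) = Σ_k θ̂_{d,k} ϕ̂_{k,v}
            ll += np.log(theta[d] @ phi[:, v])
    return ll
```

A curve that flattens over successive sweeps suggests the sampler has reached a stable region, though it does not by itself guarantee convergence.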
Despite these potential limitations, Collapsed Gibbs sampling provides a solid baseline and a conceptually clear way to perform Bayesian inference for LDA. It contrasts with optimization-based approaches such as Variational Bayes, which we explore next; those methods trade the variability of sampling for potentially faster convergence via a deterministic approximation.