Latent Dirichlet Allocation (LDA) stands as a prominent example of a Probabilistic Graphical Model applied to text analysis, specifically for uncovering thematic structures, often called "topics," within a collection of documents (a corpus). It operates on the principle of a generative process, meaning it describes a probabilistic mechanism by which the documents could have been created. Understanding this generative story is fundamental to grasping the Bayesian formulation of LDA.
The core idea behind LDA is twofold:
- Each document is modeled as a mixture of a fixed number of topics. For example, a news article about technology might be 70% "Technology," 20% "Business," and 10% "Politics."
- Each topic is modeled as a distribution over words in the vocabulary. The "Technology" topic might assign high probabilities to words like "software," "cloud," "AI," and "network," while the "Business" topic favors words like "stock," "market," "profit," and "company."
LDA treats both the topic mixture for each document and the word distribution for each topic as latent (unobserved) random variables. Furthermore, it assigns a specific topic to each word instance within each document. The only observed data are the words themselves. The goal of inference, which we'll discuss in subsequent sections, is to recover these latent structures (topic mixtures, topic-word distributions, and word-topic assignments) given the observed words.
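To make these two ingredients concrete, the following minimal Python sketch represents the illustrative numbers above as plain data structures; the word probabilities for the "Technology" topic are invented for illustration, and the dictionary is truncated rather than covering a full vocabulary.

```python
# Toy illustration only: a document-topic mixture and (part of) a topic-word distribution.

# Each document is a distribution over K topics; the proportions sum to 1.
doc_topic_mixture = {"Technology": 0.70, "Business": 0.20, "Politics": 0.10}

# Each topic is a distribution over the whole vocabulary; only a few
# (invented) high-probability words of the "Technology" topic are shown here.
technology_topic_words = {"software": 0.12, "cloud": 0.09, "AI": 0.08, "network": 0.07}

assert abs(sum(doc_topic_mixture.values()) - 1.0) < 1e-9
```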
The Generative Process
Let's outline the step-by-step generative process assumed by LDA for a corpus of M documents, with a predefined number of K topics, and a vocabulary of V unique words.
1. Define Priors:
   - Choose a parameter $\alpha$ for the Dirichlet prior over document-topic distributions. Typically, $\alpha$ is a symmetric $K$-dimensional vector, $\alpha = (\alpha_1, \ldots, \alpha_K)$, often set to $\alpha_i = \alpha_0 / K$ for some scalar $\alpha_0$.
   - Choose a parameter $\beta$ for the Dirichlet prior over topic-word distributions. Typically, $\beta$ is a symmetric $V$-dimensional vector, $\beta = (\beta_1, \ldots, \beta_V)$, often set to $\beta_j = \beta_0 / V$ for some scalar $\beta_0$.
2. Generate Topic-Word Distributions:
   - For each topic $k \in \{1, \ldots, K\}$:
     - Draw a word distribution $\phi_k \sim \mathrm{Dir}(\beta)$. $\phi_k$ is a $V$-dimensional vector where $\phi_{kv}$ is the probability of word $v$ occurring under topic $k$, and $\sum_{v=1}^{V} \phi_{kv} = 1$.
3. Generate Document-Specific Variables:
   - For each document $d \in \{1, \ldots, M\}$:
     - Draw a topic mixture $\theta_d \sim \mathrm{Dir}(\alpha)$. $\theta_d$ is a $K$-dimensional vector where $\theta_{dk}$ is the proportion of topic $k$ in document $d$, and $\sum_{k=1}^{K} \theta_{dk} = 1$.
     - Determine the number of words in the document, $N_d$.
     - For each word position $n \in \{1, \ldots, N_d\}$:
       - Draw a topic assignment $z_{dn} \sim \mathrm{Cat}(\theta_d)$. $z_{dn}$ indicates which topic generated the $n$-th word in document $d$.
       - Draw the observed word $w_{dn} \sim \mathrm{Cat}(\phi_{z_{dn}})$. $w_{dn}$ is the actual word observed at position $n$ in document $d$, drawn from the word distribution corresponding to the assigned topic $z_{dn}$.
The hyperparameters α and β control the characteristics of the topic mixtures and topic-word distributions, respectively. Lower values generally lead to sparser distributions (documents composed of fewer topics, topics focused on fewer words).
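The generative story above is short enough to write down directly. Below is a minimal NumPy sketch of it; the corpus size, vocabulary size, document-length distribution, and the scalar hyperparameter values are made-up illustrative assumptions, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: M documents, K topics, V vocabulary words.
M, K, V = 5, 3, 20
alpha0, beta0 = 1.0, 1.0                  # illustrative scalar hyperparameters
alpha = np.full(K, alpha0 / K)            # symmetric Dirichlet prior over topics per document
beta = np.full(V, beta0 / V)              # symmetric Dirichlet prior over words per topic
# Smaller alpha0/K and beta0/V values yield sparser mixtures and topics.

# Step 2: draw a word distribution phi_k for each topic.
phi = rng.dirichlet(beta, size=K)         # shape (K, V); each row sums to 1

documents = []
for d in range(M):
    # Step 3: draw the document's topic mixture theta_d.
    theta_d = rng.dirichlet(alpha)        # shape (K,)
    N_d = rng.poisson(15) + 1             # document length (assumed Poisson here, purely for illustration)
    words = []
    for n in range(N_d):
        z_dn = rng.choice(K, p=theta_d)   # topic assignment for word position n
        w_dn = rng.choice(V, p=phi[z_dn]) # word drawn from the assigned topic's distribution
        words.append(w_dn)
    documents.append(words)

print(documents[0])  # word indices (0..V-1) of the first synthetic document
```

Only `documents` would ever be observed in practice; `theta_d`, `phi`, and the `z_dn` assignments are exactly the latent quantities that inference must recover.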
Graphical Model Representation
This generative process corresponds directly to a Bayesian Network structure. We can visualize these dependencies using plate notation, where plates (rectangles) indicate replication of variables.
Plate notation for the Latent Dirichlet Allocation model. Circles represent random variables (shaded are latent, double-circle is observed). Rectangles (plates) denote replication. Arrows indicate conditional dependencies. The dashed line from $\phi$ to $w$ signifies that the specific $\phi_k$ used depends on the value of $z$.
Bayesian Formulation Summary
In this PGM, the latent variables are the document-topic distributions $\theta = \{\theta_d\}_{d=1}^{M}$, the topic-word distributions $\phi = \{\phi_k\}_{k=1}^{K}$, and the topic assignments for each word $Z = \{z_{dn}\}_{d=1, n=1}^{M, N_d}$. The observed variables are the words themselves, $W = \{w_{dn}\}_{d=1, n=1}^{M, N_d}$. The parameters $\alpha$ and $\beta$ are typically treated as fixed hyperparameters, although they can also be learned (e.g., using empirical Bayes or by placing hyperpriors on them).
The complete joint probability distribution over all variables, given the hyperparameters, factorizes according to the graphical model:
$$
p(W, Z, \theta, \phi \mid \alpha, \beta) = \left( \prod_{k=1}^{K} p(\phi_k \mid \beta) \right) \left( \prod_{d=1}^{M} p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \phi) \right)
$$
Here:
- $p(\phi_k \mid \beta)$ is the Dirichlet probability density for topic $k$'s word distribution.
- $p(\theta_d \mid \alpha)$ is the Dirichlet probability density for document $d$'s topic distribution.
- $p(z_{dn} \mid \theta_d)$ is the Categorical probability mass for assigning topic $z_{dn}$ based on $\theta_d$.
- $p(w_{dn} \mid z_{dn}, \phi)$ is the Categorical probability mass for observing word $w_{dn}$ given its assigned topic $z_{dn}$ and the relevant topic-word distribution $\phi_{z_{dn}}$.
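To make the factorization concrete, here is a minimal sketch that evaluates the log of this joint probability, assuming the corpus is stored as lists of word indices with matching topic assignments and that `theta` and `phi` are arrays of shape (M, K) and (K, V); the function name and data layout are illustrative assumptions, not a reference implementation.

```python
import numpy as np
from scipy.stats import dirichlet

def log_joint(docs, z, theta, phi, alpha, beta):
    """Log of p(W, Z, theta, phi | alpha, beta) under the LDA factorization.

    docs[d][n]  : observed word index w_dn
    z[d][n]     : topic assignment z_dn
    theta       : (M, K) document-topic proportions
    phi         : (K, V) topic-word probabilities
    alpha, beta : Dirichlet hyperparameter vectors of length K and V
    """
    lp = 0.0
    # prod_k p(phi_k | beta): Dirichlet prior on each topic's word distribution
    for k in range(phi.shape[0]):
        lp += dirichlet.logpdf(phi[k], beta)
    for d, words in enumerate(docs):
        # p(theta_d | alpha): Dirichlet prior on the document's topic mixture
        lp += dirichlet.logpdf(theta[d], alpha)
        for n, w in enumerate(words):
            k = z[d][n]
            # p(z_dn | theta_d) * p(w_dn | z_dn, phi): the two categorical terms
            lp += np.log(theta[d, k]) + np.log(phi[k, w])
    return lp
```

Inference algorithms work with this joint (or with versions of it in which $\theta$ and $\phi$ have been integrated out) rather than with the intractable posterior normalizer discussed next.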
The Bayesian formulation sets the stage for inference. Our objective is typically to compute the posterior distribution of the latent variables given the observed documents: $p(Z, \theta, \phi \mid W, \alpha, \beta)$. This posterior distribution reveals the hidden thematic structure. However, calculating this posterior directly is intractable due to the complex dependencies and high dimensionality. This necessitates the use of approximate inference techniques like Collapsed Gibbs Sampling or Variational Bayes, which are the subjects of the following sections.
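To spell out where the intractability comes from, the posterior's denominator is the marginal likelihood of the corpus, obtained by integrating and summing the joint over all latent variables:

$$
p(Z, \theta, \phi \mid W, \alpha, \beta) = \frac{p(W, Z, \theta, \phi \mid \alpha, \beta)}{p(W \mid \alpha, \beta)}, \qquad p(W \mid \alpha, \beta) = \int\!\!\int \sum_{Z} p(W, Z, \theta, \phi \mid \alpha, \beta) \, d\theta \, d\phi
$$

The sum over $Z$ alone has $K^{\sum_d N_d}$ terms, and the coupling between $\theta$, $\phi$, and $Z$ inside the integral prevents a closed-form solution, which is precisely why the approximate methods in the following sections are needed.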