As we discussed in the chapter introduction, directly computing or sampling from complex, high-dimensional posterior distributions p(θ∣D) is often intractable. MCMC algorithms provide a clever workaround: instead of drawing independent samples directly from p(θ∣D), we construct a Markov chain whose states are parameter values θ and whose stationary distribution is precisely the target posterior distribution p(θ∣D). If we simulate this chain for long enough, the samples it generates will eventually behave as if they were drawn from the target posterior.
But how does this work? Why can we trust that samples from some cleverly constructed random walk will eventually look like samples from our specific posterior? The answer lies in the mathematical properties of Markov chains. Let's examine the essential theory.
A Markov chain is a sequence of random variables θ^(0), θ^(1), θ^(2), …, where each θ^(t) represents the state of the chain at time step t. In our context, a state is typically a specific value (or vector of values) for the model parameters θ. The defining characteristic of a Markov chain is the Markov property: the distribution of the next state θ^(t+1) depends only on the current state θ^(t) and not on any previous states θ^(0), …, θ^(t−1).
Mathematically, this is expressed as:
$$P(\theta^{(t+1)} \mid \theta^{(t)}, \theta^{(t-1)}, \dots, \theta^{(0)}) = P(\theta^{(t+1)} \mid \theta^{(t)})$$

The mechanism governing the movement between states is the transition kernel (or transition probability), denoted T(θ′∣θ). This function gives the probability density of moving to state θ′ given that the chain is currently in state θ. MCMC algorithms work by carefully designing this transition kernel T.
A simple illustration of a Markov chain with three states (θ_1, θ_2, θ_3) and associated transition probabilities between them. The next state depends only on the current state.
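To make this concrete, here is a minimal Python sketch of such a three-state chain. The transition matrix values below are illustrative (they are not taken from the figure); each row contains the probabilities of moving out of one state, and each simulated step looks only at the current state, which is the Markov property in action.

```python
import numpy as np

# Transition matrix: T[i, j] = probability of moving from state i to state j.
# Illustrative values only; each row must sum to 1.
T = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

rng = np.random.default_rng(0)

def simulate_chain(T, start_state, n_steps):
    """Simulate a Markov chain: the next state depends only on the current one."""
    states = [start_state]
    for _ in range(n_steps):
        current = states[-1]
        # Sample the next state using only the row for the current state.
        next_state = rng.choice(len(T), p=T[current])
        states.append(next_state)
    return np.array(states)

samples = simulate_chain(T, start_state=0, n_steps=10_000)
print(np.bincount(samples) / len(samples))  # empirical state frequencies
```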
A central concept is the stationary distribution, often denoted by π(θ). A distribution π(θ) is stationary for a Markov chain with transition kernel T(θ′∣θ) if, once the chain's state distribution reaches π(θ), it stays there. In other words, if the probability of being in state θ at time t is given by π(θ), then the probability of being in state θ′ at time t+1 is also given by π(θ′).
Formally, π is a stationary distribution if it satisfies:
$$\pi(\theta') = \int T(\theta' \mid \theta)\, \pi(\theta)\, d\theta$$

For discrete state spaces, this becomes:
$$\pi_j = \sum_i T_{ji}\, \pi_i$$

where π_i is the probability of being in state i, and T_ji is the probability of transitioning from state i to state j.
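Continuing the illustrative three-state example from above, a short numerical sketch: for a row-stochastic transition matrix, the stationary distribution is a left eigenvector with eigenvalue 1, and we can verify that applying the kernel to it leaves it unchanged.

```python
import numpy as np

# Same illustrative transition matrix as before (rows sum to 1).
T = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

# The stationary distribution pi satisfies pi @ T = pi,
# i.e. it is a left eigenvector of T with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(T.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()  # normalize to a probability distribution

print("stationary distribution:", pi)
print("pi @ T:", pi @ T)  # reproduces pi, confirming stationarity
```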
The goal of MCMC is to design a transition kernel T such that its unique stationary distribution π(θ) is exactly our target posterior distribution p(θ∣D). If we can achieve this, running the chain ensures that the distribution of states eventually converges to p(θ∣D).
For an MCMC algorithm to be useful, we need guarantees that the chain will actually converge to the desired stationary distribution π(θ)=p(θ∣D), regardless of the starting state θ(0). This property is called ergodicity. An ergodic Markov chain has a unique stationary distribution, and the distribution of θ(t) converges to this stationary distribution as t→∞.
Two conditions are generally sufficient for ergodicity:
Irreducibility: The chain must be able to reach any part of the state space (where π(θ) > 0) from any other part, eventually. This ensures the chain doesn't get stuck in only a portion of the target distribution. Formally, for any initial state θ and any region A with ∫_A π(θ′) dθ′ > 0, there must be some t > 0 such that P(θ^(t) ∈ A ∣ θ^(0) = θ) > 0.
Aperiodicity: The chain should not be restricted to visiting states in a fixed cycle. For example, it shouldn't only be possible to return to state θ after an even number of steps. Aperiodicity helps ensure that convergence isn't hindered by systematic, cyclical behavior. Many standard MCMC algorithms (like Metropolis-Hastings with reasonable proposals, or Gibbs sampling) automatically satisfy this condition because there's usually a non-zero probability of staying in the same state (T(θ∣θ)>0).
If a chain is irreducible and aperiodic, and possesses a stationary distribution π(θ), then it is ergodic. This means:
$$\lim_{t \to \infty} P(\theta^{(t)} \in A \mid \theta^{(0)}) = \int_A \pi(\theta')\, d\theta'$$

for any starting state θ^(0) and any region A in the state space. This limiting behavior is exactly what we need: after a sufficient "burn-in" period, the states θ^(t) behave like samples drawn from π(θ) = p(θ∣D).
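A quick way to see this convergence for the illustrative three-state chain: start from two very different initial distributions, apply the transition matrix repeatedly, and observe that both settle on the same stationary distribution.

```python
import numpy as np

# Same illustrative three-state transition matrix as before.
T = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

# Two very different starting distributions over the three states.
starts = {
    "start at state 0": np.array([1.0, 0.0, 0.0]),
    "start at state 2": np.array([0.0, 0.0, 1.0]),
}

for label, dist in starts.items():
    for _ in range(50):      # apply the transition kernel repeatedly
        dist = dist @ T      # distribution at time t+1 given distribution at t
    print(label, "->", np.round(dist, 4))

# Both starting points converge to the same stationary distribution,
# illustrating ergodic convergence for this small example.
```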
Designing a transition kernel T and proving directly that p(θ∣D) is its stationary distribution using the integral equation π(θ′)=∫T(θ′∣θ)π(θ)dθ can be difficult. Fortunately, a simpler, sufficient condition exists: detailed balance (also known as reversibility).
A Markov chain satisfies detailed balance with respect to a distribution π(θ) if the probability flow from state θ to state θ′ is equal to the probability flow from θ′ back to θ:
$$\pi(\theta)\, T(\theta' \mid \theta) = \pi(\theta')\, T(\theta \mid \theta')$$

If a transition kernel T satisfies detailed balance with respect to π, then π is a stationary distribution for that kernel. We can see this by integrating both sides over θ:
$$\int \pi(\theta)\, T(\theta' \mid \theta)\, d\theta = \int \pi(\theta')\, T(\theta \mid \theta')\, d\theta$$

$$\int T(\theta' \mid \theta)\, \pi(\theta)\, d\theta = \pi(\theta') \int T(\theta \mid \theta')\, d\theta$$

Since ∫ T(θ ∣ θ′) dθ = 1 (the total probability of transitioning from θ′ is 1), we get:
$$\int T(\theta' \mid \theta)\, \pi(\theta)\, d\theta = \pi(\theta')$$

This is precisely the condition for π being a stationary distribution.
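As a small numeric sketch of this idea, we can build a Metropolis-style transition matrix for an arbitrary discrete target distribution (the four-state target below is illustrative) and check that detailed balance, and therefore stationarity, holds.

```python
import numpy as np

# An arbitrary target distribution over 4 discrete states (illustrative values).
target = np.array([0.1, 0.2, 0.3, 0.4])
n = len(target)

# Symmetric proposal: pick any state uniformly at random.
proposal = np.full((n, n), 1.0 / n)

# Metropolis-style acceptance: accept a move i -> j with prob min(1, pi_j / pi_i).
T = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            T[i, j] = proposal[i, j] * min(1.0, target[j] / target[i])
    T[i, i] = 1.0 - T[i].sum()  # rejected proposals stay at the current state

# Check detailed balance: pi_i * T[i, j] == pi_j * T[j, i] for all pairs.
flow = target[:, None] * T
print("detailed balance holds:", np.allclose(flow, flow.T))

# Detailed balance implies stationarity: pi @ T reproduces pi.
print("pi @ T:", np.round(target @ T, 6))
```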
Most common MCMC algorithms, including Metropolis-Hastings and Gibbs sampling, are constructed specifically to satisfy detailed balance with respect to the target posterior p(θ∣D). This provides a practical way to design algorithms that are guaranteed (under irreducibility and aperiodicity) to converge to the correct distribution.
Understanding these core concepts (the Markov property, stationary distributions, ergodicity via irreducibility and aperiodicity, and detailed balance) is fundamental to MCMC. This theory provides the mathematical justification for why simulating a carefully constructed Markov chain allows us to generate samples that approximate our target posterior distribution p(θ∣D). By ensuring our chosen MCMC algorithm corresponds to an ergodic chain satisfying detailed balance with respect to p(θ∣D), we gain confidence that the samples generated after a burn-in period can be used for Bayesian inference, letting us approximate posterior means, variances, credible intervals, and other quantities of interest. The following sections build on this theoretical foundation as we explore specific MCMC algorithms such as Metropolis-Hastings, Gibbs sampling, and Hamiltonian Monte Carlo.