As highlighted in the chapter introduction, Bayesian inference hinges on understanding the posterior distribution p(θ∣D). Often, we aren't interested in the entire distribution itself, but rather in summaries derived from it. These summaries typically take the form of expectations. For instance, the posterior mean, a common point estimate for parameters θ, is calculated as:
E[θ∣D] = ∫ θ p(θ∣D) dθ

Similarly, we might want to compute the posterior variance, credible intervals, or the probability of a specific hypothesis, all of which involve integrating some function f(θ) with respect to the posterior distribution:
E[f(θ)∣D] = ∫ f(θ) p(θ∣D) dθ

The core difficulty arises because the posterior p(θ∣D) is frequently high-dimensional and lacks a standard analytical form. Direct computation of these integrals is usually impossible.
This is where the idea of Monte Carlo integration comes into play. It's a numerical method that approximates the value of a definite integral using random sampling. The intuition is grounded in the Law of Large Numbers (LLN).
Suppose we want to estimate the expectation E[f(θ)] = ∫ f(θ) p(θ) dθ, where θ is drawn from the distribution p(θ). If we could somehow generate S independent samples θ^(1), θ^(2), ..., θ^(S) directly from the distribution p(θ), the LLN tells us that the sample average of f(θ) will converge to the true expectation as the number of samples increases.
We define the Monte Carlo estimator of E[f(θ)] as:
Ê_S[f(θ)] = (1/S) ∑_{s=1}^{S} f(θ^(s)), where θ^(s) ∼ p(θ) for s = 1, ..., S.
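This estimator is straightforward to express in code. The sketch below is illustrative (the function name `mc_estimate` and the choice of test integrand are not from the text); it averages f over S independent draws from p(θ):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, S):
    """Monte Carlo estimate of E[f(theta)]: average f over S draws from p(theta)."""
    samples = sampler(S)        # theta^(1), ..., theta^(S) ~ p(theta)
    return np.mean(f(samples))  # (1/S) * sum_s f(theta^(s))

# Sanity check: E[theta^2] for theta ~ N(0, 1) is the variance, i.e. 1.
est = mc_estimate(lambda t: t**2, lambda S: rng.standard_normal(S), 100_000)
```

Any sampler and any (integrable) f can be plugged in; the estimator itself never changes.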
As S→∞, this estimator converges to the true expectation:
lim_{S→∞} (1/S) ∑_{s=1}^{S} f(θ^(s)) = E[f(θ)]

Furthermore, the Central Limit Theorem (CLT) often applies. If the variance of f(θ), denoted Var[f(θ)], is finite, then for large S, the distribution of the estimator Ê_S[f(θ)] is approximately Normal:
Ê_S[f(θ)] ≈ N(E[f(θ)], Var[f(θ)]/S)

This is immensely useful because it tells us how the accuracy of our estimate improves with more samples. The standard error of the estimate decreases proportionally to 1/√S. We can estimate Var[f(θ)] using the sample variance of the computed f(θ^(s)) values, giving us a way to quantify the uncertainty in our Monte Carlo estimate.
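The CLT-based standard error can be computed directly from the evaluated f(θ^(s)) values. A minimal sketch (the helper name `mc_estimate_with_se` is illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(42)

def mc_estimate_with_se(f_vals):
    """Point estimate and CLT standard error from the values f(theta^(s))."""
    S = len(f_vals)
    est = np.mean(f_vals)
    se = np.std(f_vals, ddof=1) / np.sqrt(S)  # shrinks like 1/sqrt(S)
    return est, se

# Quadrupling S should roughly halve the standard error.
small = rng.standard_normal(1_000) ** 2   # f(theta) = theta^2, theta ~ N(0, 1)
large = rng.standard_normal(4_000) ** 2
_, se_small = mc_estimate_with_se(small)
_, se_large = mc_estimate_with_se(large)
```

Comparing `se_small` and `se_large` makes the 1/√S behavior concrete: four times the samples buys roughly half the error.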
Let's consider a simple example. Suppose our parameter θ follows a Beta distribution, θ ∼ Beta(α, β), perhaps representing the unknown probability of success in a series of Bernoulli trials. The probability density function is p(θ∣α, β) = [Γ(α+β) / (Γ(α)Γ(β))] θ^(α−1) (1−θ)^(β−1) for θ ∈ [0, 1].
We want to estimate the mean E[θ]. Analytically, we know E[θ] = α/(α+β). Let's choose α = 3 and β = 7. The true mean is E[θ] = 3/(3+7) = 0.3.
Now, let's pretend we don't know the analytical formula and use Monte Carlo integration. Here, f(θ)=θ. We can easily draw samples θ(s) directly from a Beta(3,7) distribution using standard statistical software libraries.
We draw S samples and compute the Monte Carlo estimate:
Ê_S[θ] = (1/S) ∑_{s=1}^{S} θ^(s)

The plot below shows how the estimate Ê_S[θ] converges towards the true mean (0.3) as the number of samples S increases.
Convergence of the Monte Carlo estimate for the mean of a Beta(3, 7) distribution. The blue line shows the estimate based on S samples (log scale), approaching the true mean (red dashed line) as S increases.
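The convergence shown in the plot can be reproduced in a few lines. This is a minimal sketch using NumPy's random generator (the specific seed and sample sizes are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 3, 7
true_mean = alpha / (alpha + beta)  # analytical mean: 0.3

for S in (100, 10_000, 1_000_000):
    theta = rng.beta(alpha, beta, size=S)  # S direct draws from Beta(3, 7)
    print(S, np.mean(theta))               # sample average approaches 0.3 as S grows
```

Running this, the estimates cluster ever more tightly around 0.3, mirroring the 1/√S shrinkage of the standard error.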
This simple example illustrates the core principle: if you can sample directly from the target distribution p(θ), you can approximate expectations using sample averages.
The power of Monte Carlo integration relies entirely on our ability to draw independent samples θ(s) from the target distribution p(θ). In Bayesian inference, our target distribution is the posterior p(θ∣D).
Unfortunately, for most interesting models, the posterior distribution p(θ∣D) ∝ p(D∣θ) p(θ) is a complex, high-dimensional function known only up to a normalizing constant (the evidence p(D)). Directly generating independent samples from such distributions is typically intractable. We cannot simply call a function like numpy.random.beta to get samples from an arbitrary posterior.
This significant limitation is precisely why we need Markov Chain Monte Carlo (MCMC) methods. MCMC techniques provide an ingenious workaround. Instead of generating independent samples directly from p(θ∣D), they construct a Markov chain whose states are possible values of θ, and whose equilibrium (or stationary) distribution is exactly the target posterior p(θ∣D).
By running this Markov chain for a sufficiently long time, the states visited by the chain can be treated as samples from the posterior distribution. While these samples are typically correlated (unlike the independent samples assumed in basic Monte Carlo), they can still be used within the Monte Carlo integration framework. The Law of Large Numbers still holds under certain conditions (ergodicity) for Markov chains.
So, MCMC methods allow us to generate the necessary (though correlated) samples θ^(1), θ^(2), ..., θ^(S) from distributions we cannot sample from directly. Once we have these samples, we use the standard Monte Carlo estimator (1/S) ∑_{s=1}^{S} f(θ^(s)) to approximate posterior expectations E[f(θ)∣D].
The subsequent sections of this chapter will detail how to build and use these Markov chains effectively, starting with the foundational Metropolis-Hastings algorithm.
© 2025 ApX Machine Learning