The standard Metropolis-Hastings (MH) algorithm provides a general framework for sampling from a target distribution, typically the posterior p(θ∣D), when we can evaluate a function proportional to it (like the product of the likelihood and prior, p(D∣θ)p(θ)). The core idea involves proposing a new state θ′ from a proposal distribution q(θ′∣θ(t)) and accepting it with probability:
$$\alpha(\theta', \theta^{(t)}) = \min\left(1,\; \frac{p(\theta' \mid D)\, q(\theta^{(t)} \mid \theta')}{p(\theta^{(t)} \mid D)\, q(\theta' \mid \theta^{(t)})}\right)$$
If accepted, θ(t+1)=θ′. Otherwise, θ(t+1)=θ(t). While powerful, the efficiency and effectiveness of the basic MH algorithm heavily depend on the choice of the proposal distribution q. This has led to several variants designed to improve performance in different scenarios.
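To make the accept/reject step concrete, here is a minimal sketch of a single MH update in Python, working in log space for numerical stability. The callables `log_target`, `sample_proposal`, and `log_proposal` are hypothetical placeholders, not from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_step(theta, log_target, sample_proposal, log_proposal):
    """One Metropolis-Hastings update.

    log_target(theta)      -- log of an unnormalized target, e.g. log p(D|theta) + log p(theta)
    sample_proposal(theta) -- draw theta' ~ q(. | theta)
    log_proposal(a, b)     -- log q(a | b)
    """
    theta_prop = sample_proposal(theta)
    # Hastings ratio in log space to avoid numerical under/overflow.
    log_alpha = (log_target(theta_prop) + log_proposal(theta, theta_prop)
                 - log_target(theta) - log_proposal(theta_prop, theta))
    if np.log(rng.uniform()) < log_alpha:
        return theta_prop, True   # accept
    return theta, False           # reject: the chain repeats the current state
```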
Random Walk Metropolis (RWM)
Perhaps the simplest and most common variant is the Random Walk Metropolis algorithm. It uses a symmetric proposal distribution, meaning the probability of proposing θ′ starting from θ is the same as proposing θ starting from θ′. Formally, q(θ′∣θ)=q(θ∣θ′).
A typical choice is a Gaussian centered at the current state:
$$\theta' \sim \mathcal{N}\left(\theta^{(t)}, \Sigma\right)$$
where Σ is a covariance matrix, often diagonal, e.g., σ²I. Because the proposal is symmetric, the proposal terms cancel out in the acceptance ratio:
$$q(\theta^{(t)} \mid \theta') = q(\theta' \mid \theta^{(t)})$$
This simplifies the acceptance probability to:
$$\alpha(\theta', \theta^{(t)}) = \min\left(1,\; \frac{p(\theta' \mid D)}{p(\theta^{(t)} \mid D)}\right)$$
This is the original Metropolis algorithm (before the Hastings generalization).
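As a concrete illustration, a minimal RWM sampler might look like the following sketch; the toy Gaussian target is a stand-in for a real log-posterior, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def rwm(log_target, theta0, sigma, n_samples):
    """Random Walk Metropolis with an isotropic Gaussian proposal N(theta, sigma^2 I)."""
    theta = np.asarray(theta0, dtype=float)
    samples, accepted = [], 0
    for _ in range(n_samples):
        theta_prop = theta + sigma * rng.standard_normal(theta.shape)
        # Symmetric proposal: the q terms cancel, leaving the Metropolis ratio.
        if np.log(rng.uniform()) < log_target(theta_prop) - log_target(theta):
            theta, accepted = theta_prop, accepted + 1
        samples.append(theta)
    return np.array(samples), accepted / n_samples

# Toy unnormalized 2-D Gaussian "posterior", used purely for illustration.
log_target = lambda th: -0.5 * np.sum(th**2)
samples, acc_rate = rwm(log_target, theta0=[3.0, -3.0], sigma=0.8, n_samples=5000)
```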
Pros:
- Easy to implement.
- Requires no complex tuning beyond the proposal scale (e.g., σ²).
Cons:
- Can be inefficient if the target distribution has complex correlations or varying scales across dimensions. The random walk behavior can lead to slow exploration of the parameter space.
- Performance is highly sensitive to the choice of proposal scale (σ² or Σ).
Tuning the Proposal Scale:
Choosing the right scale for the random-walk proposal is critical, because it directly controls the trade-off between how often proposals are accepted and how far the chain moves per step.
- Too small: Most proposals will be accepted, but the chain moves very slowly and explores the space inefficiently, resulting in high autocorrelation between samples.
- Too large: Many proposals will land in regions of low probability and be rejected. The chain gets stuck, again leading to poor mixing.
A common guideline, derived from theoretical analyses of idealized targets, is an acceptance rate of about 0.234 for high-dimensional problems and about 0.44 for one-dimensional problems, though this is a heuristic, not a strict rule. Tuning often involves running pilot simulations and adjusting the proposal scale until the acceptance rate is reasonable, as in the sketch below.
Figure: Illustration of Random Walk Metropolis paths. Small steps lead to slow exploration; large steps lead to frequent rejections (staying in the same place, indicated by overlapping markers); a well-tuned step size balances acceptance and movement.
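One simple pilot-tuning scheme, sketched below assuming the `rwm` helper from the earlier sketch, runs short chains repeatedly and nudges the proposal scale toward a target acceptance rate:

```python
import numpy as np

def tune_sigma(log_target, theta0, sigma0=1.0, target_rate=0.234,
               n_rounds=20, n_pilot=200):
    """Crude pilot tuning: rescale sigma until the acceptance rate is near target_rate.

    Assumes rwm(log_target, theta0, sigma, n_samples) returns (samples, acceptance_rate).
    """
    sigma, theta = sigma0, np.asarray(theta0, dtype=float)
    for _ in range(n_rounds):
        samples, rate = rwm(log_target, theta, sigma, n_pilot)
        theta = samples[-1]                  # continue from where the pilot chain ended
        sigma *= np.exp(rate - target_rate)  # too many accepts -> widen; too few -> shrink
    return sigma
```

Samples drawn during tuning are usually discarded, since changing the proposal mid-run breaks the time-homogeneity of the chain.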
Independent Metropolis-Hastings
In this variant, the proposed state θ′ is drawn from a distribution q(θ′) that does not depend on the current state θ(t).
$$\theta' \sim q(\theta')$$
This means q(θ′∣θ(t))=q(θ′). The acceptance probability uses the full Hastings ratio because the proposal is generally not symmetric:
$$\alpha(\theta', \theta^{(t)}) = \min\left(1,\; \frac{p(\theta' \mid D)\, q(\theta^{(t)})}{p(\theta^{(t)} \mid D)\, q(\theta')}\right)$$
Requirements:
For this to work well, the proposal distribution q(θ′) should ideally be a good approximation of the target posterior p(θ∣D). Common choices include a multivariate Gaussian or Student's t-distribution centered at the posterior mode (found via optimization), for instance via a Laplace approximation.
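A sketch of an independence sampler, assuming SciPy is available and using a hypothetical heavy-tailed Student's t proposal centered at a previously found posterior mode:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def independent_mh(log_target, proposal, theta0, n_samples):
    """Independence sampler: the proposal ignores the current state, so the
    full Hastings ratio with q(theta') and q(theta) is required.

    proposal -- a frozen SciPy distribution exposing .rvs() and .logpdf().
    """
    theta = float(theta0)
    samples = []
    for _ in range(n_samples):
        theta_prop = proposal.rvs(random_state=rng)
        log_alpha = (log_target(theta_prop) + proposal.logpdf(theta)
                     - log_target(theta) - proposal.logpdf(theta_prop))
        if np.log(rng.uniform()) < log_alpha:
            theta = theta_prop
        samples.append(theta)
    return np.array(samples)

# Hypothetical setup: Student's t centered at a mode found by an optimizer (here 0.0),
# with tails deliberately heavier than the toy Gaussian target's.
proposal = stats.t(df=4, loc=0.0, scale=1.5)
samples = independent_mh(lambda th: -0.5 * th**2, proposal, theta0=0.0, n_samples=5000)
```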
Pros:
- If q(θ′) is a good approximation of p(θ∣D), the sampler can make large jumps across the parameter space, potentially leading to much faster mixing than RWM.
- Can be effective if a reasonable approximation of the posterior is available.
Cons:
- Finding a good proposal distribution q(θ′) can be challenging.
- If q(θ′) is a poor approximation, especially if its tails are lighter than those of the target posterior p(θ∣D), the sampler may rarely propose moves into important regions and can get stuck at a single state for long stretches, leading to very poor performance. At a minimum, the support condition p(θ∣D)>0 ⟹ q(θ)>0 must hold.
- Acceptance rates can be very low if q is dissimilar to p.
Metropolis-Adjusted Langevin Algorithm (MALA)
MALA incorporates gradient information from the target distribution to propose moves more intelligently, pushing the samples towards regions of higher probability. This is particularly useful in high-dimensional spaces where random walks are inefficient.
The proposal mechanism is based on Langevin dynamics, which involves a drift term based on the gradient of the log-posterior and a random noise term. A discrete-time approximation leads to the proposal:
$$\theta' \sim \mathcal{N}\left(\theta^{(t)} + \frac{\epsilon^2}{2} \nabla \log p(\theta^{(t)} \mid D),\; \epsilon^2 I\right)$$
Here, ϵ is a step size parameter, and ∇logp(θ∣D) is the gradient of the log target density (log-likelihood + log-prior). This proposal pushes θ′ "uphill" towards higher density regions.
Because the proposal depends on the gradient at the current point θ(t), it is not symmetric. The proposal density is:
$$q(\theta' \mid \theta^{(t)}) = \mathcal{N}\left(\theta' \mid \mu(\theta^{(t)}),\; \epsilon^2 I\right)$$
where μ(θ(t)) = θ(t) + (ϵ²/2) ∇log p(θ(t)∣D). The full Hastings acceptance ratio must be used:
$$\alpha(\theta', \theta^{(t)}) = \min\left(1,\; \frac{p(\theta' \mid D)\, q(\theta^{(t)} \mid \theta')}{p(\theta^{(t)} \mid D)\, q(\theta' \mid \theta^{(t)})}\right)$$
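Putting the pieces together, here is a minimal MALA sketch; the toy Gaussian target comes with a hand-coded gradient, whereas in practice ∇log p(θ∣D) would come from a derived expression or automatic differentiation:

```python
import numpy as np

rng = np.random.default_rng(3)

def mala(log_target, grad_log_target, theta0, eps, n_samples):
    """Metropolis-Adjusted Langevin Algorithm with step size eps."""
    def mean(th):   # drift toward higher density: theta + (eps^2 / 2) * grad log p
        return th + 0.5 * eps**2 * grad_log_target(th)

    def log_q(a, b):   # log N(a | mean(b), eps^2 I), up to an additive constant
        return -np.sum((a - mean(b))**2) / (2 * eps**2)

    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_samples):
        theta_prop = mean(theta) + eps * rng.standard_normal(theta.shape)
        # Asymmetric proposal: the forward and reverse q terms do not cancel.
        log_alpha = (log_target(theta_prop) + log_q(theta, theta_prop)
                     - log_target(theta) - log_q(theta_prop, theta))
        if np.log(rng.uniform()) < log_alpha:
            theta = theta_prop
        samples.append(theta)
    return np.array(samples)

# Toy Gaussian target, for which grad log p(theta) = -theta.
samples = mala(lambda th: -0.5 * np.sum(th**2), lambda th: -th,
               theta0=np.ones(2), eps=0.5, n_samples=5000)
```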
Pros:
- Often explores the parameter space much more efficiently than RWM, especially in higher dimensions, by leveraging gradient information.
- Can handle complex posterior geometries better than simple random walks.
Cons:
- Requires computing the gradient ∇log p(θ∣D), which may be unavailable in closed form or computationally expensive for some models. Automatic differentiation tools can help here.
- Requires tuning the step size ϵ. Similar to RWM, ϵ needs to be balanced for good acceptance rates and exploration. Optimal acceptance rates for MALA are often suggested to be around 0.574.
- The discrete-time approximation can still lead to inefficiencies if ϵ is too large.
MALA forms a conceptual bridge to Hamiltonian Monte Carlo (HMC), which uses gradient information even more effectively by introducing auxiliary momentum variables and simulating Hamiltonian dynamics.
Other Variants
- Adaptive MCMC: These algorithms automatically tune proposal parameters (like the covariance matrix Σ in RWM or the step size ϵ in MALA) during the simulation, often during an initial "burn-in" or "warm-up" phase. While powerful, care must be taken to ensure adaptation stops or diminishes over time to maintain the correct stationary distribution guarantees of the Markov chain; a minimal sketch of this idea appears after this list.
- Component-wise Metropolis-Hastings: Instead of updating all parameters θ=(θ1,...,θd) at once, update one parameter or a block of parameters at a time, conditional on the current values of the others. This is closely related to Gibbs sampling, which is discussed in the next section.
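As a rough illustration of the adaptive idea mentioned above, the following sketch rebuilds the RWM proposal covariance from the chain's history during a warm-up phase and then freezes it. This is a simplified version of Haario-style adaptive Metropolis, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def adaptive_rwm(log_target, theta0, n_samples, n_adapt=1000):
    """RWM whose proposal covariance adapts during warm-up, then stays fixed."""
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    cov = np.eye(d)
    history, samples = [theta], []
    for t in range(n_samples):
        if t < n_adapt and t > 2 * d and t % 100 == 0:
            # 2.38^2 / d is the classic optimal-scaling factor; the small jitter
            # keeps the empirical covariance positive definite.
            cov = (2.38**2 / d) * np.cov(np.array(history).T) + 1e-6 * np.eye(d)
        theta_prop = rng.multivariate_normal(theta, cov)
        if np.log(rng.uniform()) < log_target(theta_prop) - log_target(theta):
            theta = theta_prop
        history.append(theta)
        samples.append(theta)
    return np.array(samples)
```

Freezing the covariance after warm-up is the simplest way to preserve the stationary-distribution guarantee; diminishing-adaptation schemes are a more sophisticated alternative.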
Choosing the right Metropolis-Hastings variant involves trade-offs between implementation complexity, computational cost (especially gradients), tuning effort, and sampling efficiency for the specific problem structure. For complex, high-dimensional posteriors, methods incorporating gradient information like MALA (and later, HMC/NUTS) are often significantly more effective than basic Random Walk Metropolis.