While standard Metropolis-Hastings and Gibbs sampling provide foundational approaches for MCMC, they can struggle in high-dimensional or highly correlated posterior landscapes. Their proposals often resemble random walks, leading to slow exploration and high autocorrelation between samples. Hamiltonian Monte Carlo (HMC) offers a significantly more efficient way to explore the parameter space by borrowing ideas from classical physics, specifically Hamiltonian dynamics.
Think of the negative log posterior density as a potential energy surface. We want to explore this surface efficiently. HMC introduces a fictitious particle moving on this surface. The position of this particle corresponds to the parameters θ we want to sample, and we introduce an auxiliary momentum variable p, typically with the same dimension as θ.
In physics, the total energy of a system, the Hamiltonian H, is the sum of potential energy U and kinetic energy K. We adapt this for our statistical purpose:
Potential Energy U(θ): This is defined as the negative log of the target density. Since MCMC only requires the target density up to a constant of proportionality, we can use the negative log of the joint probability (likelihood times prior), ignoring the evidence term:
U(θ) = −log[p(D ∣ θ) p(θ)] = −log p(θ, D)

Minimizing the potential energy corresponds to maximizing the posterior probability.
Kinetic Energy K(p): This depends on the auxiliary momentum variables p and a "mass matrix" M, which is typically assumed to be the identity matrix (M=I) for simplicity, unless we have reasons to scale dimensions differently.
K(p) = ½ pᵀM⁻¹p

If M = I, this simplifies to K(p) = ½ Σᵢ pᵢ². This corresponds to sampling the momentum from a multivariate Gaussian distribution, usually p ∼ N(0, M).
Hamiltonian H(θ,p): The total "energy" of our fictitious system is:
H(θ, p) = U(θ) + K(p)

The core idea is to simulate the particle's movement according to Hamiltonian dynamics. In an ideal physical system, the total energy H(θ, p) remains constant over time. The particle's trajectory is governed by Hamilton's equations:
dθᵢ/dt = ∂H/∂pᵢ = (M⁻¹p)ᵢ
dpᵢ/dt = −∂H/∂θᵢ = −∂U(θ)/∂θᵢ

Intuitively, the first equation states that momentum causes the position (the parameters) to change. The second equation states that the force acting on the particle (the negative gradient of the potential energy, which is the gradient of the log posterior) causes the momentum to change. Notice that calculating the change in momentum requires the gradient of the potential energy, ∇θU(θ), i.e., the negative gradient of the log posterior. This is a key requirement for HMC.
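To make the gradient requirement concrete, here is a minimal sketch of U(θ) and ∇θU(θ) for a hypothetical toy model (data dᵢ ∼ N(θ, 1) with prior θ ∼ N(0, 1)); the function names are illustrative, and normalizing constants are dropped since the target density is only needed up to a constant:

```python
import numpy as np

# Hypothetical toy model: d_i ~ N(theta, 1) with prior theta ~ N(0, 1).
# Normalizing constants are dropped: U only needs the joint up to a constant.
def potential_energy(theta, data):
    neg_log_lik = 0.5 * np.sum((data - theta) ** 2)  # -log p(D | theta) + const
    neg_log_prior = 0.5 * theta ** 2                 # -log p(theta) + const
    return neg_log_lik + neg_log_prior

def grad_potential(theta, data):
    # dU/dtheta = sum_i (theta - d_i) + theta; HMC evaluates this at every step
    return np.sum(theta - data) + theta

data = np.array([0.8, 1.2, 1.0])
print(potential_energy(1.0, data))  # 0.54: lower U means higher posterior density
print(grad_potential(1.0, data))    # 1.0: the force on the particle is the negative of this
```

The gradient vanishes at the posterior mode (here θ = 0.75), exactly where the potential energy is minimized.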
Solving Hamilton's equations analytically is generally impossible for complex posteriors. Instead, we use numerical integration. A popular and effective method is the leapfrog integrator. It updates position and momentum in interleaved steps, providing better stability and preserving geometric properties compared to simpler integrators like Euler's method.
The leapfrog algorithm takes discrete time steps of size ϵ. To move from time t to t + ϵ, it performs three updates:

1. Half-step momentum update: p(t + ϵ/2) = p(t) − (ϵ/2) ∇θU(θ(t))
2. Full-step position update: θ(t + ϵ) = θ(t) + ϵ M⁻¹ p(t + ϵ/2)
3. Half-step momentum update: p(t + ϵ) = p(t + ϵ/2) − (ϵ/2) ∇θU(θ(t + ϵ))
We repeat these steps L times to simulate the trajectory for a total duration of Lϵ.
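The repeated leapfrog updates can be sketched as follows; this is a minimal implementation assuming M = I, where grad_U stands in for any function returning ∇θU(θ) (consecutive half-steps of momentum are merged into full steps inside the loop):

```python
import numpy as np

# Minimal leapfrog integrator for HMC, assuming an identity mass matrix M = I.
def leapfrog(theta, p, grad_U, eps, L):
    theta, p = np.copy(theta), np.copy(p)
    p -= 0.5 * eps * grad_U(theta)       # initial half-step for momentum
    for _ in range(L - 1):
        theta += eps * p                 # full step for position
        p -= eps * grad_U(theta)         # two merged half-steps for momentum
    theta += eps * p                     # final full step for position
    p -= 0.5 * eps * grad_U(theta)       # final half-step for momentum
    return theta, p
```

A useful sanity check: on a quadratic potential U(θ) = ½θ², the integrator keeps H(θ, p) nearly constant along the trajectory, which is what makes high acceptance rates possible.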
An iteration of HMC proceeds as follows:

1. Sample a fresh momentum p ∼ N(0, M).
2. Starting from the current state (θ, p), run L leapfrog steps of size ϵ to obtain a proposal (θ*, p*).
3. Accept the proposal with probability min(1, exp(H(θ, p) − H(θ*, p*))); otherwise, retain the current θ. This Metropolis correction compensates for the discretization error of the leapfrog integrator, ensuring the chain targets the correct distribution.
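Putting the pieces together, a single HMC transition might be sketched as below. This is a simplified illustration assuming M = I; the name hmc_step and the arguments are illustrative, not a reference implementation:

```python
import numpy as np

def hmc_step(theta, U, grad_U, eps, L, rng):
    """One HMC transition, assuming an identity mass matrix M = I."""
    p = rng.standard_normal(theta.shape)       # 1. resample momentum p ~ N(0, I)
    theta_new, p_new = np.copy(theta), np.copy(p)

    # 2. simulate the dynamics with L leapfrog steps of size eps
    p_new -= 0.5 * eps * grad_U(theta_new)
    for step in range(L):
        theta_new += eps * p_new
        if step < L - 1:
            p_new -= eps * grad_U(theta_new)
    p_new -= 0.5 * eps * grad_U(theta_new)

    # 3. Metropolis correction for the integrator's discretization error
    h_cur = U(theta) + 0.5 * np.sum(p ** 2)
    h_new = U(theta_new) + 0.5 * np.sum(p_new ** 2)
    if rng.random() < np.exp(h_cur - h_new):
        return theta_new                       # accept the proposal
    return theta                               # reject: keep the current state
```

Run repeatedly with, say, U(θ) = ½‖θ‖², the resulting samples should match a standard normal distribution.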
A conceptual illustration of an HMC trajectory moving across contours of a potential energy surface (representing the negative log posterior). The trajectory explores different parameter values while tending to stay within regions of similar energy.
HMC leverages the gradient information ∇θU(θ) to guide its proposals intelligently. Instead of a random walk, it proposes moves informed by the shape of the posterior landscape. By simulating dynamics for multiple steps L, HMC can generate proposals far from the current state that still have a high probability of acceptance, especially if ϵ is chosen appropriately. This drastically reduces the autocorrelation between samples and allows for much faster exploration of the target distribution compared to random-walk Metropolis or Gibbs sampling, particularly when dealing with many parameters or strong correlations between them.
HMC represents a significant step forward in MCMC technology. Its ability to use gradients for efficient exploration makes it a workhorse algorithm for many modern Bayesian inference problems. Understanding its mechanics provides insight into how these advanced samplers operate and sets the stage for understanding adaptive extensions like the No-U-Turn Sampler (NUTS), which automates the tuning of L.
© 2025 ApX Machine Learning