As we established, Bayesian Neural Networks (BNNs) offer a principled way to incorporate uncertainty into deep learning by treating network weights w as random variables with a posterior distribution p(w∣D). The primary challenge lies in characterizing this high-dimensional posterior. While Chapter 2 introduced powerful MCMC methods like Hamiltonian Monte Carlo (HMC), directly applying them to large neural networks encounters significant computational hurdles.
The Scalability Problem with Standard MCMC
Standard HMC requires calculating the gradient of the log posterior with respect to all parameters w. The log posterior is given by Bayes' theorem:
log p(w∣D) = log p(D∣w) + log p(w) − log p(D)
The gradient term ∇_w log p(w∣D) involves the gradient of the log likelihood, ∇_w log p(D∣w), and the gradient of the log prior, ∇_w log p(w). Assuming the data points D = {x_i, y_i}_{i=1}^{N} are i.i.d., the log likelihood is a sum over the entire dataset:
log p(D∣w) = ∑_{i=1}^{N} log p(y_i ∣ x_i, w)
Calculating ∇_w log p(D∣w) requires a full pass over all N data points. For deep learning datasets where N can be millions or billions, computing this gradient at every step of the HMC leapfrog integration becomes prohibitively expensive. This prevents standard HMC from scaling to typical deep learning scenarios.
Adapting MCMC: Stochastic Gradient Methods
The solution mirrors the approach used to train standard deep neural networks: stochastic gradient estimation using mini-batches. Instead of computing the gradient using the full dataset, we approximate it using a small, randomly sampled subset (a mini-batch) D_t = {x_i, y_i}_{i∈I_t} of size M ≪ N at each iteration t.
The gradient of the log posterior is approximated as:
∇_w log p(w∣D) ≈ (N/M) ∑_{i∈I_t} ∇_w log p(y_i ∣ x_i, w) + ∇_w log p(w)
This introduces noise into the gradient estimates. Simply plugging these noisy gradients into standard HMC dynamics leads to incorrect sampling behavior; the simulated trajectories diverge, and the sampler does not converge to the true posterior distribution.
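To make the estimator concrete, here is a short NumPy sketch for a toy Bayesian linear regression model with a Gaussian prior. The model, function names, and hyperparameters are our own illustrative choices, not part of the chapter; the key point is the N/M rescaling that makes the mini-batch likelihood gradient an unbiased estimate of the full-data sum:

```python
import numpy as np

def grad_log_posterior_minibatch(w, X, y, batch_idx, prior_var=1.0, noise_var=1.0):
    """Unbiased mini-batch estimate of the gradient of log p(w|D) for a
    toy Bayesian linear regression with a Gaussian prior (illustrative)."""
    N, M = X.shape[0], len(batch_idx)
    Xb, yb = X[batch_idx], y[batch_idx]
    # Mini-batch gradient of the Gaussian log likelihood, rescaled by
    # N/M so it is an unbiased estimate of the full-data sum.
    grad_ll = (N / M) * Xb.T @ (yb - Xb @ w) / noise_var
    # Gradient of the Gaussian log prior, computed exactly (no rescaling).
    grad_prior = -w / prior_var
    return grad_ll + grad_prior

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
batch_idx = rng.choice(1000, size=32, replace=False)
g = grad_log_posterior_minibatch(np.zeros(5), X, y, batch_idx)
```

Note that only the likelihood term is rescaled: the prior gradient depends on w alone and is always computed exactly.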
Stochastic Gradient Hamiltonian Monte Carlo (SGHMC)
Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) is an adaptation of HMC specifically designed to handle noisy gradients from mini-batches. Introduced by Chen, Fox, and Guestrin (2014), SGHMC modifies the Hamiltonian dynamics equations to account for the gradient noise, ensuring convergence to the correct target distribution under certain conditions.
Recall that HMC simulates a physical system with position w (the weights) and momentum p. The dynamics are governed by the Hamiltonian H(w, p) = U(w) + K(p), where U(w) = −log p(w∣D) is the potential energy (negative log posterior) and K(p) is the kinetic energy.
SGHMC modifies the momentum update step. Instead of directly using the noisy gradient estimate ∇_w Ũ(w) (where Ũ(w) is the potential energy estimated from a mini-batch), SGHMC introduces a friction term C. The discretized update equations for a step size ϵ look like this:
- Position update:
w_{t+1} = w_t + ϵ M^{-1} p_t
(Here M^{-1} is the inverse mass matrix, often taken as the identity; this mass matrix should not be confused with the mini-batch size M.)
- Momentum update:
p_{t+1} = p_t − ϵ ∇_w Ũ(w_t) − ϵ C M^{-1} p_t + N(0, 2ϵ(C − B̂))
Let's break down the momentum update:
- −ϵ ∇_w Ũ(w_t): The standard force term, but using the stochastic gradient of the potential energy.
- −ϵ C M^{-1} p_t: The added friction term. This term dampens the momentum, helping to counteract the noise injected by the stochastic gradient. C is a user-specified positive definite matrix (often diagonal, C = αI). It acts similarly to momentum decay in optimization algorithms like SGD with momentum.
- N(0, 2ϵ(C − B̂)): A Gaussian noise term added to the momentum update. This is essential. B̂ is an estimate of the covariance of the noise in the stochastic gradient estimate. This injected noise compensates exactly for the effects of discretization and the friction term, ensuring that the sampler's stationary distribution remains the target posterior p(w∣D). In practice, estimating B̂ can be complex, and often it is assumed to be zero or a simple scalar, requiring adjustment of C.
The introduction of the friction term C is the main adaptation. It helps to control the variance introduced by the noisy gradients, stabilizing the simulation. The additional noise term corrects the dynamics to ensure theoretical guarantees of converging to the correct posterior distribution.
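The two update equations can be sketched as a single step function. This is a minimal illustration with identity mass matrix, scalar friction C = αI, and B̂ treated as a scalar (the common simplifications noted above); the toy target and all hyperparameter values are our own choices, not from the chapter:

```python
import numpy as np

def sghmc_step(w, p, grad_U_tilde, eps, C, B_hat, rng):
    """One SGHMC update with identity mass matrix.
    grad_U_tilde(w) returns a stochastic estimate of the gradient of the
    potential energy U(w) = -log p(w|D); C and B_hat are scalars here."""
    # Position update: w_{t+1} = w_t + eps * p_t  (M^{-1} = I)
    w = w + eps * p
    # Momentum update: stochastic force, friction, and injected noise
    # with variance 2*eps*(C - B_hat).
    noise = rng.normal(size=p.shape) * np.sqrt(2 * eps * (C - B_hat))
    p = p - eps * grad_U_tilde(w) - eps * C * p + noise
    return w, p

# Toy check: sample a standard normal, U(w) = w^2/2 so grad U = w,
# with artificial gradient noise standing in for mini-batch noise.
rng = np.random.default_rng(0)
grad_U_tilde = lambda w: w + 0.1 * rng.normal(size=w.shape)
w, p = np.zeros(1), np.zeros(1)
chain = []
for t in range(20000):
    w, p = sghmc_step(w, p, grad_U_tilde, eps=0.1, C=1.0, B_hat=0.0, rng=rng)
    chain.append(w[0])
samples = np.array(chain[5000:])  # discard burn-in
```

With these settings the empirical mean and variance of the retained samples should sit near the target's 0 and 1, up to discretization bias of order ϵ.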
Practical Considerations for SGHMC
Implementing SGHMC for BNNs involves several choices:
- Mini-batch Size (M): Larger batches reduce gradient noise but increase computation per step. A balance must be found.
- Step Size (ϵ): Similar to learning rates in SGD. Needs careful tuning. Too large leads to instability; too small leads to slow exploration. Techniques like cyclical step sizes can sometimes help.
- Friction Term (C): Controls the damping. Higher friction leads to more stability but potentially slower exploration of the state space. It needs to be balanced with the estimated noise covariance B^ (if used) and the step size. Often tuned empirically.
- Integration Steps: Like HMC, SGHMC involves simulating dynamics for a certain number of steps (leapfrog steps) before generating a sample. The number of steps affects computational cost and exploration efficiency.
- Burn-in and Thinning: Standard MCMC practices apply. Initial samples (burn-in) are discarded, and samples might be thinned to reduce autocorrelation.
- Convergence Diagnostics: Assessing convergence is more challenging with stochastic gradients compared to standard MCMC. Standard diagnostics like R̂ might be less reliable. Observing trace plots of key parameters or model performance metrics over iterations is important.
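The burn-in and thinning practice mentioned above reduces to simple array slicing once the raw chain is stored. The array shapes and the burn-in/thinning values below are our own illustrative choices:

```python
import numpy as np

# Stand-in for a raw chain of 10000 draws of a 5-dimensional parameter.
raw_chain = np.random.default_rng(1).normal(size=(10000, 5))

burn_in, thin = 2000, 10
# Discard the first 2000 draws, then keep every 10th of the remainder
# to reduce autocorrelation between retained samples.
samples = raw_chain[burn_in::thin]
print(samples.shape)  # (800, 5)
```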
Other Stochastic Gradient MCMC Methods
SGHMC is a prominent method, but others exist:
- Stochastic Gradient Langevin Dynamics (SGLD): A simpler method that essentially adds Gaussian noise to SGD updates. It corresponds to the SGHMC dynamics when friction C is high (overdamped limit). It can be easier to implement but may mix slower than SGHMC in some scenarios.
Δw_t = (ϵ_t / 2) (∇_w log p(w_t) + (N/M) ∑_{i∈I_t} ∇_w log p(y_i ∣ x_i, w_t)) + η_t
where η_t ∼ N(0, ϵ_t). The step size ϵ_t must decay over time for convergence.
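The SGLD update is simple enough to sketch in full. Below, the decaying schedule ϵ_t = a(b + t)^(−γ) follows the form commonly used with SGLD (after Welling and Teh, 2011); the toy target and all constants are our own illustrative choices:

```python
import numpy as np

def sgld_step(w, grad_log_post_est, eps_t, rng):
    """One SGLD update: half the step size times the (mini-batch)
    log-posterior gradient estimate, plus noise eta_t ~ N(0, eps_t)."""
    eta = rng.normal(size=w.shape) * np.sqrt(eps_t)
    return w + 0.5 * eps_t * grad_log_post_est(w) + eta

# Toy check: target N(0, 1), so grad log p(w) = -w (no mini-batch
# noise needed for the illustration); decaying step size schedule.
rng = np.random.default_rng(0)
w = np.zeros(1)
chain = []
for t in range(20000):
    eps_t = 0.5 * (10 + t) ** -0.55      # a=0.5, b=10, gamma=0.55
    w = sgld_step(w, lambda v: -v, eps_t, rng)
    chain.append(w[0])
samples = np.array(chain[5000:])  # discard burn-in
```

Note the contrast with SGHMC: there is no momentum variable and no friction term to tune, but as the step size decays, successive samples become increasingly correlated, which is one source of the slower mixing mentioned above.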
Advantages and Disadvantages
Advantages of SGMCMC for BNNs:
- Scalability: Enables MCMC sampling for large datasets and complex models where standard MCMC is infeasible.
- Full Posterior Characterization (in theory): Unlike VI, MCMC methods aim to sample from the true posterior, potentially capturing complex shapes and multi-modality better, given enough time.
- Theoretical Grounding: Builds upon the well-established theory of MCMC.
Disadvantages:
- Tuning Complexity: Requires careful tuning of step size, friction, mini-batch size, and potentially noise estimation. Poor tuning can lead to slow convergence or divergence.
- Computational Cost: While more scalable than standard MCMC, SGMCMC methods are still generally much slower than VI or standard gradient-based optimization for training deep networks. Generating many independent samples can take significant time.
- Convergence Diagnostics: Assessing convergence robustly remains an open area of research and can be difficult in practice.
In summary, stochastic gradient MCMC methods like SGHMC and SGLD provide a way to apply the principles of MCMC sampling to the challenging domain of Bayesian deep learning. They offer a path towards obtaining samples from the BNN posterior distribution, enabling richer uncertainty quantification than point estimates, but require careful implementation and tuning due to the noisy gradient estimates inherent in mini-batch processing. They represent a computationally intensive but potentially more accurate alternative to the variational methods discussed next.