While MCMC methods provide asymptotically exact samples from the posterior distribution $p(w \mid D)$ of a Bayesian Neural Network's weights $w$, they often struggle with the high dimensionality and large datasets typical of deep learning. The computational cost per iteration and the number of iterations required for convergence can become prohibitive. This is where Variational Inference (VI) offers a compelling alternative. Instead of sampling, VI reframes Bayesian inference as an optimization problem, seeking an approximate distribution $q_\phi(w)$ from a tractable family that is closest to the true posterior $p(w \mid D)$, where closeness is typically measured by the Kullback-Leibler (KL) divergence $\mathrm{KL}[q_\phi(w) \,\|\, p(w \mid D)]$.
As we saw in Chapter 3, minimizing this KL divergence is equivalent to maximizing the Evidence Lower Bound (ELBO):
$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(w)}[\log p(D \mid w)] - \mathrm{KL}[q_\phi(w) \,\|\, p(w)]$$

Here, $p(D \mid w)$ is the likelihood of the data given the weights, $p(w)$ is the prior distribution over the weights, and $q_\phi(w)$ is the variational approximation parameterized by $\phi$. The first term encourages the approximate posterior to explain the data well, while the second term acts as a regularizer, keeping the approximation close to the prior.
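For reference, the identity behind this equivalence (derived in Chapter 3) is the decomposition of the log evidence:

$$\log p(D) = \underbrace{\mathbb{E}_{q_\phi(w)}[\log p(D \mid w)] - \mathrm{KL}[q_\phi(w) \,\|\, p(w)]}_{\mathcal{L}(\phi)} + \mathrm{KL}[q_\phi(w) \,\|\, p(w \mid D)]$$

Since $\log p(D)$ does not depend on $\phi$, maximizing $\mathcal{L}(\phi)$ is the same as minimizing the KL divergence to the true posterior.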
However, applying traditional VI methods such as Coordinate Ascent Variational Inference (CAVI) directly to BNNs is difficult: the non-linear activation functions induce complex, non-conjugate relationships between the weights and the data, and the sheer number of parameters compounds the problem.
Bayes by Backprop: Optimizing the ELBO with Gradients
A breakthrough technique for applying VI to BNNs is the method introduced in "Weight Uncertainty in Neural Networks" by Blundell et al. (2015), often referred to as Bayes by Backprop (BBB). The core idea is to optimize the ELBO with respect to the variational parameters $\phi$ using stochastic gradient ascent, much as standard deep learning models are trained using backpropagation.
The main challenge lies in computing the gradient of the ELBO, specifically of the expectation term $\mathbb{E}_{q_\phi(w)}[\log p(D \mid w)]$: the expectation is taken with respect to $q_\phi(w)$, and the parameters $\phi$ we want to differentiate with respect to sit inside this distribution. Taking gradients of such expectations is tricky.
The Reparameterization Trick
Bayes by Backprop cleverly sidesteps this issue using the reparameterization trick. Instead of sampling weights $w$ directly from $q_\phi(w)$, we introduce an auxiliary noise variable $\epsilon$ with a fixed distribution (e.g., a standard Gaussian, $\epsilon \sim \mathcal{N}(0, I)$) and define a deterministic transformation $w = g(\phi, \epsilon)$ such that the resulting $w$ follows $q_\phi(w)$.

For the common choice of $q_\phi(w)$ as a diagonal Gaussian, where $\phi = \{\mu, \rho\}$ comprises the mean vector $\mu$ and a parameter vector $\rho$ that determines the standard deviation vector $\sigma$ (often $\sigma = \log(1 + \exp(\rho))$, the softplus, to guarantee positivity), the reparameterization is simple:

$$w = \mu + \sigma \odot \epsilon = \mu + \log(1 + \exp(\rho)) \odot \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I)$$
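As a concrete illustration, here is a minimal PyTorch sketch of this sampling step (the function name and shapes are ours, not from the original paper):

```python
import torch
import torch.nn.functional as F

def sample_weights(mu: torch.Tensor, rho: torch.Tensor) -> torch.Tensor:
    """One reparameterized draw w = mu + softplus(rho) * eps."""
    sigma = F.softplus(rho)        # sigma = log(1 + exp(rho)), always positive
    eps = torch.randn_like(mu)     # eps ~ N(0, I), independent of phi = {mu, rho}
    return mu + sigma * eps        # differentiable with respect to mu and rho
```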
Now, the expectation in the ELBO can be rewritten with respect to the fixed distribution of $\epsilon$:

$$\mathbb{E}_{q_\phi(w)}[\log p(D \mid w)] = \mathbb{E}_{\epsilon \sim p(\epsilon)}[\log p(D \mid g(\phi, \epsilon))]$$
The gradient $\nabla_\phi$ can now be pushed inside the expectation, since the distribution of $\epsilon$ is fixed; the chain rule then propagates the gradient through the deterministic map $g(\phi, \epsilon)$:

$$\nabla_\phi \, \mathbb{E}_{q_\phi(w)}[\log p(D \mid w)] = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\nabla_\phi \log p(D \mid g(\phi, \epsilon))\right]$$
This expectation is typically approximated using Monte Carlo sampling. For a mini-batch of data $D_i$, we sample one (or a few) values of $\epsilon$, compute the corresponding $w = g(\phi, \epsilon)$, evaluate the log-likelihood $\log p(D_i \mid w)$, and then compute the gradients $\nabla_\phi \log p(D_i \mid w)$ using standard backpropagation through the deterministic transformation $g$ and the network itself.
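Continuing the sketch above, a single-sample Monte Carlo estimate of this gradient falls out of ordinary backpropagation; the single linear layer and the data shapes are illustrative stand-ins for a real network and mini-batch:

```python
# Hypothetical setup: one linear layer standing in for the full network.
in_features, num_classes = 20, 3
mu = torch.zeros(in_features, num_classes, requires_grad=True)
rho = torch.full((in_features, num_classes), -5.0, requires_grad=True)
x = torch.randn(32, in_features)              # mini-batch inputs
y = torch.randint(0, num_classes, (32,))      # mini-batch labels

w = sample_weights(mu, rho)                   # reparameterized draw; stays in the autograd graph
log_lik = -F.cross_entropy(x @ w, y, reduction="sum")  # log p(D_i | w) for classification
log_lik.backward()                            # mu.grad, rho.grad: one-sample MC gradient estimate
```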
The Full Objective and Gradient
Combining the likelihood term and the KL divergence term, the objective function to maximize (the ELBO) for a single data point $(x, y)$ becomes:

$$\mathcal{L}(\phi) \approx \log p(y \mid x, w) - \mathrm{KL}[q_\phi(w) \,\|\, p(w)]$$

where $w = g(\phi, \epsilon)$ is sampled using the reparameterization trick. The KL divergence term $\mathrm{KL}[q_\phi(w) \,\|\, p(w)]$ can often be computed analytically if $q_\phi(w)$ and $p(w)$ are chosen conveniently (e.g., both Gaussian).
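For instance, with a diagonal Gaussian $q_\phi(w)$ and a standard normal prior $p(w) = \mathcal{N}(0, I)$, the KL term has a well-known closed form; a minimal sketch (the function name is ours):

```python
def kl_to_std_normal(mu: torch.Tensor, rho: torch.Tensor) -> torch.Tensor:
    """Analytic KL[ N(mu, sigma^2) || N(0, I) ], summed over all weights."""
    sigma = F.softplus(rho)
    return 0.5 * torch.sum(sigma**2 + mu**2 - 1.0 - 2.0 * torch.log(sigma))
```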
The loss function used in training is typically the negative ELBO, scaled for the full dataset or for mini-batches:

$$\text{Loss} = -\sum_{i=1}^{N} \left( \mathbb{E}_{q_\phi(w)}\left[\log p(y_i \mid x_i, w)\right] - \frac{1}{N}\, \mathrm{KL}[q_\phi(w) \,\|\, p(w)] \right)$$

In practice, with mini-batches of size $M$, the loss is approximated as:

$$\text{Loss}_{\text{mini-batch}} \approx -\sum_{j=1}^{M} \log p(y_j \mid x_j, w_j) + \frac{1}{B}\, \mathrm{KL}[q_\phi(w) \,\|\, p(w)]$$
where $w_j = g(\phi, \epsilon_j)$ is sampled independently for each data point in the mini-batch (or a single sample $w$ is shared across the whole batch), and $B$ is the number of mini-batches in the dataset, so that the KL term is counted exactly once over a full epoch. The gradients of this loss with respect to $\phi$ (the means $\mu$ and variance parameters $\rho$) are computed via backpropagation and used to update $\phi$ with optimizers such as Adam or RMSprop.
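Putting these pieces together, one training step might look like the following sketch, reusing `sample_weights` and `kl_to_std_normal` from above; `num_batches` plays the role of $B$ and is an assumed dataset property:

```python
optimizer = torch.optim.Adam([mu, rho], lr=1e-3)
num_batches = 100                      # B: mini-batches per epoch (assumed)

def training_step(x: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad()
    w = sample_weights(mu, rho)                        # fresh reparameterized sample per step
    nll = F.cross_entropy(x @ w, y, reduction="sum")   # -sum_j log p(y_j | x_j, w_j)
    kl = kl_to_std_normal(mu, rho) / num_batches       # KL term scaled by 1/B
    loss = nll + kl                                    # negative mini-batch ELBO
    loss.backward()                                    # gradients with respect to mu and rho
    optimizer.step()
    return loss.item()
```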
Figure: The Bayes by Backprop training process. Variational parameters $\phi$ (means $\mu$ and variance parameters $\rho$) define the approximate posterior $q_\phi(w)$. In each step, noise $\epsilon$ is sampled to generate network weights $w$ via reparameterization. A forward pass computes the predicted output for the mini-batch data. The loss (negative ELBO) combines the log-likelihood of the data and the KL divergence between the approximate posterior $q_\phi(w)$ and the prior $p(w)$. Gradients of the loss with respect to $\phi$ are computed using backpropagation, and $\phi$ is updated using an optimizer.
Practical Notes
- Choice of Prior $p(w)$: A standard Gaussian prior $\mathcal{N}(0, \sigma_p^2 I)$ is common, simplifying the KL divergence calculation when $q_\phi(w)$ is also Gaussian. More complex priors can be used but might require numerical estimation of the KL term.
- Initialization: Initializing the means $\mu$ similarly to standard network weights and keeping the initial variances (controlled by $\rho$) small often helps stability.
- Variance Parameterization: Using $\rho$ with $\sigma = \log(1 + \exp(\rho))$ (the softplus function) ensures $\sigma$ is always positive.
- Gradient Variance: The stochastic nature of the gradient estimates (due to sampling $\epsilon$) can lead to high variance during training. Using more Monte Carlo samples per step or variance reduction methods (such as the local reparameterization trick, which applies the reparameterization to pre-activations instead of weights) can help, though potentially at increased computational cost.
- Computational Cost: While more scalable than MCMC, VI for BNNs is still more computationally expensive than training standard deterministic NNs. Each forward/backward pass involves sampling weights and calculating the KL term, the number of parameters roughly doubles ($\mu$ and $\rho$ for each weight), and the computation graph becomes more complex; the layer sketch after this list makes the doubling concrete.
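To make these notes concrete, here is a minimal self-contained mean-field Bayesian linear layer, a sketch under the assumptions above (softplus parameterization, standard normal prior, small initial variances); the class and method names are illustrative:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Mean-field Gaussian linear layer: twice the parameters (mu and rho per weight)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Means initialized like standard weights; rho = -5 gives small initial sigma.
        self.w_mu = nn.Parameter(torch.randn(out_features, in_features) / math.sqrt(in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -5.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -5.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reparameterized weight and bias samples, drawn fresh on every forward pass.
        w = self.w_mu + F.softplus(self.w_rho) * torch.randn_like(self.w_mu)
        b = self.b_mu + F.softplus(self.b_rho) * torch.randn_like(self.b_mu)
        return F.linear(x, w, b)

    def kl(self) -> torch.Tensor:
        """Analytic KL[q_phi(w) || N(0, I)], summed over weights and biases."""
        kl = torch.tensor(0.0)
        for mu, rho in [(self.w_mu, self.w_rho), (self.b_mu, self.b_rho)]:
            sigma = F.softplus(rho)
            kl = kl + 0.5 * torch.sum(sigma**2 + mu**2 - 1.0 - 2.0 * torch.log(sigma))
        return kl
```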
Advantages and Limitations of VI for BNNs
Advantages:
- Scalability: Uses mini-batch stochastic gradient optimization, making it applicable to large datasets and complex models where MCMC is infeasible.
- Compatibility: Integrates relatively well with existing deep learning frameworks and hardware acceleration (GPUs/TPUs).
- Single Optimization: Provides a point estimate of the variational parameters $\phi$ after a single optimization run, unlike MCMC, which requires multiple chains and convergence checks.
Limitations:
- Approximation Quality: The accuracy of the posterior approximation is limited by the chosen variational family $q_\phi(w)$. Simple families such as mean-field Gaussians might not capture complex dependencies in the true posterior.
- Optimization Difficulties: Optimizing the ELBO can be challenging due to noisy gradients and potential local optima.
- Underestimation of Variance: VI with the reverse KL objective $\mathrm{KL}[q \,\|\, p]$ is known to sometimes underestimate the variance of the posterior distribution, since the objective penalizes placing probability mass where the true posterior has little, potentially leading to overconfident predictions.
Compared to Monte Carlo Dropout (discussed in a later section), Bayes by Backprop represents a more principled and flexible VI approach, explicitly defining and optimizing parameters for the approximate posterior distribution over weights. While often more computationally intensive than MC Dropout, it allows for more control over the prior and the variational approximation, potentially leading to more accurate uncertainty estimates if the optimization succeeds and the variational family is adequate.