While MCMC methods provide asymptotically exact samples from the posterior distribution $p(w \mid \mathcal{D})$ over a Bayesian Neural Network's weights $w$, they often struggle with the high dimensionality and large datasets typical of deep learning: the computational cost per iteration and the number of iterations required for convergence can become prohibitive. This is where Variational Inference (VI) offers a compelling alternative. Instead of sampling, VI reframes Bayesian inference as an optimization problem, seeking an approximate distribution $q_\phi(w)$ from a tractable family that is as close as possible to the true posterior $p(w \mid \mathcal{D})$, where closeness is typically measured by the Kullback-Leibler (KL) divergence $\mathrm{KL}[q_\phi(w) \,\|\, p(w \mid \mathcal{D})]$.
As we saw in Chapter 3, minimizing this KL divergence is equivalent to maximizing the Evidence Lower Bound (ELBO):
$$
\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(w)}\left[\log p(\mathcal{D} \mid w)\right] - \mathrm{KL}\left[q_\phi(w) \,\|\, p(w)\right]
$$

Here, $p(\mathcal{D} \mid w)$ is the likelihood of the data given the weights, $p(w)$ is the prior distribution over the weights, and $q_\phi(w)$ is the variational approximation parameterized by $\phi$. The first term encourages the approximate posterior to explain the data well, while the second term acts as a regularizer, keeping the approximation close to the prior.
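To make the two terms concrete, here is a minimal single-weight illustration in PyTorch, with a 1-D Gaussian posterior and prior and a single observation; the specific numbers are arbitrary, and the reparameterized sampling (`rsample`) is explained below:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Toy 1-D setting: q_phi(w) = N(mu, sigma^2) with phi = {mu, log_sigma},
# prior p(w) = N(0, 1), and a single observation y = 0.8 with unit noise.
mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(-1.0, requires_grad=True)

q = Normal(mu, log_sigma.exp())   # approximate posterior q_phi(w)
p = Normal(0.0, 1.0)              # prior p(w)

w = q.rsample()                                       # sample w ~ q_phi(w)
log_lik = Normal(w, 1.0).log_prob(torch.tensor(0.8))  # log p(D | w)
elbo = log_lik - kl_divergence(q, p)                  # data fit - complexity
```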
However, applying traditional VI methods like Coordinate Ascent Variational Inference (CAVI) directly to BNNs is difficult due to the complex, non-conjugate relationships between the weights and the data introduced by non-linear activation functions and the sheer number of parameters.
A breakthrough technique for applying VI to BNNs is Weight Uncertainty in Neural Networks, often referred to as Bayes by Backprop (BBB), proposed by Blundell et al. (2015). The core idea is to optimize the ELBO with respect to the variational parameters ϕ using stochastic gradient ascent, much like standard deep learning models are trained using backpropagation.
The main challenge lies in computing the gradient of the ELBO, specifically of the expectation term $\mathbb{E}_{q_\phi(w)}[\log p(\mathcal{D} \mid w)]$. The expectation is taken with respect to $q_\phi(w)$, so the parameters $\phi$ we want to differentiate sit inside the sampling distribution itself, and the gradient cannot simply be moved inside the expectation.
Bayes by Backprop cleverly sidesteps this issue using the reparameterization trick. Instead of sampling weights $w$ directly from $q_\phi(w)$, we introduce an auxiliary noise variable $\epsilon$ with a fixed distribution (e.g., a standard Gaussian, $\epsilon \sim \mathcal{N}(0, I)$) and define a deterministic transformation $w = g(\phi, \epsilon)$ such that the resulting $w$ has the distribution $q_\phi(w)$.
For the common choice of $q_\phi(w)$ as a diagonal Gaussian, where $\phi = \{\mu, \rho\}$ consists of the mean vector $\mu$ and a parameter vector $\rho$ that determines the standard deviation vector $\sigma$ (often $\sigma = \log(1 + \exp(\rho))$, the softplus transform, to ensure positivity), the reparameterization is simple:
$$
w = \mu + \sigma \odot \epsilon = \mu + \log(1 + \exp(\rho)) \odot \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I)
$$
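A minimal PyTorch sketch of this sampling step; `F.softplus` computes exactly $\log(1 + \exp(\cdot))$, and the shapes and initial values here are illustrative:

```python
import torch
import torch.nn.functional as F

# Variational parameters phi = {mu, rho} for one weight matrix (shapes are
# illustrative). rho is initialized negative so sigma starts small.
mu = torch.zeros(64, 32, requires_grad=True)
rho = torch.full((64, 32), -3.0, requires_grad=True)

eps = torch.randn_like(mu)   # eps ~ N(0, I): fixed, parameter-free noise
sigma = F.softplus(rho)      # sigma = log(1 + exp(rho)) > 0
w = mu + sigma * eps         # reparameterized sample, differentiable in phi
```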
With $w = g(\phi, \epsilon)$, the expectation in the ELBO can be rewritten with respect to the fixed distribution of $\epsilon$:

$$
\mathbb{E}_{q_\phi(w)}\left[\log p(\mathcal{D} \mid w)\right] = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\log p(\mathcal{D} \mid g(\phi, \epsilon))\right]
$$

The gradient $\nabla_\phi$ can now be pushed inside the expectation, because $p(\epsilon)$ does not depend on $\phi$; the inner gradient is then evaluated with the chain rule through the deterministic map $g$:
$$
\nabla_\phi \, \mathbb{E}_{q_\phi(w)}\left[\log p(\mathcal{D} \mid w)\right] = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\nabla_\phi \log p(\mathcal{D} \mid g(\phi, \epsilon))\right]
$$

This expectation is typically approximated by Monte Carlo sampling. For a mini-batch of data $\mathcal{D}_i$, we sample one (or a few) values of $\epsilon$, compute the corresponding $w = g(\phi, \epsilon)$, evaluate the log-likelihood $\log p(\mathcal{D}_i \mid w)$, and then compute the gradients $\nabla_\phi \log p(\mathcal{D}_i \mid w)$ with standard backpropagation through the deterministic transformation $g$ and the network itself.
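As a sketch, here is one such Monte Carlo gradient step for a toy linear "network", assuming a unit-variance Gaussian likelihood so that $\log p(\mathcal{D}_i \mid w)$ is a negative sum of squared errors (up to a constant); autograd handles the backpropagation through $g$:

```python
import torch
import torch.nn.functional as F

# Variational parameters for a single 10 -> 1 weight vector.
mu = (0.1 * torch.randn(10, 1)).requires_grad_()
rho = torch.full((10, 1), -3.0, requires_grad=True)

x = torch.randn(128, 10)   # mini-batch inputs D_i
y = torch.randn(128, 1)    # mini-batch targets

eps = torch.randn_like(mu)      # one Monte Carlo sample of the noise
w = mu + F.softplus(rho) * eps  # w = g(phi, eps)

pred = x @ w
log_lik = -0.5 * ((y - pred) ** 2).sum()  # log p(D_i | w) up to a constant

# Backprop through the network and through g: gradients land on mu and rho.
(-log_lik).backward()
print(mu.grad.shape, rho.grad.shape)
```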
Combining the likelihood term and the KL divergence term, the objective function to maximize (the ELBO) for a single data point $(x, y)$ becomes:
$$
\mathcal{L}(\phi) \approx \log p(y \mid x, w) - \mathrm{KL}\left[q_\phi(w) \,\|\, p(w)\right]
$$

where $w = g(\phi, \epsilon)$ is sampled using the reparameterization trick. The KL divergence term $\mathrm{KL}[q_\phi(w) \,\|\, p(w)]$ can often be computed analytically if $q_\phi(w)$ and $p(w)$ are chosen conveniently (e.g., both Gaussian).
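For example, with a diagonal Gaussian posterior $\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and a zero-mean isotropic Gaussian prior $\mathcal{N}(0, \sigma_p^2 I)$ (an illustrative assumption), the KL divergence has the closed form $\sum_k \left[ \log(\sigma_p / \sigma_k) + (\sigma_k^2 + \mu_k^2)/(2\sigma_p^2) - \tfrac{1}{2} \right]$, which translates directly into a small helper (a sketch):

```python
import torch
import torch.nn.functional as F

def kl_diag_gaussian(mu, rho, prior_sigma=1.0):
    """Closed-form KL[q_phi(w) || p(w)] for q = N(mu, diag(sigma^2))
    and an isotropic Gaussian prior p = N(0, prior_sigma^2 I)."""
    sigma = F.softplus(rho)
    kl = (torch.log(prior_sigma / sigma)
          + (sigma ** 2 + mu ** 2) / (2 * prior_sigma ** 2)
          - 0.5)
    return kl.sum()
```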
The loss function used in training is typically the negative ELBO, scaled for the full dataset or mini-batches:
$$
\mathrm{Loss} = -\sum_{i=1}^{N} \left( \mathbb{E}_{q_\phi(w)}\left[\log p(y_i \mid x_i, w)\right] - \frac{1}{N} \mathrm{KL}\left[q_\phi(w) \,\|\, p(w)\right] \right)
$$

In practice, with mini-batches of size $M$, the loss is approximated as:
$$
\mathrm{Loss}_{\text{mini-batch}} \approx -\sum_{j=1}^{M} \log p(y_j \mid x_j, w_j) + \frac{1}{B} \mathrm{KL}\left[q_\phi(w) \,\|\, p(w)\right]
$$

where $w_j = g(\phi, \epsilon_j)$ is sampled independently for each data point in the mini-batch (or a single sample $w$ is shared across the whole batch), and $B$ is the number of mini-batches in the dataset, so that the KL term is counted exactly once per pass over the data. The gradients of this loss with respect to $\phi$ (the means $\mu$ and the variance parameters $\rho$) are computed via backpropagation and used to update $\phi$ with optimizers such as Adam or RMSprop.
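Putting these pieces together, a Bayesian linear layer implementing the reparameterized forward pass and the analytic KL term might look like the following sketch; the softplus parameterization, the isotropic Gaussian prior, the omission of biases, and the initialization values are illustrative choices, not the only ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer with a diagonal-Gaussian variational posterior over
    its weights and an isotropic Gaussian prior (a sketch only)."""

    def __init__(self, in_features, out_features, prior_sigma=1.0):
        super().__init__()
        self.mu = nn.Parameter(0.1 * torch.randn(out_features, in_features))
        self.rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.prior_sigma = prior_sigma

    def forward(self, x):
        # Fresh reparameterized weight sample on every forward pass.
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)
        return F.linear(x, w)

    def kl_to_prior(self):
        # Closed-form KL between the diagonal Gaussian posterior and the prior.
        sigma = F.softplus(self.rho)
        kl = (torch.log(self.prior_sigma / sigma)
              + (sigma ** 2 + self.mu ** 2) / (2 * self.prior_sigma ** 2)
              - 0.5)
        return kl.sum()
```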
Figure: The Bayes by Backprop training loop. Variational parameters $\phi$ (means $\mu$ and variance parameters $\rho$) define the approximate posterior $q_\phi(w)$. In each step, noise $\epsilon$ is sampled to generate network weights $w$ via reparameterization. A forward pass computes the predicted output for the mini-batch. The loss (negative ELBO) combines the log-likelihood of the data with the KL divergence between the approximate posterior $q_\phi(w)$ and the prior $p(w)$. Gradients of the loss with respect to $\phi$ are computed by backpropagation, and $\phi$ is updated using an optimizer.
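This loop can be sketched end to end using the `BayesianLinear` layer above; the synthetic data, the unit-variance Gaussian likelihood, and the hyperparameters are illustrative assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic regression data, purely for illustration.
X = torch.randn(512, 10)
Y = X @ torch.randn(10, 1) + 0.1 * torch.randn(512, 1)
loader = DataLoader(TensorDataset(X, Y), batch_size=64, shuffle=True)

model = BayesianLinear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    for x, y in loader:
        optimizer.zero_grad()
        pred = model(x)                                 # sample w, forward pass
        nll = 0.5 * ((y - pred) ** 2).sum()             # Gaussian neg. log-lik.
        loss = nll + model.kl_to_prior() / len(loader)  # negative ELBO, KL / B
        loss.backward()                                 # gradients w.r.t. mu, rho
        optimizer.step()
```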
Advantages:

- Maintains an explicit, parameterized approximate posterior over the weights, giving direct access to weight uncertainty.
- Trains with standard stochastic gradient methods and backpropagation, so it scales to mini-batch training on large datasets.
- Leaves the choice of prior $p(w)$ and variational family $q_\phi(w)$ in the practitioner's hands.

Limitations:

- Doubles the parameter count (a mean $\mu$ and a variance parameter $\rho$ per weight), increasing memory use and per-step computation.
- The fully factorized (mean-field) Gaussian approximation ignores correlations between weights and tends to underestimate posterior uncertainty.
- Optimization can be sensitive to initialization and to the KL scaling, and the Monte Carlo gradient estimate adds variance to training.
Compared to Monte Carlo Dropout (discussed in a later section), Bayes by Backprop represents a more principled and flexible VI approach, explicitly defining and optimizing parameters for the approximate posterior distribution over weights. While often more computationally intensive than MC Dropout, it allows for more control over the prior and the variational approximation, potentially leading to more accurate uncertainty estimates if the optimization succeeds and the variational family is adequate.