Standard neural networks learn a single set of optimal parameters, typically weights and biases, often found by minimizing a loss function using techniques like gradient descent. These represent point estimates in the vast space of possible parameters. While effective for prediction, this approach doesn't inherently capture the uncertainty associated with these parameter values or the resulting predictions. If the training data is limited or noisy, multiple different sets of weights might explain the data almost equally well. A standard network picks just one, potentially becoming overconfident in its predictions.
Bayesian Neural Networks (BNNs) address this by embracing the Bayesian philosophy: treat the network parameters (weights and biases, collectively denoted as $w$) not as fixed values to be optimized, but as random variables. Instead of finding a single best set of weights $\hat{w}$, our goal in a BNN is to infer the full posterior distribution over the weights given the observed data $D$: $p(w \mid D)$. This posterior distribution captures our belief about plausible parameter values after seeing the data, inherently quantifying the uncertainty associated with them.
The foundation of any Bayesian model is the prior distribution, $p(w)$. This distribution encodes our beliefs about the parameters before observing any data. In the context of BNNs, the prior is placed over all the weights and biases in the network.
What does it mean to have a belief about weights? Often, we don't have strong prior information about specific weight values. A common and pragmatic approach is to use a simple, mathematically convenient prior that reflects general assumptions; a frequent choice is an independent Gaussian prior centered at zero for each weight $w_i$:
$$w_i \sim \mathcal{N}(0, \sigma_p^2)$$

This implies a prior for the entire weight vector $w$:
$$p(w) = \prod_i \mathcal{N}(w_i \mid 0, \sigma_p^2)$$

Here, $\sigma_p^2$ is the prior variance, a hyperparameter. This prior expresses a belief that weights are likely to be small and centered around zero. Larger values of $\sigma_p^2$ correspond to a "weaker" prior, allowing weights to deviate further from zero, while smaller values impose a stronger constraint, pushing weights towards zero.
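As a quick illustration, here is a minimal NumPy sketch (the function name and the `prior_var` argument, which plays the role of $\sigma_p^2$, are our own) that evaluates the log density of this prior for any weight array:

```python
import numpy as np

def log_prior(w, prior_var=1.0):
    """Log density of an independent zero-mean Gaussian prior over weights.

    w: array of weights (any shape); prior_var: the prior variance sigma_p^2.
    """
    w = np.ravel(w)
    return (-0.5 * np.sum(w ** 2) / prior_var
            - 0.5 * w.size * np.log(2 * np.pi * prior_var))

# Smaller prior_var penalizes large weights more heavily:
w = np.array([0.5, -1.2, 0.1])
print(log_prior(w, prior_var=1.0))   # weaker prior
print(log_prior(w, prior_var=0.1))   # stronger pull towards zero
```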
Choosing a zero-mean Gaussian prior has a direct connection to L2 regularization (weight decay) used in standard neural network training. Recall that L2 regularization adds a penalty term $\lambda \sum_i w_i^2$ to the loss function. Maximizing the posterior probability (MAP estimation) under a Gaussian prior is equivalent to minimizing a loss function with an L2 penalty, where the regularization strength $\lambda$ is inversely related to the prior variance $\sigma_p^2$, as shown below. However, the full Bayesian approach goes beyond MAP; we aim to characterize the entire posterior distribution $p(w \mid D)$, not just find its mode. The prior influences the shape of this entire distribution.
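To make the connection explicit, take the negative log of the (unnormalized) posterior under the Gaussian prior:

$$-\log p(w \mid D) = -\log p(D \mid w) + \frac{1}{2\sigma_p^2} \sum_i w_i^2 + \text{const}$$

Finding the MAP estimate therefore minimizes the usual data-fit loss plus an L2 penalty with effective strength $\lambda = \frac{1}{2\sigma_p^2}$ (up to the scaling of the likelihood term, such as the noise variance in regression).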
Other priors are possible: a Laplace prior, for example, encourages sparsity in the weights (mirroring L1 regularization), and hierarchical priors treat $\sigma_p^2$ itself as a random variable with its own prior. For now, we'll assume a simple Gaussian prior due to its mathematical convenience and connection to weight decay.
The likelihood function, $p(D \mid w)$, describes the probability of observing the dataset $D = \{(x_n, y_n)\}_{n=1}^{N}$ given a specific setting of the network parameters $w$. This is conceptually the same as in standard neural networks and is determined by the network architecture and the assumed distribution of the output.
Regression: If we assume the target variable $y$ follows a Gaussian distribution around the network's output $f(x; w)$ with noise variance $\sigma^2$, the likelihood for a single data point $(x_n, y_n)$ is $\mathcal{N}(y_n \mid f(x_n; w), \sigma^2)$. The likelihood for the entire dataset (assuming independence) is the product:
$$p(D \mid w) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid f(x_n; w), \sigma^2)$$

Minimizing the negative log-likelihood $-\log p(D \mid w)$ in this case corresponds (for fixed $\sigma^2$) to minimizing the Mean Squared Error (MSE) loss, plus an additive term that depends only on the noise variance $\sigma^2$.
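A minimal NumPy sketch of this correspondence (the function name is our own; we assume a fixed, known noise variance):

```python
import numpy as np

def gaussian_nll(y, y_pred, noise_var=1.0):
    """Negative log-likelihood of targets under N(y_pred, noise_var).

    Equals SSE / (2 * noise_var) + (N / 2) * log(2 * pi * noise_var),
    so for fixed noise_var it differs from the MSE only by an affine
    transformation -- both have the same minimizer.
    """
    n = y.size
    sse = np.sum((y - y_pred) ** 2)
    return sse / (2 * noise_var) + 0.5 * n * np.log(2 * np.pi * noise_var)

y = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(gaussian_nll(y, y_pred))       # NLL
print(np.mean((y - y_pred) ** 2))    # MSE: same minimizer, different scale
```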
Classification: For classification, the network output typically represents probabilities (e.g., via a softmax layer). If $y_n$ is a one-hot encoded label, the likelihood for a single data point is often modeled using a Categorical distribution, where the probabilities are given by the network output $f(x_n; w)$.
$$p(y_n \mid x_n, w) = \mathrm{Categorical}(y_n \mid f(x_n; w))$$

Minimizing the negative log-likelihood $-\log p(D \mid w)$ here corresponds to minimizing the cross-entropy loss.
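The same identity can be checked numerically; a minimal sketch (again with names of our own choosing, and a small `eps` for numerical safety):

```python
import numpy as np

def categorical_nll(y_onehot, probs, eps=1e-12):
    """Negative log-likelihood of one-hot labels under predicted class
    probabilities; this is exactly the (summed) cross-entropy loss."""
    return -np.sum(y_onehot * np.log(probs + eps))

# Two examples, three classes; rows of `probs` sum to one.
y_onehot = np.array([[1, 0, 0],
                     [0, 0, 1]])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
print(categorical_nll(y_onehot, probs))  # = -(log 0.7 + log 0.6)
```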
The choice of likelihood links the abstract network parameters $w$ to the actual data $D$.
Having defined the prior $p(w)$ and the likelihood $p(D \mid w)$, we can combine them using Bayes' Theorem to obtain the posterior distribution over the weights:
$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}$$

The diagram below contrasts the standard and Bayesian approaches:
Comparison between a standard neural network learning point estimates for weights and a Bayesian neural network learning posterior distributions over weights.
The crucial challenge in BNNs lies in the computation of the posterior $p(w \mid D)$. For deep neural networks with potentially millions of parameters, the integral required for the marginal likelihood $p(D)$ is intractable. Furthermore, the posterior distribution itself is high-dimensional and complex, making direct calculation impossible.
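Concretely, the normalizing constant in Bayes' Theorem requires integrating over every possible weight configuration:

$$p(D) = \int p(D \mid w)\, p(w)\, dw$$

For a network with millions of weights, this is an integral over millions of dimensions, with no closed-form solution in general.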
Therefore, instead of calculating the exact posterior, we must rely on approximation methods. The subsequent sections in this chapter will explore two main families of techniques for tackling this challenge: Markov Chain Monte Carlo (MCMC) methods and Variational Inference (VI). These methods allow us to either draw samples from the posterior distribution or find an approximation to it, enabling us to leverage the Bayesian framework for deep learning tasks, particularly for robust uncertainty quantification.