Standard deep learning models provide point estimates for predictions, offering no inherent measure of their confidence. This can be problematic in applications where understanding the reliability of a prediction is important. Bayesian Neural Networks (BNNs), as introduced earlier in this chapter, address this by learning a distribution over the model parameters (weights and biases, w) rather than single point estimates. This posterior distribution, p(w∣D), where D represents the training data, is the foundation for quantifying uncertainty in BNN predictions.
Understanding and quantifying uncertainty allows us to build more trustworthy and informative models. In the context of BNNs, uncertainty isn't a single concept but typically decomposes into two primary types:
Aleatoric uncertainty, sometimes called data uncertainty, captures the inherent noise or randomness in the data generating process itself. It represents uncertainty that cannot be reduced even if we had infinite data, because it stems from irreducible variability in the observations.
Consider a regression task where multiple different output values y are possible for the exact same input features x. This variability is aleatoric. In BNNs, aleatoric uncertainty is typically modeled via the likelihood function, p(y∣x,w). For example, using a Gaussian likelihood for regression:
$$p(y \mid x, f_w(x)) = \mathcal{N}\left(y \mid f_w(x), \sigma^2\right)$$

Here, f_w(x) is the output of the neural network with parameters w, and σ² represents the observation noise variance.
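As a concrete illustration, the PyTorch sketch below implements this likelihood for regression: the network predicts the mean f_w(x), a single learnable parameter stores log σ², and minimizing the Gaussian negative log-likelihood fits both. The class and function names are illustrative, not part of any particular library.

```python
import torch
import torch.nn as nn

class GaussianRegressionNet(nn.Module):
    """Illustrative regression network for the likelihood above: it predicts the mean
    f_w(x) and learns a single observation-noise parameter log(sigma^2)."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.log_var = nn.Parameter(torch.zeros(1))  # log sigma^2, learned with the weights

    def forward(self, x):
        return self.body(x)  # f_w(x)

def gaussian_nll(mean, log_var, y):
    """Gaussian negative log-likelihood (additive constant dropped); minimizing it fits
    both the predictive mean and the aleatoric noise level sigma^2."""
    return 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).mean()
```

The noise variance here is a single constant, matching the equation above; letting the network predict an input-dependent variance is a common generalization when the noise level varies across the input space.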
Epistemic uncertainty, also known as model uncertainty or knowledge uncertainty, reflects our ignorance about the true model parameters. It captures the uncertainty arising from having limited data to constrain the posterior distribution p(w∣D). As we gather more relevant data, epistemic uncertainty should decrease because the posterior distribution becomes more concentrated around the optimal parameter values.
In BNNs, epistemic uncertainty arises directly from the fact that we have a distribution over weights p(w∣D) instead of a single set of weights. Different plausible weight configurations (sampled from the posterior) will produce different predictions for the same input x. The variation in these predictions reflects the model's uncertainty about the function it has learned.
To obtain a prediction for a new input x∗, we are interested in the predictive distribution p(y∗∣x∗,D). This involves marginalizing out the model parameters:
$$p(y^* \mid x^*, D) = \int p(y^* \mid x^*, w)\, p(w \mid D)\, dw$$

This integral averages the predictions p(y∗∣x∗,w) over all possible parameter values, weighted by their posterior probability p(w∣D). The variance of this predictive distribution captures the total uncertainty, encompassing both aleatoric and epistemic sources.
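Assuming the Gaussian likelihood above (writing σ_w²(x∗) for the noise variance predicted with parameters w, which reduces to σ² when the noise is fixed), the law of total variance separates the two contributions:

$$\operatorname{Var}[y^* \mid x^*, D] = \underbrace{\mathbb{E}_{p(w \mid D)}\!\left[\sigma_w^2(x^*)\right]}_{\text{aleatoric}} \;+\; \underbrace{\operatorname{Var}_{p(w \mid D)}\!\left[f_w(x^*)\right]}_{\text{epistemic}}$$

The first term averages the predicted observation noise over the posterior, while the second measures how much the predictive mean moves as the parameters vary, which is exactly the epistemic contribution described above.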
Since the posterior p(w∣D) is generally intractable, we rely on the approximation methods discussed previously:
MCMC Methods: Techniques like Stochastic Gradient HMC provide samples w^(1), w^(2), ..., w^(S) from the posterior p(w∣D). We can approximate the predictive distribution by generating predictions for each sample and forming an empirical distribution:
$$p(y^* \mid x^*, D) \approx \frac{1}{S} \sum_{s=1}^{S} p(y^* \mid x^*, w^{(s)})$$

For regression with a Gaussian likelihood, this means obtaining S predicted means f_{w^(s)}(x∗) and potentially variances σ²_{w^(s)}(x∗). The mean of the predictive distribution can be estimated by (1/S) Σ_s f_{w^(s)}(x∗), and the variance (total uncertainty) can be estimated from the variance of the sampled means {f_{w^(s)}(x∗)} for s = 1, ..., S plus the average predicted aleatoric variance.
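A minimal sketch of this computation, assuming the per-sample predicted means and aleatoric variances for a single test input have already been collected into NumPy arrays (the names below are placeholders):

```python
import numpy as np

# Placeholder arrays for one test input x*: means[s] = f_{w^(s)}(x*) and
# alea_vars[s] = sigma^2_{w^(s)}(x*), computed from S posterior samples (e.g. SG-HMC draws).
def summarize_predictions(means: np.ndarray, alea_vars: np.ndarray):
    pred_mean = means.mean()            # (1/S) * sum_s f_{w^(s)}(x*)
    epistemic = means.var()             # spread of the sampled means
    aleatoric = alea_vars.mean()        # average predicted observation noise
    total_var = epistemic + aleatoric   # total predictive uncertainty
    return pred_mean, total_var, epistemic, aleatoric
```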
Variational Inference (VI): VI methods like Bayes by Backprop learn an approximate posterior q(w;ϕ). The predictive distribution is approximated as:
$$p(y^* \mid x^*, D) \approx \int p(y^* \mid x^*, w)\, q(w; \phi)\, dw$$

This integral is itself approximated using Monte Carlo sampling: draw samples w^(s) ∼ q(w;ϕ) and average the predictions p(y∗∣x∗,w^(s)), just as in the MCMC approach.
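A sketch of this Monte Carlo average, assuming a mean-field Gaussian q(w;ϕ) parameterized by a mean vector mu and a rho vector (with standard deviation softplus(rho)), and a hypothetical helper forward_with_weights that evaluates the network with a given flattened weight vector:

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: mu and rho are learned variational parameters for a flattened weight
# vector, and forward_with_weights(x, w) evaluates the network using those weights.
def vi_predict(x_star, mu, rho, forward_with_weights, num_samples: int = 100):
    preds = []
    for _ in range(num_samples):
        eps = torch.randn_like(mu)
        w = mu + F.softplus(rho) * eps   # reparameterized draw w^(s) ~ q(w; phi)
        preds.append(forward_with_weights(x_star, w))
    preds = torch.stack(preds)           # shape: (num_samples, ...)
    return preds.mean(dim=0), preds.var(dim=0)
```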
Monte Carlo Dropout: A computationally cheaper, widely used technique involves training a standard neural network with dropout applied before weight layers. At test time, dropout is kept active, and multiple forward passes (S of them) are performed for the same input x∗. Each forward pass yields a different output because dropout randomly masks units. The resulting set of outputs {f_{w^(s)}(x∗)} for s = 1, ..., S (where w^(s) denotes the effective network configuration in the s-th pass) can be treated as samples from an approximate predictive distribution. This procedure can be interpreted as approximate Bayesian inference in a deep Gaussian process.
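One common way to implement this in PyTorch is to keep only the dropout modules in training mode at prediction time and stack the outputs of repeated forward passes; the helper below is a sketch along those lines:

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x_star: torch.Tensor, num_passes: int = 50):
    """Perform repeated stochastic forward passes with dropout kept active at test time."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()  # re-enable dropout only; other layers stay in eval mode
    with torch.no_grad():
        preds = torch.stack([model(x_star) for _ in range(num_passes)])
    return preds.mean(dim=0), preds.var(dim=0)
```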
The spread (e.g., variance or interquartile range) of the approximated predictive distribution gives us a measure of the model's confidence. High variance suggests low confidence.
Knowing the source of uncertainty is valuable. High epistemic uncertainty suggests the model is unsure due to lack of data in that region of the input space, indicating where collecting more data might be beneficial. High aleatoric uncertainty suggests inherent limits to predictability based on the given features and noise level.
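For instance, given per-input epistemic variances (such as those returned by the summary sketch above), one might rank inputs to decide where additional data would help most; the helper below is purely illustrative:

```python
import numpy as np

def flag_for_data_collection(epistemic_vars: np.ndarray, top_k: int = 10):
    """Return indices of the inputs with the highest epistemic variance; these are natural
    candidates for labeling or additional data collection."""
    return np.argsort(epistemic_vars)[-top_k:][::-1]
```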
Example of a BNN's predictive mean and total uncertainty (shaded region, e.g., Mean +/- 2 Std Dev) for a 1D regression problem. Uncertainty is lower near training data points and higher in regions far from the data, reflecting increased epistemic uncertainty.
Reliable uncertainty estimates from BNNs are beneficial in many settings, for example when deciding where to collect additional training data or when flagging low-confidence predictions for closer review.
By providing a principled way to represent and quantify what the model doesn't know, BNNs offer a significant advantage over standard deep learning models, enabling more robust and trustworthy AI systems.