Alright, we've established the goal: treat the weights w of a neural network not as fixed point estimates, but as random variables with a distribution. We define a prior p(w) and use the data D to find the posterior p(w∣D) via Bayes' theorem:
$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}$$

Conceptually straightforward. However, moving from this definition to a practical implementation for deep neural networks presents significant computational and analytical hurdles. Why is obtaining or even approximating this posterior distribution so difficult? Let's examine the primary challenges.
Modern deep neural networks contain vast numbers of parameters. Even moderately sized architectures hold millions of weights and biases, and the largest models hold billions. This high dimensionality of the parameter space w is the first major obstacle.
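To get a sense of the scale, here is a quick count of the parameters in a deliberately modest fully connected network. The architecture is purely illustrative and PyTorch is assumed:

```python
import torch.nn as nn

# A modest fully connected network for 784-dimensional inputs
# (e.g., flattened 28x28 images); the layer sizes are illustrative only.
model = nn.Sequential(
    nn.Linear(784, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 10),
)

# Every one of these scalars becomes a random variable in the Bayesian view.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # 5,824,522
```

Roughly 5.8 million scalars, and in the Bayesian view every one of them is a coordinate of the random variable w, so the posterior lives in a space of that many dimensions.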
Consider the denominator in Bayes' theorem, the evidence or marginal likelihood:
$$p(D) = \int p(D \mid w)\, p(w)\, dw$$

This is an integral over all possible configurations of the network's weights. In a space with millions or billions of dimensions, evaluating this integral analytically is impossible except for the most trivial cases. Numerical integration methods, such as grid-based quadrature, scale exponentially with the number of dimensions and become computationally infeasible almost immediately. Monte Carlo integration, which we'll explore later, offers a potential path forward, but the sheer scale of the space makes efficient sampling difficult.
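As a concrete illustration, the sketch below implements the simplest Monte Carlo estimator of the evidence for an invented one-parameter toy model: draw weights from the prior and average the likelihood. The data, noise level, and sample count are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-parameter "network": y = w * x + noise, with prior p(w) = N(0, 1).
noise_std = 0.3
x = rng.normal(size=20)
y = 1.5 * x + noise_std * rng.normal(size=20)

def log_likelihood(w):
    resid = y - w * x
    return (-0.5 * np.sum((resid / noise_std) ** 2)
            - len(y) * np.log(noise_std * np.sqrt(2.0 * np.pi)))

# Simple Monte Carlo: p(D) = ∫ p(D|w) p(w) dw ≈ (1/S) Σ_s p(D|w_s),  w_s ~ p(w)
S = 50_000
w_samples = rng.normal(loc=0.0, scale=1.0, size=S)   # draws from the prior
log_liks = np.array([log_likelihood(w) for w in w_samples])

# Average in log space (log-sum-exp) for numerical stability.
log_evidence = np.logaddexp.reduce(log_liks) - np.log(S)
print(f"estimated log p(D) ≈ {log_evidence:.2f}")
```

This works in one dimension because many prior samples land where the likelihood is appreciable. In a space with millions of dimensions, essentially none do, and the same estimator becomes hopelessly inefficient.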
Furthermore, the posterior distribution p(w∣D) itself lives in this high-dimensional space. Visualizing or characterizing such a distribution is non-trivial. It's likely to be highly complex, potentially multi-modal (having multiple distinct peaks corresponding to different plausible network configurations), and have intricate correlation structures between parameters.
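One concrete source of multi-modality is weight-space symmetry. The sketch below builds a small one-hidden-layer tanh network in NumPy (sizes and values chosen arbitrarily) and shows that flipping the signs of one hidden unit's incoming and outgoing weights leaves the function, and hence the likelihood and posterior density, unchanged at a different point in weight space:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(x, W1, b1, w2, b2):
    # One hidden layer of tanh units, scalar output.
    return np.tanh(x @ W1 + b1) @ w2 + b2

# Random weights for a tiny 2-input, 3-hidden-unit network (sizes illustrative).
W1 = rng.normal(size=(2, 3)); b1 = rng.normal(size=3)
w2 = rng.normal(size=3);      b2 = rng.normal()

x = rng.normal(size=(5, 2))
y_original = mlp(x, W1, b1, w2, b2)

# Negate everything attached to hidden unit 0. Since tanh(-a) = -tanh(a),
# the two sign flips cancel and the outputs are identical.
W1f, b1f, w2f = W1.copy(), b1.copy(), w2.copy()
W1f[:, 0] *= -1.0
b1f[0] *= -1.0
w2f[0] *= -1.0
y_flipped = mlp(x, W1f, b1f, w2f, b2)

print(np.allclose(y_original, y_flipped))  # True: same function, different weights
```

With H hidden units there are 2^H such sign-flipped copies, and H! more from permuting the units, so every peak of the posterior is replicated a combinatorial number of times before counting genuinely different solutions.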
As highlighted above, the evidence p(D) is the normalizing constant for the posterior distribution. Its intractability means we cannot compute the exact value of the posterior density p(w∣D) for any given w. We can often evaluate the numerator, p(D∣w)p(w), which is proportional to the posterior (sometimes called the "unnormalized posterior"), but without the denominator, we don't have the true probability density.
This inability to calculate the normalizing constant prevents direct sampling from the posterior and makes direct optimization or analysis challenging. Many inference techniques are designed specifically to work around this limitation, either by finding ways to sample from the distribution without knowing p(D) (like MCMC) or by approximating the posterior with a tractable distribution (like Variational Inference).
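To see why the unnormalized posterior is enough for MCMC, here is a minimal random-walk Metropolis sketch (plain NumPy, reusing the same invented one-parameter toy model as above, and not tuned for practical use). The acceptance test depends only on a ratio of posterior densities, so the unknown p(D) cancels out:

```python
import numpy as np

rng = np.random.default_rng(2)

# Same toy one-parameter model: y = w * x + noise, prior p(w) = N(0, 1).
noise_std = 0.3
x = rng.normal(size=20)
y = 1.5 * x + noise_std * rng.normal(size=20)

def log_unnormalized_posterior(w):
    log_prior = -0.5 * w ** 2                      # N(0, 1) prior, up to a constant
    log_lik = -0.5 * np.sum(((y - w * x) / noise_std) ** 2)
    return log_prior + log_lik                     # log p(D|w) + log p(w); no p(D) needed

# Random-walk Metropolis: propose w' = w + ε, accept with probability min(1, ratio).
w, step = 0.0, 0.5
samples = []
for _ in range(5_000):
    w_prop = w + step * rng.normal()
    log_ratio = log_unnormalized_posterior(w_prop) - log_unnormalized_posterior(w)
    if np.log(rng.uniform()) < log_ratio:          # p(D) cancels in this ratio
        w = w_prop
    samples.append(w)

burned = np.array(samples[1_000:])                 # discard burn-in
print(f"posterior mean ≈ {burned.mean():.2f}, sd ≈ {burned.std():.2f}")
```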
The likelihood function p(D∣w) for a typical dataset and neural network architecture is highly non-convex with respect to the weights w. This means the "surface" defined by the likelihood (and consequently, the posterior) in the high-dimensional weight space is rugged, containing numerous local optima, saddle points, and potentially large flat regions.
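Non-convexity is easy to exhibit even in the smallest possible case. In the constructed NumPy example below, the targets are generated from the model itself so the optimum is known exactly; two sign-flipped weight vectors both fit the data perfectly, yet the error rises strictly between them, which is impossible for a convex surface:

```python
import numpy as np

rng = np.random.default_rng(3)

def predict(x, w_in, w_out):
    # Smallest possible network: one tanh hidden unit, scalar input and output.
    return w_out * np.tanh(w_in * x)

def sum_squared_error(w, x, y):
    return np.sum((y - predict(x, w[0], w[1])) ** 2)

x = rng.normal(size=50)
w_a = np.array([2.0, 1.5])            # generate targets exactly from these weights
y = predict(x, w_a[0], w_a[1])

w_b = -w_a                            # sign-flipped twin: the identical function
w_mid = 0.5 * (w_a + w_b)             # midpoint of the segment is the zero vector

for name, w in [("w_a", w_a), ("w_b", w_b), ("midpoint", w_mid)]:
    print(name, sum_squared_error(w, x, y))
# w_a and w_b both give 0.0; the midpoint gives a strictly positive error,
# so the error (and hence the likelihood) cannot be convex along this line.
```

In a real network the same effect occurs along countless directions simultaneously, which is what produces the rugged, multi-peaked surface described above.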
Even approximate inference methods designed to handle the previous challenges remain computationally expensive, especially for large datasets and deep architectures. MCMC methods require many likelihood evaluations, each touching the full dataset, to produce useful samples, and variational approximations typically at least double the number of parameters that must be optimized.
These challenges underscore why simply applying standard Bayesian techniques directly to deep learning models isn't feasible. The scale of the models and the complexity of their objective functions necessitate specialized, advanced inference techniques. The following sections will introduce MCMC and VI methods specifically adapted to navigate these difficulties in the context of Bayesian Neural Networks.