As introduced, Bayes' Theorem provides a principled way to update our beliefs (prior P(θ)) in light of new evidence (data D) to form updated beliefs (posterior P(θ∣D)):
P(θ∣D) = P(D∣θ) P(θ) / P(D)
While elegant, the theorem runs into significant practical and conceptual hurdles when the parameter space θ is high-dimensional. This is the norm in modern machine learning, where models may have thousands, millions, or even billions of parameters (e.g., deep neural networks, complex graphical models). Let's examine how high dimensionality affects each component of the theorem and the resulting posterior.
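To ground the notation before the difficulties appear, here is a minimal sketch of the update in a setting where it is fully tractable: a one-parameter Beta-Bernoulli model, where conjugacy gives the posterior in closed form. The specific prior, simulated data, and seed below are illustrative assumptions, not taken from the text.

```python
import numpy as np
from scipy import stats

# A one-parameter example where Bayes' Theorem is fully tractable:
# a Beta prior on a coin's bias theta with Bernoulli observations.
rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=50)   # D: 50 simulated coin flips (hypothetical)

alpha_prior, beta_prior = 2.0, 2.0     # prior P(theta) = Beta(2, 2)
heads = data.sum()
tails = len(data) - heads

# Conjugacy gives the posterior in closed form -- no integral over theta needed.
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails
posterior = stats.beta(alpha_post, beta_post)

print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```

The rest of this section is about why nothing this convenient is usually available once θ has thousands or millions of dimensions.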
The Curse of Dimensionality in Probability Space
The term "curse of dimensionality" refers to various phenomena that arise when analyzing data in high-dimensional spaces. In the context of Bayesian inference:
- Volume Concentration: High-dimensional spaces behave counter-intuitively. Most of the volume of a high-dimensional hypersphere is concentrated near its surface, not near its center. Similarly, the mass of a high-dimensional Gaussian distribution is concentrated in a thin "shell" away from the mean. This means random samples drawn from a high-dimensional distribution are unlikely to be near the mode (the numerical sketch after this list puts rough numbers on this).
- Sparsity: As dimensionality increases, the available data become comparatively sparse. To maintain the same density of data points as dimensionality grows, the amount of data required increases exponentially. In high-dimensional parameter spaces, our data D might only provide information about a small subspace, leaving many parameter dimensions poorly constrained by the likelihood.
- Computational Complexity: Operations like integration and optimization become computationally demanding as the number of dimensions increases.
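The short sketch below puts rough numbers on the volume-concentration and computational-complexity points above; the "within 90% of the radius" cutoff and the 20 quadrature nodes per dimension are arbitrary choices made only for illustration.

```python
# Two back-of-the-envelope illustrations of the curse of dimensionality.
# The 90% cutoff and 20 nodes per dimension are arbitrary illustrative choices.
for d in [2, 10, 50, 100]:
    # 1) Volume concentration: the fraction of a unit hypersphere's volume
    #    lying within 90% of its radius is 0.9**d, so almost all of the
    #    volume ends up in the thin outer shell as d grows.
    inner_fraction = 0.9 ** d

    # 2) Computational complexity: a quadrature grid with 20 nodes per
    #    dimension needs 20**d evaluations of the integrand.
    grid_points = 20.0 ** d

    print(f"d={d:4d}  fraction within 90% of radius={inner_fraction:.3e}  "
          f"grid points={grid_points:.3e}")
```

By d = 100, essentially all of the ball's volume sits in the outer shell, and a naive quadrature grid would require on the order of 10^130 evaluations.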
Impact on the Prior P(θ)
Specifying a meaningful prior P(θ) becomes considerably harder in high dimensions.
- Intuition Fails: Our geometric intuition, largely based on 2D and 3D, breaks down. A prior that seems reasonable in low dimensions (like a uniform prior over a hypercube) can have unintended consequences in high dimensions, placing most of its probability mass in the corners (a short calculation after this list makes this concrete).
- Specification Difficulty: Choosing truly "uninformative" or "objective" priors is challenging. Standard choices like broad Gaussians might still inadvertently encode strong assumptions when extended to many dimensions. Hierarchical priors, where prior parameters (hyperparameters) are themselves given distributions, become more important for borrowing strength across dimensions but add layers of complexity.
- Prior Sensitivity: The posterior can become more sensitive to prior choices in directions where the likelihood provides little information, a common occurrence in high dimensions.
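As a concrete instance of the hypercube point above, the following sketch computes, analytically and in log space, how much mass a uniform prior on [-1, 1]^d assigns to the inscribed unit ball. The particular cube and ball are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

# Fraction of a uniform prior over [-1, 1]^d that lies inside the inscribed
# unit ball: V_ball(d) / 2^d, with V_ball(d) = pi^(d/2) / Gamma(d/2 + 1).
# Computed in log space for numerical stability.
def fraction_inside_ball(d):
    log_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    log_cube = d * np.log(2.0)
    return np.exp(log_ball - log_cube)

for d in [2, 5, 10, 20, 50]:
    print(f"d={d:3d}  P(theta inside inscribed ball) = {fraction_inside_ball(d):.3e}")
```

By d = 50 the inscribed ball holds roughly 10^-28 of the prior mass; essentially all of the "uniform" prior lives in the corners of the cube.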
Impact on the Likelihood P(D∣θ)
The likelihood function P(D∣θ), viewed as a function of θ for fixed data D, also behaves differently.
- Complex Geometry: High-dimensional likelihood surfaces can be highly multi-modal or exhibit complex ridge structures, making exploration difficult. Even finding the maximum likelihood estimate (MLE) can be challenging (a toy example of multi-modality follows this list).
- Computational Cost: Evaluating the likelihood for complex models (like deep networks or large graphical models) given high-dimensional θ can be computationally expensive for each step of an inference algorithm.
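As a toy illustration of multi-modality, consider the likelihood of a two-component Gaussian mixture as a function of its component means: it is invariant to swapping the labels, so any off-diagonal optimum appears at least twice. The mixture, data, and settings below are assumed for illustration, not a model from the text.

```python
import numpy as np
from scipy.stats import norm

# Label-switching symmetry in a two-component Gaussian mixture likelihood.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

def log_likelihood(mu1, mu2, x=data):
    # Equal-weight mixture with unit variances; only the means are unknown.
    comp = 0.5 * norm.pdf(x, mu1, 1) + 0.5 * norm.pdf(x, mu2, 1)
    return np.log(comp).sum()

print(log_likelihood(-2.0, 3.0))   # near the data-generating configuration
print(log_likelihood(3.0, -2.0))   # identical value: the "label-switched" mode
```

Real high-dimensional models typically have far richer mode structure than this symmetry alone, but even the toy case defeats any method that assumes a single well-behaved peak.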
Impact on the Posterior P(θ∣D)
The combination of high-dimensional priors and likelihoods leads to complex posteriors.
- Geometric Complexity: The posterior distribution P(θ∣D) inherits the complexities of the prior and likelihood. It often concentrates in a tiny fraction of the parameter space, potentially in regions that are difficult to find or characterize. It might exhibit strong correlations between parameters, making simple summaries misleading.
- The Intractable Normalizer: The evidence, P(D)=∫P(D∣θ)P(θ)dθ, is the crux of the computational challenge. This integral runs over the entire high-dimensional parameter space θ. Except for conjugate prior-likelihood pairs (which are rare in complex models), it is analytically intractable, and standard numerical integration techniques like quadrature are infeasible because their cost scales exponentially with dimension.
Consider a simple visualization of how probability mass concentrates differently in high dimensions. Imagine sampling from a standard multivariate Gaussian distribution N(0,I) in d dimensions.
[Figure: Illustration of probability mass concentration. In low dimensions, samples from a Gaussian are often near the mode (center); in high dimensions, most of the probability mass concentrates in a thin shell far from the mode.]
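A few lines of code reproduce the effect described in the figure; the sample sizes and dimensions below are arbitrary illustrative choices.

```python
import numpy as np

# Empirical check of the "thin shell" effect: distances of samples from
# N(0, I) to the mode concentrate around sqrt(d) as d grows.
rng = np.random.default_rng(42)
for d in [1, 10, 100, 1000]:
    samples = rng.standard_normal((10_000, d))
    norms = np.linalg.norm(samples, axis=1)
    print(f"d={d:5d}  mean distance={norms.mean():7.2f}  "
          f"std={norms.std():.2f}  sqrt(d)={np.sqrt(d):7.2f}")
```

The mean distance from the mode grows like sqrt(d) while its spread stays roughly constant, so in relative terms the samples concentrate in an ever thinner shell, and the mode itself is essentially never visited.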
- Approximation Necessity: Because calculating the exact posterior P(θ∣D) is usually impossible due to the intractable normalization constant P(D), we must resort to approximation methods. This is the primary motivation for the advanced inference techniques covered in Chapters 2 (Markov Chain Monte Carlo) and 3 (Variational Inference). These methods bypass the direct calculation of P(D).
Understanding these high-dimensional characteristics is not just a theoretical exercise. It directly informs how we build models, choose priors, select inference algorithms, and interpret the results in practical, large-scale machine learning applications. The apparent simplicity of Bayes' Theorem hides significant depth when applied to the complex, high-dimensional problems typical in AI.