Information theory offers a powerful lens through which to view and quantify aspects of Bayesian inference. At its core, Bayesian learning is about updating our beliefs (represented by probability distributions) in light of new data. Information theory provides the tools to measure the uncertainty inherent in these beliefs and the "distance" or divergence between different belief states. Two fundamental concepts here are entropy and Kullback-Leibler (KL) divergence.
Shannon entropy measures the average level of "information", "surprise", or "uncertainty" inherent in a random variable's possible outcomes. For a discrete random variable X with probability mass function p(x), the entropy H(X) is defined as:
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

For a continuous random variable X with probability density function p(x), the differential entropy is:
$$H(X) = -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx$$

The base of the logarithm determines the units (base 2 gives bits, base e gives nats). Higher entropy implies greater uncertainty about the outcome of X. A distribution sharply peaked around a single value has low entropy, while a uniform distribution over a wide range has high entropy.
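To make these definitions concrete, here is a minimal sketch computing the entropy of two coin-flip distributions with SciPy. The probabilities are illustrative choices, and base 2 is used so the results are in bits.

```python
import numpy as np
from scipy.stats import entropy

# Entropy of a fair coin vs. a heavily biased coin (base 2 gives bits).
fair = np.array([0.5, 0.5])
biased = np.array([0.95, 0.05])  # illustrative probabilities

print(entropy(fair, base=2))    # 1.0 bit: maximal uncertainty for two outcomes
print(entropy(biased, base=2))  # ~0.29 bits: the outcome is largely predictable

# Equivalent manual computation of H(X) = -sum_x p(x) log2 p(x)
print(-np.sum(biased * np.log2(biased)))
```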
In Bayesian modeling, entropy quantifies the uncertainty encoded in our distributions: the entropy of the prior P(θ) reflects how uncertain we are about the parameters before seeing data, while the entropy of the posterior P(θ∣D) reflects the uncertainty that remains afterward. A large drop from prior to posterior entropy indicates that the data were highly informative about the parameters.
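The sketch below illustrates this prior-to-posterior entropy reduction with a Beta-Binomial model, assuming a uniform Beta(1, 1) prior and a small illustrative dataset of 7 successes and 3 failures; SciPy's entropy method reports differential entropy in nats.

```python
from scipy.stats import beta

# Conjugate Beta-Binomial update (prior and data counts are illustrative).
a0, b0 = 1.0, 1.0            # Beta(1, 1) = uniform prior on the success probability
successes, failures = 7, 3

prior = beta(a0, b0)
posterior = beta(a0 + successes, b0 + failures)  # conjugate posterior Beta(8, 4)

print("prior entropy (nats):    ", prior.entropy())      # 0.0 for the uniform prior
print("posterior entropy (nats):", posterior.entropy())  # negative: density is concentrated

# The drop in entropy reflects how much the data sharpened our beliefs.
```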
While entropy measures the uncertainty of a single distribution, KL divergence quantifies how one probability distribution P differs from a second, reference probability distribution Q. It's often interpreted as the information lost when Q is used to approximate P, or the relative entropy of P with respect to Q.
For discrete distributions P(x) and Q(x):
$$D_{\mathrm{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$$

For continuous distributions p(x) and q(x):
$$D_{\mathrm{KL}}(p \parallel q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$

Key properties of KL divergence:

- It is non-negative: DKL(P∣∣Q) ≥ 0, with equality if and only if P and Q are identical (almost everywhere).
- It is not symmetric: in general, DKL(P∣∣Q) ≠ DKL(Q∣∣P), so it is not a true distance metric.
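As a quick numerical check of these formulas (and of the asymmetry discussed next), the following sketch evaluates both directions of the KL divergence for two illustrative discrete distributions using scipy.special.rel_entr.

```python
import numpy as np
from scipy.special import rel_entr

# Two discrete distributions over the same three outcomes (illustrative values).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

# D_KL(P || Q) = sum_x p(x) log(p(x) / q(x)); rel_entr gives the elementwise terms.
kl_pq = rel_entr(p, q).sum()
kl_qp = rel_entr(q, p).sum()

print(f"D_KL(P || Q) = {kl_pq:.4f} nats")
print(f"D_KL(Q || P) = {kl_qp:.4f} nats")  # a different value: KL is asymmetric
```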
The asymmetry has important implications. Minimizing DKL(P∣∣Q) encourages Q to be non-zero wherever P is non-zero (it tries to cover P). Minimizing DKL(Q∣∣P) encourages Q to be zero wherever P is zero (it tries to be contained within P).
Figure: KL divergence DKL(P∣∣Q) quantifies the difference between distribution P (blue) and distribution Q (pink). It measures the inefficiency of using Q when the true distribution is P. Note the asymmetry: DKL(P∣∣Q) would yield a different value than DKL(Q∣∣P).
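This mass-covering versus mode-seeking behavior can be demonstrated numerically. The sketch below approximates a bimodal Gaussian mixture with a single Gaussian on a grid, minimizing each KL direction by brute-force search; the mixture, grid ranges, and step sizes are illustrative choices, not part of the text.

```python
import numpy as np
from scipy.stats import norm

# Compare forward and reverse KL when approximating a bimodal target p
# with a single Gaussian q. All numeric choices are illustrative.
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

# Target: equal-weight mixture of N(-3, 1) and N(3, 1).
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)

def kl(a, b):
    """Grid approximation of D_KL(a || b); a small floor avoids log(0)."""
    eps = 1e-300
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

best_fwd = best_rev = None
for mu in np.linspace(-4, 4, 81):          # candidate means, step 0.1
    for sigma in np.linspace(0.5, 5, 46):  # candidate std devs, step 0.1
        q = norm.pdf(x, mu, sigma)
        fwd, rev = kl(p, q), kl(q, p)      # D_KL(P||Q) vs D_KL(Q||P)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("argmin D_KL(P||Q): mu=%.1f, sigma=%.1f" % best_fwd[1:])  # wide, covers both modes
print("argmin D_KL(Q||P): mu=%.1f, sigma=%.1f" % best_rev[1:])  # narrow, sits on one mode
```

The forward-KL optimum spreads out to cover both modes, while the reverse-KL optimum commits to a single mode, matching the behavior described above.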
The connection between information theory and Bayesian methods becomes particularly apparent when dealing with the computational challenges mentioned earlier, especially intractable posterior distributions P(θ∣D).
Variational Inference (VI): This is a major application area for KL divergence. VI reframes Bayesian inference as an optimization problem. We seek an approximation Q(θ) from a tractable family of distributions (e.g., Gaussians) that is "closest" to the true, often intractable, posterior P(θ∣D). "Closest" is typically measured using KL divergence. Specifically, VI aims to minimize DKL(Q(θ)∣∣P(θ∣D)). Directly minimizing this is still hard because it involves the unknown posterior. However, minimizing this KL divergence is equivalent to maximizing a quantity called the Evidence Lower Bound (ELBO):
$$\mathrm{ELBO}(Q) = \mathbb{E}_{Q}[\log P(D, \theta)] - \mathbb{E}_{Q}[\log Q(\theta)]$$

Maximizing the ELBO pushes Q(θ) to be close to the true posterior P(θ∣D) in the KL sense. We will explore this relationship in detail when we cover Variational Inference techniques in Chapter 3.
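The following sketch estimates the ELBO by Monte Carlo for a conjugate Gaussian model, chosen so the exact posterior and log evidence are available for comparison. The synthetic data, prior, candidate Q, and sample counts are illustrative assumptions, not prescribed by the text.

```python
import numpy as np
from scipy.stats import norm

# Conjugate model: theta ~ N(0, 1) prior, x_i | theta ~ N(theta, 1) likelihood.
rng = np.random.default_rng(0)
data = rng.normal(1.5, 1.0, size=20)  # synthetic, illustrative data
n = data.size

# Exact posterior N(post_mean, post_var) for this conjugate model.
post_var = 1.0 / (1.0 + n)
post_mean = post_var * data.sum()

# Exact log evidence via the identity log P(D) = log P(D|t) + log P(t) - log P(t|D),
# evaluated at t = 0 (any value of t works).
log_evidence = (norm.logpdf(data, 0.0, 1.0).sum()
                + norm.logpdf(0.0, 0.0, 1.0)
                - norm.logpdf(0.0, post_mean, np.sqrt(post_var)))

def elbo(m, s, num_samples=50_000):
    """Monte Carlo estimate of E_Q[log P(D, theta)] - E_Q[log Q(theta)] for Q = N(m, s^2)."""
    theta = rng.normal(m, s, size=num_samples)
    log_joint = (norm.logpdf(data[:, None], theta, 1.0).sum(axis=0)  # log P(D | theta)
                 + norm.logpdf(theta, 0.0, 1.0))                     # + log P(theta)
    log_q = norm.logpdf(theta, m, s)
    return np.mean(log_joint - log_q)

print("log evidence           :", log_evidence)
print("ELBO at exact posterior:", elbo(post_mean, np.sqrt(post_var)))  # ~ log evidence
print("ELBO at a mismatched Q :", elbo(0.0, 2.0))                      # strictly lower
```

Up to Monte Carlo error, the ELBO evaluated at the exact posterior matches the log evidence, while the mismatched Q falls short by exactly DKL(Q(θ)∣∣P(θ∣D)).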
Model Comparison and Selection: While formal Bayesian model comparison often relies on the marginal likelihood P(D) or Bayes factors, information criteria like AIC (Akaike Information Criterion) and DIC (Deviance Information Criterion) have connections to KL divergence. They provide ways to estimate a model's expected out-of-sample predictive accuracy, implicitly considering a balance between model fit and complexity, concepts related to information gain and distribution divergence.
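As a small illustration of that fit-versus-complexity balance, the sketch below computes AIC = 2k − 2 log L̂ for polynomial regressions of increasing degree on synthetic, truly linear data; the data and model choices are illustrative only.

```python
import numpy as np

# AIC = 2k - 2 * (maximized log-likelihood), sketched for polynomial regression.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = 2.0 * x + rng.normal(0.0, 0.3, size=x.size)  # truly linear relationship

def aic(degree):
    """Fit a polynomial of the given degree by least squares and return its AIC."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = resid.var()                                 # MLE of the noise variance
    n = x.size
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = degree + 2                                       # coefficients + noise variance
    return 2 * k - 2 * log_lik

for d in (1, 3, 5):
    print(f"degree {d}: AIC = {aic(d):.2f}")  # the linear model typically scores lowest
```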
In summary, entropy provides a way to quantify the uncertainty in our Bayesian models' priors and posteriors. KL divergence provides a fundamental tool for comparing probability distributions, forming the mathematical bedrock for approximation techniques like Variational Inference, which are essential for applying Bayesian methods to complex, high-dimensional problems where exact computation of the posterior is infeasible. Understanding these information-theoretic measures is therefore not just a theoretical exercise; it's foundational for implementing and interpreting many advanced Bayesian techniques.