As we established, the prior distribution P(θ) in Bayes' theorem represents our beliefs about the parameters before observing the data D. Choosing this prior is a fundamental step in Bayesian modeling, and the approach can range from encoding specific domain knowledge to attempting neutrality. This choice influences the resulting posterior P(θ∣D), sometimes subtly, sometimes substantially, especially when data is limited or dimensions are high. Let's examine the different philosophies guiding prior selection.
A subjective prior incorporates specific, pre-existing information or beliefs about the parameter θ. This information might come from previous experiments, expert opinion, or physical constraints of the system being modeled.
For instance, if modeling the probability ϕ of a coin landing heads, prior knowledge might suggest the coin is likely fair. A subjective prior could be a Beta distribution centered around 0.5 with a relatively small variance, like Beta(10, 10). This prior assigns higher probability density near ϕ=0.5 and less density towards extreme values like 0 or 1.
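As a minimal sketch of this example (assuming SciPy is available; the flip counts are hypothetical), the Beta prior is conjugate to the Bernoulli likelihood, so the posterior is available in closed form:

```python
from scipy import stats

# Subjective prior: the coin is probably close to fair.
prior = stats.beta(10, 10)

# Hypothetical data: 60 heads out of 100 flips.
heads, tails = 60, 40

# Beta is conjugate to the Bernoulli likelihood, so the posterior
# is simply Beta(10 + heads, 10 + tails).
posterior = stats.beta(10 + heads, 10 + tails)

print(f"Prior mean:     {prior.mean():.3f}")
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```

Note how the posterior mean lands between the prior's center (0.5) and the observed frequency (0.6), with the data dominating because 100 flips outweigh the 20 "pseudo-observations" encoded by Beta(10, 10).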
Advantages: A subjective prior puts genuine domain knowledge to work, which can sharpen the posterior considerably when data are scarce, and it makes the analyst's assumptions explicit and open to critique.
Disadvantages: The conclusions depend on beliefs that others may not share, a poorly calibrated prior can bias the posterior (especially with small samples), and eliciting a defensible prior from experts can be difficult.
In advanced settings, subjective priors are often constructed hierarchically, allowing parameters to inform each other, which we'll touch upon later. The key is transparency: state your prior and the reasoning behind it.
Objective priors, sometimes called non-informative or reference priors, attempt to minimize the prior's influence on the posterior distribution. The goal is to "let the data speak for itself" as much as possible. This doesn't mean the prior has no influence, but rather that its influence is minimized according to some formal criterion.
Common approaches include:
Uniform Priors: Assigning equal probability density across the possible range of the parameter. For a parameter constrained between a and b, P(θ)∝1 for θ∈[a,b]. However, uniformity is not preserved under non-linear parameter transformations (e.g., a uniform prior on standard deviation σ does not imply a uniform prior on variance σ²). Furthermore, assigning a uniform prior over an unbounded range (e.g., (−∞,∞) for a regression coefficient) results in an improper prior (it doesn't integrate to 1). While improper priors can sometimes lead to proper posteriors, they require careful handling.
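A quick simulation (illustrative, using NumPy; the range (0, 10) is arbitrary) makes the transformation issue concrete: drawing σ uniformly and squaring it does not produce a uniform distribution over σ².

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform prior on the standard deviation sigma over (0, 10).
sigma = rng.uniform(0.0, 10.0, size=100_000)
variance = sigma ** 2  # implied prior on the variance

# Under a uniform prior on the variance over (0, 100), about half the
# mass would fall below 50. Here far more does, because squaring
# compresses small sigma values into the lower end of the variance scale.
print(f"P(variance < 50) ≈ {np.mean(variance < 50.0):.2f}")  # ≈ 0.71, not 0.50
```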
Jeffreys Priors: Derived from the Fisher information I(θ), they are defined by P(θ) ∝ √det I(θ). The Fisher information measures how much information the data provide about the parameter. Jeffreys priors have the desirable property of being invariant under parameter transformations. For example, the Jeffreys prior for the success probability p of a Bernoulli trial is Beta(1/2, 1/2), which is non-uniform. For a location parameter (like the mean μ of a Normal distribution with known variance), it is uniform. For a scale parameter (like the standard deviation σ of a Normal distribution with known mean), P(σ) ∝ 1/σ.
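To make the Bernoulli case concrete: the Fisher information is I(p) = 1/(p(1−p)), so √I(p) ∝ p^(−1/2)(1−p)^(−1/2), which is the Beta(1/2, 1/2) density up to normalization. A small numerical check (assuming SciPy) confirms the two agree up to a constant factor:

```python
import numpy as np
from scipy import stats

p = np.linspace(0.01, 0.99, 5)

# Jeffreys prior for a Bernoulli probability: square root of the Fisher information.
fisher_info = 1.0 / (p * (1.0 - p))
jeffreys_unnorm = np.sqrt(fisher_info)

# Beta(1/2, 1/2) density, which should match up to a normalizing constant.
beta_half = stats.beta(0.5, 0.5).pdf(p)

print(jeffreys_unnorm / beta_half)  # constant ratio (= pi) at every point
```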
Reference Priors: An information-theoretic approach aiming to maximize the expected Kullback-Leibler divergence between the prior and the posterior. This seeks a prior that allows the data to provide the maximum possible information gain. Reference priors often coincide with Jeffreys priors for single-parameter models but can differ in multi-parameter situations, sometimes addressing issues where Jeffreys priors yield undesirable behavior.
Advantages: Objective priors provide defensible defaults when little prior knowledge is available, reduce the appearance of analyst bias, and (in the case of Jeffreys and reference priors) behave consistently under reparameterization.
Disadvantages: They are often improper and can, in some models, lead to improper posteriors; "non-informative" never means assumption-free; and they can be difficult to derive, or behave poorly, in high-dimensional or hierarchical models.
In practice, neither purely subjective nor purely objective priors are always ideal. Weakly informative priors (WIPs) offer a pragmatic middle ground. They are proper distributions (integrating to 1) but are intentionally chosen to be less influential than strong subjective priors. They provide gentle regularization, helping to stabilize computation and prevent the posterior from taking on unreasonable values, while still allowing the likelihood to dominate when the data is informative.
Think of WIPs as providing guardrails. For example, instead of a flat prior on a regression coefficient β, which allows arbitrarily large values, we might use a Normal distribution centered at 0 with a relatively large standard deviation, like N(0, 10²). Or perhaps a Student's t-distribution with few degrees of freedom (e.g., ν=3) and a moderate scale, which has heavier tails than the Normal, allowing for larger deviations from zero while still providing regularization.
Normal distributions centered at zero with increasing variance (1, 9, 100). The blue curve represents a more informative prior, concentrating belief near zero. The green curve is weakly informative, allowing a wider range of values. The gray curve spreads probability much more thinly, approaching a non-informative (locally uniform) stance, though it remains a proper distribution.
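As an illustrative sketch of how such guardrail priors are written down in a model (assuming the PyMC library; the synthetic data and scale choices below are placeholders, not recommendations):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=1.0, size=100)  # synthetic regression data

with pm.Model() as model:
    # Weakly informative priors: proper, centered at zero, wide enough
    # that the likelihood dominates when the data are informative.
    beta = pm.Normal("beta", mu=0.0, sigma=10.0)
    # A heavier-tailed alternative would be:
    # beta = pm.StudentT("beta", nu=3, mu=0.0, sigma=2.5)
    sigma = pm.HalfNormal("sigma", sigma=5.0)

    pm.Normal("y_obs", mu=beta * x, sigma=sigma, observed=y)

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=42)

print(idata.posterior["beta"].mean().item())  # close to the true slope of 2.0
```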
Advantages: WIPs are proper, computationally friendly defaults: they regularize estimates, keep samplers away from implausible regions of parameter space, and still let the likelihood dominate when the data are informative.
Disadvantages: The choice of scale remains a judgment call, "weak" is only meaningful relative to the scale of the data (so predictors and outcomes should be put on sensible scales), and the prior can still noticeably affect the posterior when data are sparse.
Because the choice of prior can influence the posterior, it's good practice to perform a prior sensitivity analysis. This involves fitting the model using several different, plausible priors (e.g., slightly different subjective priors, or varying the scale of weakly informative priors) and examining how the key posterior quantities (like parameter estimates or predictions) change. If the results are highly sensitive to reasonable changes in the prior, it suggests the data is not very informative about that aspect of the model, and the prior is playing a significant role. This analysis increases confidence in the findings if the results remain stable across different prior choices.
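A minimal sketch of a sensitivity analysis, reusing the conjugate Beta-Bernoulli coin model from earlier so that each posterior is available in closed form (the priors and data below are illustrative):

```python
from scipy import stats

heads, tails = 12, 8  # hypothetical data

# Several plausible priors: flat, Jeffreys, weakly and strongly informative.
priors = {
    "Uniform Beta(1, 1)":      (1.0, 1.0),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "Weak Beta(2, 2)":         (2.0, 2.0),
    "Strong Beta(50, 50)":     (50.0, 50.0),
}

for name, (a, b) in priors.items():
    posterior = stats.beta(a + heads, b + tails)
    lo, hi = posterior.interval(0.95)
    print(f"{name:26s} posterior mean {posterior.mean():.3f}  "
          f"95% interval [{lo:.3f}, {hi:.3f}]")
```

With only 20 flips, the first three priors give similar posteriors while the strong Beta(50, 50) prior pulls the estimate noticeably toward 0.5, exactly the kind of dependence a sensitivity analysis is meant to surface.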
In summary, selecting a prior P(θ) is an integral part of Bayesian modeling. The choice between subjective, objective, and weakly informative priors depends on the available domain knowledge, the goals of the analysis, and practical considerations like computational stability. Transparency about the chosen prior and sensitivity analysis are important components of a rigorous Bayesian workflow, particularly for the advanced models we will encounter in this course.