Once the structure G of a Bayesian Network (BN) is defined, either through domain expertise or structure learning algorithms, the next significant step is to quantify the probabilistic relationships between connected nodes. This involves learning the parameters of the model, which typically define the Conditional Probability Distributions (CPDs) associated with each variable given its parents in the graph. In contrast to simply finding a single best estimate for these parameters (like Maximum Likelihood Estimation), the Bayesian approach treats the parameters themselves as random variables and aims to compute their posterior distribution given the observed data.
In a Bayesian Network, the parameters, collectively denoted as $\theta$, represent the conditional probabilities that define the model. For a network with variables $X_1, \ldots, X_n$, the parameters $\theta$ consist of sets $\theta_{i \mid \mathrm{pa}(X_i)}$ for each variable $X_i$, specifying the probability distribution $P(X_i \mid \mathrm{pa}(X_i), \theta_{i \mid \mathrm{pa}(X_i)})$ for each configuration of its parents $\mathrm{pa}(X_i)$.
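To make this concrete, here is a minimal sketch (with a made-up two-node network Rain → WetGrass and made-up numbers) of how these per-variable parameter sets might be stored as conditional probability tables:

```python
import numpy as np

# Hypothetical two-node network: Rain -> WetGrass (both binary).
# theta maps each variable to the parameters of its CPD.
theta = {
    # Rain has no parents: a single distribution over its 2 states.
    "Rain": np.array([0.8, 0.2]),        # P(Rain = no), P(Rain = yes)
    # WetGrass has one parent (Rain), so its CPT has one row
    # per parent configuration.
    "WetGrass": np.array([
        [0.9, 0.1],                      # P(WetGrass | Rain = no)
        [0.2, 0.8],                      # P(WetGrass | Rain = yes)
    ]),
}

# Each row must sum to 1, matching the constraint on the parameters.
assert np.allclose(theta["Rain"].sum(), 1.0)
assert np.allclose(theta["WetGrass"].sum(axis=1), 1.0)
```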
The core idea of Bayesian parameter learning is to start with a prior distribution P(θ∣G) over these parameters, reflecting our beliefs before observing any data D. We then use Bayes' theorem to update these beliefs based on the data, resulting in a posterior distribution P(θ∣D,G):
$$P(\theta \mid D, G) = \frac{P(D \mid \theta, G)\, P(\theta \mid G)}{P(D \mid G)}$$
Here, $P(D \mid \theta, G)$ is the likelihood of observing the data $D$ given a specific set of parameters $\theta$ and the graph structure $G$. The term $P(D \mid G)$ is the marginal likelihood or evidence; it is often intractable to compute directly, but it serves only as a normalizing constant.
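To see the roles of these three terms, the short sketch below applies Bayes' theorem to a single parameter on a discrete grid; the grid, flat prior, and data are illustrative assumptions, and the evidence shows up only as the constant that makes the posterior sum to one:

```python
import numpy as np

# Discretize a single parameter theta (probability of "success") on a grid.
theta_grid = np.linspace(0.001, 0.999, 999)

# Prior P(theta | G): a flat prior over the grid (an assumption for illustration).
prior = np.ones_like(theta_grid)
prior /= prior.sum()

# Likelihood P(D | theta, G) for an illustrative dataset: 7 successes, 3 failures.
successes, failures = 7, 3
likelihood = theta_grid**successes * (1 - theta_grid)**failures

# Bayes' theorem: posterior is proportional to likelihood * prior;
# the evidence P(D | G) is just the normalizing constant.
unnormalized = likelihood * prior
evidence = unnormalized.sum()
posterior = unnormalized / evidence

print(f"Posterior mean on the grid: {np.sum(theta_grid * posterior):.3f}")
```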
A fundamental property of Bayesian Networks simplifies this process considerably. Given a complete dataset (no missing values), the likelihood function decomposes according to the graph structure:
$$P(D \mid \theta, G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} P(D_{ij} \mid \theta_{ij})$$
where $D_{ij}$ denotes the observations of $X_i$ in which its parents $\mathrm{pa}(X_i)$ take their $j$-th configuration, and $\theta_{ij}$ are the parameters for variable $X_i$ under that configuration (out of $q_i$ possible configurations). Furthermore, if we assume parameter independence in the prior, meaning $P(\theta \mid G) = \prod_i \prod_j P(\theta_{ij} \mid G)$, then the posterior also decomposes:
$$P(\theta \mid D, G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} P(\theta_{ij} \mid D, G)$$
This decomposition implies we can learn the parameters for each local CPD independently, breaking down a potentially massive learning problem into smaller, manageable ones.
Let's consider the common case where all variables in the BN are discrete. Each variable Xi can take one of ri states. The parameters θijk represent the probability that variable Xi takes its k-th state, given that its parents pa(Xi) are in their j-th configuration:
$$\theta_{ijk} = P(X_i = k \mid \mathrm{pa}(X_i) = j, \theta_{ij}), \qquad \text{where } \sum_{k=1}^{r_i} \theta_{ijk} = 1 \text{ for all } i, j.$$
The likelihood contribution for a specific CPD $P(X_i \mid \mathrm{pa}(X_i) = j)$ given the data $D$ follows a multinomial distribution based on the counts $N_{ijk}$, the number of times in the dataset $D$ that $X_i = k$ while $\mathrm{pa}(X_i) = j$.
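These counts can be read directly off a tabular dataset. A minimal sketch using pandas, with a hypothetical two-column dataset (column names and values are made up):

```python
import pandas as pd

# Hypothetical complete dataset of discrete observations.
data = pd.DataFrame({
    "Rain":     ["no", "no", "yes", "yes", "no", "yes"],
    "WetGrass": ["dry", "dry", "wet", "wet", "wet", "dry"],
})

# Counts N_ijk for variable WetGrass (i) with parent Rain:
# how often each (parent configuration j, child state k) pair occurs.
counts = (
    data.groupby(["Rain", "WetGrass"])
        .size()
        .unstack(fill_value=0)   # rows: parent configurations j, columns: states k
)
print(counts)
```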
For the prior distribution over the parameters $\theta_{ij} = (\theta_{ij1}, \ldots, \theta_{ijr_i})$, a convenient and widely used choice is the Dirichlet distribution. The Dirichlet distribution $\mathrm{Dir}(\alpha_{ij1}, \ldots, \alpha_{ijr_i})$ is defined over the probability simplex (where components are non-negative and sum to 1). Its probability density function is:
$$P(\theta_{ij} \mid \alpha_{ij}) = \frac{\Gamma\!\left(\sum_{k=1}^{r_i} \alpha_{ijk}\right)}{\prod_{k=1}^{r_i} \Gamma(\alpha_{ijk})} \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1}$$
The hyperparameters $\alpha_{ijk} > 0$ can be interpreted as pseudo-counts or prior counts, reflecting prior belief about the occurrences of each state $k$ for $X_i$ given parent configuration $j$. The sum $\alpha_{ij} = \sum_k \alpha_{ijk}$ represents the strength of this prior belief (the equivalent sample size).
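As a small illustration, the sketch below builds a Dirichlet prior for one CPT row with three states using hypothetical pseudo-counts; its mean is $\alpha_{ijk} / \sum_l \alpha_{ijl}$, so larger pseudo-counts encode stronger prior beliefs:

```python
import numpy as np
from scipy.stats import dirichlet

# Hypothetical pseudo-counts for one parent configuration of a 3-state variable.
alpha = np.array([2.0, 2.0, 4.0])   # mild prior preference for the third state

prior = dirichlet(alpha)

# Prior mean: alpha_k / sum(alpha)  ->  [0.25, 0.25, 0.5]
print("Prior mean:", prior.mean())

# Density of the prior at a candidate parameter vector on the simplex.
print("Density at [1/3, 1/3, 1/3]:", prior.pdf([1/3, 1/3, 1/3]))
```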
The Dirichlet prior is conjugate to the multinomial likelihood. This means that if the prior is Dirichlet and the likelihood is multinomial, the posterior distribution is also Dirichlet. Specifically, the posterior for θij is:
$$P(\theta_{ij} \mid D, G) = \mathrm{Dir}(\alpha_{ij1} + N_{ij1}, \ldots, \alpha_{ijr_i} + N_{ijr_i})$$
This elegant result means the posterior hyperparameters are simply the prior hyperparameters updated by the corresponding counts observed in the data. Learning the parameters amounts to counting occurrences in the data and adding them to the prior pseudo-counts.
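Because of conjugacy, the entire learning step for one CPT row reduces to adding two count vectors. A minimal sketch with hypothetical prior pseudo-counts and observed counts:

```python
import numpy as np

# Hypothetical prior pseudo-counts and observed counts for one CPT row
# (variable X_i, parent configuration j, r_i = 3 states).
alpha_ij = np.array([1.0, 1.0, 1.0])   # Laplace-style uniform prior
N_ij     = np.array([12, 3, 0])        # observed counts N_ijk from the data

# Conjugate update: posterior Dirichlet hyperparameters.
alpha_post = alpha_ij + N_ij
print("Posterior hyperparameters:", alpha_post)   # [13.  4.  1.]
```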
While the Bayesian approach yields a full posterior distribution P(θ∣D,G), often we need a single point estimate for the parameters, for example, to populate a CPT for inference. Common choices include:
Posterior Mean (Bayesian Estimate): This is the expected value of the parameters under the posterior distribution. For a Dirichlet posterior $\mathrm{Dir}(\alpha'_{ij1}, \ldots, \alpha'_{ijr_i})$ with $\alpha'_{ijk} = \alpha_{ijk} + N_{ijk}$, the posterior mean is:
$$E[\theta_{ijk} \mid D, G] = \frac{\alpha'_{ijk}}{\sum_{l=1}^{r_i} \alpha'_{ijl}} = \frac{\alpha_{ijk} + N_{ijk}}{\sum_{l=1}^{r_i} (\alpha_{ijl} + N_{ijl})}$$
Notice how this estimate smooths the empirical frequencies $N_{ijk} / \sum_l N_{ijl}$ (the Maximum Likelihood Estimate, MLE) using the prior pseudo-counts $\alpha_{ijk}$. This helps prevent zero probabilities for events not seen in the data, which is especially important with smaller datasets. A common non-informative choice is the Laplace smoothing prior, $\alpha_{ijk} = 1$ for all $i, j, k$.
Maximum A Posteriori (MAP) Estimate: This finds the parameter values that maximize the posterior probability density. For the Dirichlet posterior, the MAP estimate is:
$$\theta_{ijk}^{\mathrm{MAP}} = \frac{\alpha'_{ijk} - 1}{\sum_{l=1}^{r_i} (\alpha'_{ijl} - 1)} = \frac{\alpha_{ijk} + N_{ijk} - 1}{\sum_{l=1}^{r_i} (\alpha_{ijl} + N_{ijl} - 1)}$$
provided all $\alpha'_{ijk} > 1$. If any $\alpha'_{ijk} \le 1$, the mode lies on the boundary of the simplex.
The choice between the posterior mean and the MAP estimate depends on the application, but the posterior mean is often preferred for its averaging nature and its connection to minimizing squared-error loss; both estimates are computed side by side in the sketch below.
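The sketch below computes the MLE, the posterior mean, and the MAP estimate from the same hypothetical prior pseudo-counts and observed counts, so the effect of the prior is easy to see:

```python
import numpy as np

# Hypothetical prior pseudo-counts and observed counts for one CPT row.
alpha = np.array([2.0, 2.0, 2.0])
N     = np.array([12, 3, 0])
alpha_post = alpha + N                      # Dirichlet posterior hyperparameters

mle            = N / N.sum()                                   # empirical frequencies
posterior_mean = alpha_post / alpha_post.sum()
map_estimate   = (alpha_post - 1) / (alpha_post - 1).sum()     # valid: all alpha_post > 1

print("MLE:           ", np.round(mle, 3))             # [0.8   0.2   0.   ]
print("Posterior mean:", np.round(posterior_mean, 3))  # [0.667 0.238 0.095]
print("MAP:           ", np.round(map_estimate, 3))    # [0.722 0.222 0.056]
```

Note how the unseen third state receives zero probability under the MLE, but not under either Bayesian estimate.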
Consider a simple parameter $\theta$ representing the probability of a coin landing heads. Let's assume a Beta prior (a Dirichlet distribution with $K = 2$ states: heads, tails), say $\mathrm{Beta}(\alpha_H = 2, \alpha_T = 2)$. This reflects a prior belief centered at 0.5 but with some uncertainty.
Suppose we flip the coin 10 times and observe $N_H = 7$ heads and $N_T = 3$ tails. The posterior distribution becomes:
$$P(\theta \mid D) = \mathrm{Beta}(\alpha_H + N_H, \alpha_T + N_T) = \mathrm{Beta}(2 + 7, 2 + 3) = \mathrm{Beta}(9, 5)$$
The posterior mean estimate for $\theta$ (the probability of heads) is:
$$E[\theta \mid D] = \frac{9}{9 + 5} = \frac{9}{14} \approx 0.643$$
The MLE would be $7/10 = 0.7$. The Bayesian estimate is pulled slightly towards the prior mean (0.5).
We can visualize this update with the short sketch below.
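This is a minimal sketch using SciPy and Matplotlib (both assumed available) that overlays the Beta(2, 2) prior and the Beta(9, 5) posterior:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

theta = np.linspace(0, 1, 500)

prior     = beta(2, 2)        # Beta(alpha_H = 2, alpha_T = 2)
posterior = beta(9, 5)        # after observing 7 heads, 3 tails

plt.plot(theta, prior.pdf(theta), color="tab:blue", label="Prior: Beta(2, 2)")
plt.plot(theta, posterior.pdf(theta), color="pink", label="Posterior: Beta(9, 5)")
plt.axvline(9 / 14, color="gray", linestyle="--", label="Posterior mean ≈ 0.643")
plt.xlabel("θ (probability of heads)")
plt.ylabel("Density")
plt.legend()
plt.show()
```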
The plot shows how the probability density shifts and becomes narrower (more confident) after observing data. The prior (blue) is centered at 0.5, while the posterior (pink) peaks near 0.67 with a mean of about 0.64, reflecting the observed data (7 heads, 3 tails).
This principle extends directly to the multinomial parameters in CPTs of Bayesian Networks, where each row of a CPT corresponding to a specific parent configuration is updated independently using its relevant prior counts and observed data counts.
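For a full network, libraries such as pgmpy automate this row-by-row Dirichlet update. A sketch assuming the pgmpy library (class and argument names may vary between releases), with a hypothetical structure and dataset:

```python
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import BayesianEstimator

# Hypothetical structure and complete discrete dataset.
model = BayesianNetwork([("Rain", "WetGrass")])
data = pd.DataFrame({
    "Rain":     ["no", "no", "yes", "yes", "no", "yes"],
    "WetGrass": ["dry", "dry", "wet", "wet", "wet", "dry"],
})

# Bayesian parameter estimation with a BDeu prior: each CPT row receives
# Dirichlet pseudo-counts that together sum to the equivalent sample size.
model.fit(data, estimator=BayesianEstimator,
          prior_type="BDeu", equivalent_sample_size=5)

for cpd in model.get_cpds():
    print(cpd)
```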
While elegant for complete data and conjugate priors, Bayesian parameter learning faces challenges:
Incomplete data: With missing values or hidden variables, the likelihood no longer decomposes cleanly and the posterior has no closed form; approximate techniques such as Expectation-Maximization, variational inference, or MCMC sampling are typically required.
Prior specification: The hyperparameters (and the equivalent sample size) must be chosen, and with small datasets this choice can noticeably influence the resulting estimates.
Non-conjugate and continuous models: Continuous variables or non-Dirichlet priors generally rule out closed-form updates, again calling for approximate inference.
In summary, Bayesian parameter learning provides a principled way to estimate the parameters of a Bayesian Network by combining prior knowledge with observed data. The use of conjugate priors like the Dirichlet distribution greatly simplifies calculations for discrete networks with complete data, allowing for efficient updates and providing a full posterior distribution that captures parameter uncertainty. This uncertainty is a distinct advantage over point estimates like MLE and can be propagated through subsequent inference tasks.