Alright, let's put theory into practice. Understanding the mathematical formulas for probability distributions is one thing, but seeing them in action through simulation and visualization provides a much deeper intuition. This is especially relevant when you need to decide if a particular distribution is a good fit for your observed data in a machine learning context.
We'll use Python, specifically the scipy.stats module, along with numpy for numerical operations and matplotlib or seaborn for plotting (often used alongside plotly for interactive web visualizations).
First, ensure you have the necessary libraries installed and import them:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
# Optional: Set a style for seaborn plots
sns.set_style("whitegrid")
Let's start with the Binomial distribution, which models the number of successes k in a fixed number n of independent Bernoulli trials, each with a success probability p.
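As a quick side check (a small sketch using arbitrary values n=10, p=0.5, k=4, separate from the experiment below), we can confirm that stats.binom.pmf agrees with the closed-form expression C(n, k) · p^k · (1 − p)^(n − k):
import math  # only needed for math.comb in this check
n_check, p_check, k_check = 10, 0.5, 4  # arbitrary illustrative values
manual_pmf = math.comb(n_check, k_check) * p_check**k_check * (1 - p_check)**(n_check - k_check)
scipy_pmf = stats.binom.pmf(k=k_check, n=n_check, p=p_check)
print(f"Closed-form PMF: {manual_pmf:.6f}")  # 0.205078
print(f"scipy PMF:       {scipy_pmf:.6f}")  # 0.205078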
Imagine we flip a biased coin (p=0.6) 20 times (n=20) and repeat this experiment many times (say, 1000). We can simulate the number of heads obtained in each experiment using scipy.stats.binom.rvs (random variates).
# Parameters for the Binomial distribution
n_binom = 20 # number of trials
p_binom = 0.6 # probability of success
size_binom = 1000 # number of simulations (experiments)
# Simulate random samples from the Binomial distribution
binomial_samples = stats.binom.rvs(n=n_binom, p=p_binom, size=size_binom)
# Print the first 10 simulated outcomes
print(f"First 10 simulated outcomes (number of successes out of {n_binom}):")
print(binomial_samples[:10])
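Before plotting, a quick sanity check can be useful (just a sketch; the exact numbers vary from run to run): the sample mean and variance of the simulated counts should be close to the theoretical values n·p = 12 and n·p·(1−p) = 4.8.
# Compare sample statistics of the simulation to the theoretical moments
print(f"Sample mean:     {binomial_samples.mean():.3f}  (theory: {n_binom * p_binom})")
print(f"Sample variance: {binomial_samples.var():.3f}  (theory: {n_binom * p_binom * (1 - p_binom):.1f})")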
Now, let's visualize the results of our simulation using a histogram and compare it to the theoretical probability mass function (PMF). The PMF gives the probability of obtaining exactly k successes.
# Calculate the theoretical PMF
k_values_binom = np.arange(0, n_binom + 1)
pmf_binom = stats.binom.pmf(k=k_values_binom, n=n_binom, p=p_binom)
# Create the plot using Matplotlib (or adapt for Plotly)
plt.figure(figsize=(10, 6))
# Plot the histogram of simulated data
# Use density=True to normalize the histogram so it compares with the PMF
# Offset the bin edges by 0.5 so each integer count k sits in the center of its own bar
bins_binom = np.arange(0, n_binom + 2) - 0.5
plt.hist(binomial_samples, bins=bins_binom, density=True, alpha=0.6, color='#4dabf7', label='Simulated Data (Histogram)')
# Plot the theoretical PMF
# Use 'o-' to plot points connected by a line
plt.plot(k_values_binom, pmf_binom, 'o-', color='#f03e3e', label='Theoretical PMF')
plt.xlabel(f"Number of Successes (k) in {n_binom} Trials")
plt.ylabel("Probability / Density")
plt.title(f"Binomial Distribution (n={n_binom}, p={p_binom}) Simulation vs. Theory")
plt.legend()
plt.grid(True)
plt.show() # In a web environment, you'd convert this to Plotly JSON
Comparison of 1000 simulated Binomial experiments (n=20, p=0.6) with the theoretical probability mass function. The histogram closely follows the shape predicted by the PMF.
Notice how the histogram of our simulated results closely approximates the shape of the theoretical PMF. As the number of simulations (size_binom) increases, this approximation generally becomes better.
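Here is a minimal sketch of that convergence (the sample sizes below are arbitrary choices): compute the empirical frequency of each count and track how far it strays from the theoretical PMF as the number of simulations grows.
# Empirical PMF vs. theoretical PMF for increasing numbers of simulations
for size in [100, 1000, 100000]:
    samples = stats.binom.rvs(n=n_binom, p=p_binom, size=size)
    empirical = np.bincount(samples, minlength=n_binom + 1) / size  # frequency of each k in 0..n
    max_error = np.abs(empirical - pmf_binom).max()
    print(f"size={size:>6}: max |empirical - theoretical| = {max_error:.4f}")
The largest gap between the simulated frequencies and the theoretical probabilities should shrink (roughly like one over the square root of the sample size) as more experiments are simulated.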
Next, let's work with the Normal (Gaussian) distribution, characterized by its mean μ and standard deviation σ (or variance σ²). It's arguably the most common distribution in statistics and machine learning, often used to model natural phenomena or errors.
Let's simulate data from a standard Normal distribution (μ=0,σ=1) and compare the histogram of the samples to the theoretical probability density function (PDF).
# Parameters for the Normal distribution
mu_norm = 0 # mean
sigma_norm = 1 # standard deviation
size_norm = 1000 # number of samples
# Simulate random samples from the Normal distribution
normal_samples = stats.norm.rvs(loc=mu_norm, scale=sigma_norm, size=size_norm)
# Print the first 10 simulated values
print(f"First 10 simulated values from N({mu_norm}, {sigma_norm**2}):")
print(normal_samples[:10])
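# Quick sanity check (a sketch; exact numbers vary per run): the sample mean and
# sample standard deviation should land close to the true parameters mu=0 and sigma=1
print(f"Sample mean: {normal_samples.mean():.3f}  (theory: {mu_norm})")
print(f"Sample std:  {normal_samples.std(ddof=1):.3f}  (theory: {sigma_norm})")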
# Generate points for the theoretical PDF curve
x_values_norm = np.linspace(mu_norm - 4*sigma_norm, mu_norm + 4*sigma_norm, 200)
pdf_norm = stats.norm.pdf(x_values_norm, loc=mu_norm, scale=sigma_norm)
# Create the plot
plt.figure(figsize=(10, 6))
# Plot the histogram of simulated data
plt.hist(normal_samples, bins=30, density=True, alpha=0.6, color='#74c0fc', label='Simulated Data (Histogram)')
# Plot the theoretical PDF
plt.plot(x_values_norm, pdf_norm, color='#f76707', linewidth=2, label='Theoretical PDF')
plt.xlabel("Value")
plt.ylabel("Density")
plt.title(f"Normal Distribution (μ={mu_norm}, σ={sigma_norm}) Simulation vs. Theory")
plt.legend()
plt.grid(True)
plt.show() # Again, adapt for Plotly in web deployment
Histogram of 1000 samples drawn from a standard Normal distribution (N(0, 1)), overlaid with the theoretical probability density function (PDF). The simulated data closely follows the characteristic bell shape.
Again, the histogram of the simulated data provides a good approximation of the bell-shaped curve defined by the theoretical PDF. This visualization helps confirm that our random samples behave as expected for a Normal distribution.
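Visual inspection can be complemented by a formal goodness-of-fit check. One option, sketched below, is the one-sample Kolmogorov-Smirnov test from scipy.stats, which compares the empirical CDF of the samples to the theoretical Normal CDF:
# One-sample Kolmogorov-Smirnov test of the samples against N(0, 1)
ks_stat, ks_pvalue = stats.kstest(normal_samples, 'norm', args=(mu_norm, sigma_norm))
print(f"KS statistic: {ks_stat:.4f}, p-value: {ks_pvalue:.4f}")
A p-value well above a conventional threshold such as 0.05 means there is no evidence against the Normal fit, which is what we expect here since the data were generated from N(0, 1).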
Let's do one more: the Poisson distribution. It models the probability of a given number of events occurring in a fixed interval of time or space, given the average rate of occurrence λ (for example, the number of emails arriving per hour).
Assume emails arrive at an average rate of λ=5 per hour. We'll simulate the number of emails received in 1000 different hours.
# Parameters for the Poisson distribution
lambda_poisson = 5 # average rate (e.g., emails per hour)
size_poisson = 1000 # number of simulations (hours observed)
# Simulate random samples from the Poisson distribution
poisson_samples = stats.poisson.rvs(mu=lambda_poisson, size=size_poisson)
# Print the first 10 simulated counts
print(f"First 10 simulated counts (events per interval, λ={lambda_poisson}):")
print(poisson_samples[:10])
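# A distinctive Poisson property (sketch only; numbers vary per run): the sample mean
# and the sample variance should both be close to lambda
print(f"Sample mean:     {poisson_samples.mean():.3f}  (theory: {lambda_poisson})")
print(f"Sample variance: {poisson_samples.var():.3f}  (theory: {lambda_poisson})")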
# Calculate the theoretical PMF
# Choose a reasonable upper limit for k, e.g., mean + few standard deviations
k_max_poisson = int(lambda_poisson + 4 * np.sqrt(lambda_poisson))
k_values_poisson = np.arange(0, k_max_poisson + 1)
pmf_poisson = stats.poisson.pmf(k=k_values_poisson, mu=lambda_poisson)
# Create the plot
plt.figure(figsize=(10, 6))
# Plot the histogram of simulated data
# Align bins correctly for discrete data
bins_poisson = np.arange(poisson_samples.min(), poisson_samples.max() + 2) - 0.5
plt.hist(poisson_samples, bins=bins_poisson, density=True, alpha=0.6, color='#69db7c', label='Simulated Data (Histogram)')
# Plot the theoretical PMF
plt.plot(k_values_poisson, pmf_poisson, 'o-', color='#be4bdb', label='Theoretical PMF')
plt.xlabel(f"Number of Events (k) per Interval")
plt.ylabel("Probability / Density")
plt.title(f"Poisson Distribution (λ={lambda_poisson}) Simulation vs. Theory")
plt.legend()
plt.grid(True)
plt.xticks(k_values_poisson) # Ensure ticks are at integer values
plt.xlim(bins_poisson.min(), bins_poisson.max())
plt.show()
Comparison of 1000 simulated Poisson experiments (λ=5) with the theoretical probability mass function. The histogram mirrors the PMF, peaking around the average rate λ.
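To close the loop on the question of whether a distribution is a good fit for observed data, here is one possible approach, sketched with an arbitrary cutoff that pools all counts of 10 or more into a single tail bin so every expected count stays reasonably large: a chi-square goodness-of-fit test comparing the observed counts per k against the counts expected under Poisson(λ=5).
# Chi-square goodness-of-fit test: observed counts per k vs. expected counts under Poisson(5)
max_k = 10  # arbitrary cutoff: counts >= 10 are pooled into one tail bin
observed = np.bincount(np.minimum(poisson_samples, max_k), minlength=max_k + 1)
expected_probs = stats.poisson.pmf(np.arange(max_k), mu=lambda_poisson)
expected_probs = np.append(expected_probs, 1 - expected_probs.sum())  # tail bin: P(X >= 10)
expected = expected_probs * size_poisson  # expected counts sum to the number of simulations
chi2_stat, chi2_pvalue = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"Chi-square statistic: {chi2_stat:.2f}, p-value: {chi2_pvalue:.4f}")
Since these counts really were drawn from a Poisson(5) distribution, the p-value will usually be comfortably above 0.05, indicating no evidence against the fit.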
This practical exercise demonstrates how simulation helps bridge the gap between abstract distribution formulas and concrete data. By generating samples and plotting them against theoretical functions using libraries like SciPy and Matplotlib/Plotly, you gain a better feel for the shape, spread, and characteristics of common probability distributions. This skill is valuable for exploratory data analysis, model selection, and understanding the probabilistic foundations of many machine learning algorithms.