Understanding probability distributions is one thing, but seeing them in action by generating data that follows these distributions can significantly aid comprehension. Modern scientific computing libraries in Python, particularly SciPy and NumPy, provide powerful tools to generate random numbers (samples) from a wide variety of probability distributions. This process is often called sampling.
Sampling is useful for many tasks, including simulating data that mirrors real-world phenomena, running experiments, and testing hypotheses.
We will primarily use the scipy.stats module, which offers a consistent interface for working with distributions, including generating random variates (samples) using the .rvs() method. We'll also occasionally mention equivalent functions in numpy.random.
Let's start by importing the necessary libraries. We'll need scipy.stats for the distributions and matplotlib.pyplot (often imported as plt) for basic visualization, though we'll render charts using Plotly format for interactive web display. We'll also use numpy for numerical operations.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt # We use this, but output Plotly JSON
# Configure visualizations (optional, helps make plots nicer with matplotlib)
# plt.style.use('seaborn-v0_8-whitegrid')
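Because the samples are random, the outputs shown in this section will differ from run to run. As a minimal sketch of one way to make results repeatable, the .rvs() method accepts a random_state argument. The seed values 42 and 123 and the variable names below are arbitrary choices for illustration, not values used elsewhere in this text.

```python
import numpy as np
from scipy import stats

# Passing the same integer seed twice yields identical samples.
first_draw = stats.norm.rvs(size=5, random_state=42)
second_draw = stats.norm.rvs(size=5, random_state=42)
print(np.array_equal(first_draw, second_draw))  # True

# A NumPy Generator can be passed instead. Reusing one generator across
# calls advances its internal state, so successive draws differ from each
# other while the run as a whole stays repeatable.
rng = np.random.default_rng(seed=123)
draw_a = stats.uniform.rvs(size=3, random_state=rng)
draw_b = stats.uniform.rvs(size=3, random_state=rng)
print(draw_a)
print(draw_b)
```

Fixing a seed is mainly useful for tutorials, tests, and debugging. For ordinary simulation work you may prefer to leave random_state unset so that each run produces fresh samples.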
Discrete distributions deal with countable outcomes. We'll look at the Bernoulli and Binomial distributions.
The Bernoulli distribution models a single trial with two possible outcomes: success (usually coded as 1) with probability p, and failure (usually coded as 0) with probability 1−p. Think of a single coin flip.
To generate samples from a Bernoulli distribution using scipy.stats, we use stats.bernoulli.rvs(). The main parameter is p, the probability of success.
# Parameters
prob_success = 0.7 # Probability of success (e.g., heads)
num_samples = 10 # Number of trials (samples) to generate
# Generate samples
# Each sample is either 0 or 1
bernoulli_samples = stats.bernoulli.rvs(p=prob_success, size=num_samples)
print(f"Bernoulli Samples (p={prob_success}): {bernoulli_samples}")
Running this might produce output like: Bernoulli Samples (p=0.7): [1 1 0 1 1 1 0 1 0 1]. Each number represents the outcome of one trial. If you generate many samples, you'd expect the proportion of 1s to be close to p.
(Equivalent NumPy function: np.random.binomial(1, prob_success, size=num_samples))
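The claim that the proportion of 1s approaches p can be checked directly. The following is a small sketch, assuming illustrative sample sizes (10, 1,000, and 100,000) and a fixed seed of 0 chosen only so the run is repeatable.

```python
import numpy as np
from scipy import stats

prob_success = 0.7
sample_sizes = [10, 1_000, 100_000]  # illustrative sizes, not from the text

for n in sample_sizes:
    # random_state=0 is an arbitrary seed used only for repeatability
    samples = stats.bernoulli.rvs(p=prob_success, size=n, random_state=0)
    # The mean of 0/1 outcomes is the proportion of successes
    proportion = samples.mean()
    print(f"n={n:>7}: proportion of 1s = {proportion:.4f}")
```

With only 10 samples the proportion can be noticeably off from 0.7, while the larger sample sizes should land much closer to it.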
The Binomial distribution models the number of successes in a fixed number, n, of independent Bernoulli trials, each with the same probability of success p. For example, counting the number of heads in 10 coin flips.
We use stats.binom.rvs(), specifying n (number of trials) and p (probability of success per trial). The size parameter indicates how many times we want to run this experiment (i.e., how many samples of the count of successes we want).
# Parameters
num_trials = 10 # Number of Bernoulli trials in one experiment (n)
prob_success = 0.5 # Probability of success in each trial (p)
num_experiments = 1000 # Number of times we run the experiment (generate samples)
# Generate samples
# Each sample is the count of successes in 'n' trials
binomial_samples = stats.binom.rvs(n=num_trials, p=prob_success, size=num_experiments)
print(f"First 10 Binomial Samples (n={num_trials}, p={prob_success}): {binomial_samples[:10]}")
# Example Output: First 10 Binomial Samples (n=10, p=0.5): [5 6 5 4 7 5 6 5 5 3]
Each number in the output represents the total number of successes obtained in one set of 10 trials. To visualize the distribution of these counts, we can create a histogram.
# Visualization (code using Matplotlib)
# plt.figure(figsize=(8, 4))
# plt.hist(binomial_samples, bins=np.arange(num_trials + 2) - 0.5, density=True, alpha=0.7, color='#15aabf', edgecolor='black')
# plt.title(f'Binomial Distribution Samples (n={num_trials}, p={prob_success})')
# plt.xlabel('Number of Successes')
# plt.ylabel('Estimated Probability')
# plt.xticks(range(num_trials + 1))
# plt.grid(axis='y')
# plt.show()
# Actual Plotly JSON output for the histogram
hist_counts, bin_edges = np.histogram(binomial_samples, bins=np.arange(num_trials + 2) - 0.5, density=True)
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
binomial_fig = {"data": [{"type": "bar", "x": [str(int(x)) for x in bin_centers], "y": hist_counts.tolist(), "name": "Sample Frequency", "marker": {"color": "#15aabf", "line": {"color": "#495057", "width": 1}}}], "layout": {"title": {"text": "Simulated Binomial Distribution (n=10, p=0.5)"}, "xaxis": {"title": {"text": "Number of Successes"}}, "yaxis": {"title": {"text": "Estimated Probability"}}, "bargap": 0.1, "width": 600, "height": 400}}
A histogram of 1000 samples drawn from a Binomial distribution with n=10 trials and success probability p=0.5. The shape approximates the theoretical Binomial PMF, centered around n×p=5.
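To make the comparison with the theoretical PMF concrete, the sketch below tabulates the observed relative frequency of each count next to stats.binom.pmf(). The seed of 1 and the variable names empirical and theoretical are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

num_trials, prob_success, num_experiments = 10, 0.5, 1000
binomial_samples = stats.binom.rvs(
    n=num_trials, p=prob_success, size=num_experiments, random_state=1
)

# Possible outcomes are the integer counts 0..n
outcomes = np.arange(num_trials + 1)

# Relative frequency of each outcome among the simulated experiments
empirical = np.bincount(binomial_samples, minlength=num_trials + 1) / num_experiments

# Exact probabilities from the Binomial PMF
theoretical = stats.binom.pmf(outcomes, n=num_trials, p=prob_success)

for k, emp, theo in zip(outcomes, empirical, theoretical):
    print(f"k={k:2d}  empirical={emp:.3f}  theoretical={theo:.3f}")

print(f"sample mean = {binomial_samples.mean():.3f}, n*p = {num_trials * prob_success}")
```

The two columns should agree to within a few hundredths for 1000 experiments, and the agreement tightens as num_experiments grows.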
Continuous distributions describe outcomes over a continuous range.
The Uniform distribution assigns equal probability density to all outcomes within a specified range [a,b). Outcomes outside this range have zero probability.
We use stats.uniform.rvs(). It takes loc (the starting point, a) and scale (the width of the range, b−a) as parameters.
# Parameters
lower_bound = 5.0 # Start of the interval (a)
upper_bound = 10.0 # End of the interval (b)
num_samples = 1000
# Calculate loc and scale
loc_param = lower_bound
scale_param = upper_bound - lower_bound
# Generate samples
uniform_samples = stats.uniform.rvs(loc=loc_param, scale=scale_param, size=num_samples)
print(f"First 10 Uniform Samples (range=[{lower_bound}, {upper_bound})): {uniform_samples[:10]}")
# Example Output: First 10 Uniform Samples (range=[5.0, 10.0)): [7.82 9.21 5.34 6.78 8.89 5.01 9.98 7.11 6.05 8.43]
(Equivalent NumPy function: np.random.uniform(low=lower_bound, high=upper_bound, size=num_samples))
A histogram of these samples should appear roughly flat across the interval [5,10).
# Visualization (code using Matplotlib)
# plt.figure(figsize=(8, 4))
# plt.hist(uniform_samples, bins=20, density=True, alpha=0.7, color='#fd7e14', edgecolor='black')
# plt.title(f'Uniform Distribution Samples (range=[{lower_bound}, {upper_bound}))')
# plt.xlabel('Value')
# plt.ylabel('Probability Density')
# plt.grid(axis='y')
# plt.show()
# Actual Plotly JSON output for the histogram
hist_counts, bin_edges = np.histogram(uniform_samples, bins=20, density=True)
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
uniform_fig = {"data": [{"type": "bar", "x": bin_centers.tolist(), "y": hist_counts.tolist(), "name": "Sample Density", "marker": {"color": "#fd7e14", "line": {"color": "#495057", "width": 1}}}], "layout": {"title": {"text": "Simulated Uniform Distribution (Range=[5, 10))"}, "xaxis": {"title": {"text": "Value"}}, "yaxis": {"title": {"text": "Estimated Density"}}, "bargap": 0.05, "width": 600, "height": 400}}
A histogram of 1000 samples drawn from a Uniform distribution over the interval [5,10). The density is approximately constant within this range.
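Since scale is the width of the interval rather than its upper end, it is easy to pass the wrong value. The sketch below checks the sample range and mean, and then shows what happens if scale is mistakenly set to the upper bound. The seed of 2 and the name mistaken_samples are illustrative assumptions.

```python
import numpy as np
from scipy import stats

lower_bound, upper_bound = 5.0, 10.0
num_samples = 1000

# Correct usage: scale is the interval width (b - a)
uniform_samples = stats.uniform.rvs(
    loc=lower_bound, scale=upper_bound - lower_bound,
    size=num_samples, random_state=2
)
print(f"min={uniform_samples.min():.3f}, max={uniform_samples.max():.3f}")
print(f"sample mean={uniform_samples.mean():.3f}, "
      f"theoretical mean={(lower_bound + upper_bound) / 2}")

# Common mistake: passing the upper bound as scale.
# This samples from the interval starting at 5 with width 10, i.e. up to 15.
mistaken_samples = stats.uniform.rvs(
    loc=lower_bound, scale=upper_bound,
    size=num_samples, random_state=2
)
print(f"mistaken max={mistaken_samples.max():.3f}")
```

The correctly parameterized samples stay between 5 and 10 with a mean near 7.5, while the mistaken call produces values well above 10.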
The Normal distribution, often called the bell curve, is perhaps the most common continuous distribution. It's characterized by its mean (μ, loc) and standard deviation (σ, scale). The distribution is symmetric around the mean.
We use stats.norm.rvs() with loc for the mean and scale for the standard deviation.
# Parameters
mean_val = 0.0 # Mean (mu)
std_dev = 1.0 # Standard Deviation (sigma)
num_samples = 1000
# Generate samples
normal_samples = stats.norm.rvs(loc=mean_val, scale=std_dev, size=num_samples)
print(f"First 10 Normal Samples (mean={mean_val}, std_dev={std_dev}): {normal_samples[:10]}")
# Example Output: First 10 Normal Samples (mean=0.0, std_dev=1.0): [-0.54 1.25 0.21 -1.87 0.88 -0.76 0.33 -0.11 -0.45 1.05]
(Equivalent NumPy function: np.random.normal(loc=mean_val, scale=std_dev, size=num_samples))
A histogram of normal samples will show the characteristic bell shape, centered at the mean.
# Visualization (code using Matplotlib)
# plt.figure(figsize=(8, 4))
# plt.hist(normal_samples, bins=30, density=True, alpha=0.7, color='#4263eb', edgecolor='black')
# plt.title(f'Normal Distribution Samples (mean={mean_val}, std_dev={std_dev})')
# plt.xlabel('Value')
# plt.ylabel('Probability Density')
# plt.grid(axis='y')
# plt.show()
# Actual Plotly JSON output for the histogram
hist_counts, bin_edges = np.histogram(normal_samples, bins=30, density=True)
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
normal_fig = {"data": [{"type": "bar", "x": bin_centers.tolist(), "y": hist_counts.tolist(), "name": "Sample Density", "marker": {"color": "#4263eb", "line": {"color": "#495057", "width": 1}}}], "layout": {"title": {"text": "Simulated Normal Distribution (Mean=0, StdDev=1)"}, "xaxis": {"title": {"text": "Value"}}, "yaxis": {"title": {"text": "Estimated Density"}}, "bargap": 0.05, "width": 600, "height": 400}}
A histogram of 1000 samples drawn from a Standard Normal distribution (μ=0,σ=1). The distribution clearly shows the characteristic bell shape centered at 0.
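Beyond the visual bell shape, the samples can be checked numerically. The sketch below compares the sample mean and standard deviation with μ and σ, and compares the fraction of samples within 1, 2, and 3 standard deviations of the mean with the values given by stats.norm.cdf(). It assumes a larger sample of 100,000 and a seed of 3, both chosen only for this illustration.

```python
import numpy as np
from scipy import stats

mean_val, std_dev = 0.0, 1.0
# A larger sample than in the text, so the estimates are tighter
normal_samples = stats.norm.rvs(
    loc=mean_val, scale=std_dev, size=100_000, random_state=3
)

print(f"sample mean = {normal_samples.mean():.4f} (mu = {mean_val})")
print(f"sample std  = {normal_samples.std(ddof=1):.4f} (sigma = {std_dev})")

for k in (1, 2, 3):
    # Fraction of samples falling within k standard deviations of the mean
    within = np.mean(np.abs(normal_samples - mean_val) <= k * std_dev)
    # Theoretical probability of the same event for a Normal distribution
    expected = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} sigma: sampled={within:.4f}, theoretical={expected:.4f}")
```

The sampled fractions should sit close to the familiar values of roughly 68%, 95%, and 99.7%.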
Being able to generate samples from these fundamental distributions is a practical skill. It allows you to simulate data that mirrors real-world phenomena characterized by these patterns, providing a basis for experiments, testing hypotheses, and understanding the inputs or outputs of machine learning models that rely on probabilistic assumptions. As you encounter more complex distributions, the process of sampling using libraries like SciPy remains similar.
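As a brief sketch of how the same interface carries over, the example below draws samples from a few "frozen" scipy.stats distributions (objects created with their parameters fixed) using the identical .rvs() call. The exponential and Poisson distributions and their parameter values are assumptions picked purely for illustration; they are not covered in this section.

```python
from scipy import stats

# Frozen distributions: parameters are fixed once, then reused
frozen_distributions = {
    "exponential (scale=2)": stats.expon(scale=2.0),
    "poisson (mu=3)": stats.poisson(mu=3.0),
    "normal (loc=0, scale=1)": stats.norm(loc=0.0, scale=1.0),
}

sample_means = {}
for label, dist in frozen_distributions.items():
    # The sampling call is the same regardless of the distribution
    samples = dist.rvs(size=10_000, random_state=4)
    sample_means[label] = samples.mean()
    print(f"{label}: sample mean={samples.mean():.3f}, "
          f"theoretical mean={dist.mean():.3f}")
```

Only the distribution object and its parameters change; the sampling step itself stays the same.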
© 2025 ApX Machine Learning