Okay, let's put the theory of sampling distributions and interval estimation into practice. We've discussed how the Central Limit Theorem allows us to make inferences about a population mean using the mean of a sample, even if the original population isn't normally distributed. We also covered how confidence intervals give us a range of plausible values for that population parameter. Now, we'll use Python to simulate these processes and see them work firsthand.
The Central Limit Theorem (CLT) is a cornerstone of inferential statistics. It states that if you draw sufficiently large random samples from a population (with replacement or from a large enough population), the distribution of the sample means will be approximately normally distributed, regardless of the shape of the original population distribution. Let's visualize this.
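Stated more precisely: if X₁, …, Xₙ are independent draws from a population with mean μ and standard deviation σ, then for sufficiently large n the sample mean X̄ is approximately normal with mean μ and standard deviation σ/√n. That standard deviation of the sampling distribution is called the standard error, and it is the quantity our simulation will compare against.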
Imagine our population follows an Exponential distribution. This distribution is decidedly not normal; it's heavily skewed.
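Concretely, an exponential distribution with scale parameter β has density f(x) = (1/β)·e^(−x/β) for x ≥ 0, and both its mean and its standard deviation equal β. With β = 10, the population mean and standard deviation are therefore both 10, a fact we'll use below when computing the theoretical standard error.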
import numpy as np
import scipy.stats as stats
import plotly.graph_objects as go

# Define population parameters (Exponential distribution)
population_scale = 10  # Scale parameter (beta); the mean of an exponential is beta
population_size = 100000

# Generate the population data
np.random.seed(42)  # for reproducibility
population_data = np.random.exponential(scale=population_scale, size=population_size)

# Plot the population distribution
fig_pop = go.Figure(data=[go.Histogram(x=population_data, nbinsx=100, name='Population', marker_color='#1f77b4')])  # Using a default blue
fig_pop.update_layout(
    title_text='Population Distribution (Exponential)',
    xaxis_title_text='Value',
    yaxis_title_text='Frequency',
    bargap=0.1
)
# fig_pop.show()  # In a real notebook environment
The plot above shows the characteristic right-skewed shape of the Exponential distribution representing our population. The true mean of this distribution is the scale parameter, which we set to 10.
Now, let's simulate the process of drawing many samples from this population and calculating the mean of each sample. We'll choose a sample size, say n=30, which is often considered large enough for the CLT to start taking effect. We'll draw 1000 such samples.
sample_size = 30
number_of_samples = 1000
sample_means = []

np.random.seed(123)  # different seed for sampling
for _ in range(number_of_samples):
    sample = np.random.choice(population_data, size=sample_size, replace=True)
    sample_mean = np.mean(sample)
    sample_means.append(sample_mean)

# Calculate theoretical mean and standard error for the sampling distribution
theoretical_mean = population_scale          # Mean of an exponential is beta
theoretical_std_dev_pop = population_scale   # Std dev of an exponential is also beta
standard_error = theoretical_std_dev_pop / np.sqrt(sample_size)

# Plot the distribution of sample means
fig_means = go.Figure()

# Histogram of sample means
fig_means.add_trace(go.Histogram(
    x=sample_means,
    nbinsx=30,
    name='Sample Means',
    marker_color='#ff7f0e',  # Using a default orange
    histnorm='probability density'  # Normalize to compare with the PDF
))

# Overlay the theoretical normal distribution (CLT prediction)
x_norm = np.linspace(min(sample_means), max(sample_means), 100)
y_norm = stats.norm.pdf(x_norm, loc=theoretical_mean, scale=standard_error)
fig_means.add_trace(go.Scatter(
    x=x_norm,
    y=y_norm,
    mode='lines',
    name='Normal Distribution (CLT)',
    line=dict(color='#2ca02c', width=2)  # Using a default green
))

fig_means.update_layout(
    title_text=f'Distribution of Sample Means (n={sample_size})',
    xaxis_title_text='Sample Mean',
    yaxis_title_text='Density',
    bargap=0.1,
    legend_title_text='Distribution'
)
# fig_means.show()  # In a real notebook environment

# Print comparison stats
print(f"Population Mean: {np.mean(population_data):.4f}")
print(f"Average of Sample Means: {np.mean(sample_means):.4f}")
print(f"Population Std Dev: {np.std(population_data):.4f}")
print(f"Std Dev of Sample Means (Observed): {np.std(sample_means):.4f}")
print(f"Theoretical Standard Error (Pop Std Dev / sqrt(n)): {standard_error:.4f}")
Histogram of the population data, showing its exponential shape. The true population mean is 10.
Distribution of 1000 sample means (orange histogram) drawn from the exponential population, with sample size n=30. The green line shows the theoretical normal distribution predicted by the Central Limit Theorem, centered at the population mean (10) with a standard deviation equal to the standard error.
As the simulation results show:
- The average of the 1000 sample means lands very close to the true population mean of 10.
- The observed standard deviation of the sample means closely matches the theoretical standard error, 10/√30 ≈ 1.83.
- The distribution of the sample means is approximately bell-shaped, even though the underlying population is heavily skewed.
This simulation demonstrates the power of the CLT. It allows us to use the properties of the normal distribution to make inferences about the population mean using just one sample mean, provided the sample size is adequate.
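If you want a numerical companion to the visual check, you can test the simulated sample means against the CLT-predicted normal distribution. The snippet below is an optional aside (not part of the original workflow) that continues from the variables defined above; it runs a Kolmogorov-Smirnov test using the theoretical mean and standard error:

# Optional: compare the sample means against the CLT-predicted normal distribution
ks_stat, p_value = stats.kstest(sample_means, 'norm', args=(theoretical_mean, standard_error))
print(f"KS statistic: {ks_stat:.4f}")
print(f"KS p-value: {p_value:.4f}")  # a large p-value is consistent with approximate normality

Keep in mind that with 1000 sample means the test is fairly sensitive, so it may still flag the mild skewness that remains at n = 30; increasing the sample size pushes the sampling distribution ever closer to normal.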
Now, let's take one sample and use it to estimate the population mean with a confidence interval. Typically, we won't know the population standard deviation (σ), so we use the sample standard deviation (s) as an estimate. When σ is unknown and the sample size is relatively small (often taken as n < 30, though the distinction blurs as n grows), the t-distribution is more appropriate than the normal (Z) distribution for finding the critical value. For n = 30 or larger, the t-distribution is already very close to the normal distribution, but we'll use it here for generality, via scipy.stats.
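For reference, the interval we're about to compute has the standard form

    CI = x̄ ± t(α/2, n−1) · s/√n

where x̄ is the sample mean, s is the sample standard deviation, n is the sample size, and t(α/2, n−1) is the critical value from the t-distribution with n−1 degrees of freedom. To see how close the t and Z critical values already are at n = 30, here is a small illustrative aside (not part of the main workflow):

# Compare the two-sided 95% critical values of the t-distribution (df=29) and the standard normal
import scipy.stats as stats

print(f"z critical (95%): {stats.norm.ppf(0.975):.4f}")        # approx 1.9600
print(f"t critical (df=29): {stats.t.ppf(0.975, df=29):.4f}")  # approx 2.0452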
Let's take the last sample we generated in the simulation loop above:
# Use the last sample generated in the simulation loop
last_sample = sample  # From the loop above

# Calculate sample statistics
sample_mean = np.mean(last_sample)
sample_std_dev = np.std(last_sample, ddof=1)  # ddof=1 gives the sample standard deviation
n = len(last_sample)
standard_error_sample = sample_std_dev / np.sqrt(n)

# Set the confidence level
confidence_level = 0.95
alpha = 1 - confidence_level

# Calculate the critical t-value
degrees_freedom = n - 1
critical_t = stats.t.ppf(1 - alpha / 2, df=degrees_freedom)  # ppf is the inverse CDF

# Calculate the margin of error
margin_of_error = critical_t * standard_error_sample

# Calculate the confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

print(f"Sample Mean (x_bar): {sample_mean:.4f}")
print(f"Sample Standard Deviation (s): {sample_std_dev:.4f}")
print(f"Standard Error (SE): {standard_error_sample:.4f}")
print(f"Degrees of Freedom: {degrees_freedom}")
print(f"Confidence Level: {confidence_level*100}%")
print(f"Critical t-value: {critical_t:.4f}")
print(f"Margin of Error: {margin_of_error:.4f}")
print(f"Confidence Interval: ({lower_bound:.4f}, {upper_bound:.4f})")

# Verification using scipy's built-in interval function
ci_scipy = stats.t.interval(confidence_level, df=degrees_freedom, loc=sample_mean, scale=standard_error_sample)
print(f"Confidence Interval (using scipy.stats.t.interval): ({ci_scipy[0]:.4f}, {ci_scipy[1]:.4f})")
Running this code will output the calculated sample statistics and the resulting 95% confidence interval. For instance, you might get:
Sample Mean (x_bar): 10.1325
Sample Standard Deviation (s): 9.8536
Standard Error (SE): 1.7990
Degrees of Freedom: 29
Confidence Level: 95.0%
Critical t-value: 2.0452
Margin of Error: 3.6794
Confidence Interval: (6.4531, 13.8119)
Confidence Interval (using scipy.stats.t.interval): (6.4531, 13.8119)
Interpretation: Based on this specific sample, we are 95% confident that the true mean of the population (which we know is 10) lies between approximately 6.45 and 13.81. Notice that our interval successfully captured the true population mean.
If we were to repeat this process with many different samples, we would find that about 95% of the calculated 95% confidence intervals would contain the true population mean.
To reinforce the interpretation of confidence intervals, let's simulate drawing multiple samples (say, 100) and calculate a 95% confidence interval for each. We can then plot these intervals and see how many capture the true population mean.
num_intervals = 100
intervals = []
means = []
population_mean = population_scale  # True mean = 10

np.random.seed(999)
for i in range(num_intervals):
    sample_i = np.random.choice(population_data, size=sample_size, replace=True)
    mean_i = np.mean(sample_i)
    std_i = np.std(sample_i, ddof=1)
    se_i = std_i / np.sqrt(sample_size)
    df_i = sample_size - 1
    # Use scipy's interval function directly
    ci_i = stats.t.interval(confidence_level, df=df_i, loc=mean_i, scale=se_i)
    intervals.append(ci_i)
    means.append(mean_i)

# Create the plot
fig_intervals = go.Figure()
captured_count = 0

for i in range(num_intervals):
    lower, upper = intervals[i]
    mean_val = means[i]
    captured = lower <= population_mean <= upper
    if captured:
        captured_count += 1
        line_color = '#2ca02c'  # Green if captured
    else:
        line_color = '#d62728'  # Red if not captured
    # Draw the interval line
    fig_intervals.add_trace(go.Scatter(
        x=[lower, upper],
        y=[i, i],
        mode='lines',
        line=dict(color=line_color, width=1),
        showlegend=False
    ))
    # Draw the sample mean point
    fig_intervals.add_trace(go.Scatter(
        x=[mean_val],
        y=[i],
        mode='markers',
        marker=dict(color=line_color, size=3),
        showlegend=False
    ))

# Add a dashed line for the true population mean
fig_intervals.add_vline(x=population_mean, line=dict(color='#1f77b4', width=2, dash='dash'))

fig_intervals.update_layout(
    title=f'{num_intervals} Confidence Intervals ({confidence_level*100}%)',
    xaxis_title='Value',
    yaxis_title='Sample Number',
    yaxis_showticklabels=False,
    xaxis_range=[min(ci[0] for ci in intervals) - 1, max(ci[1] for ci in intervals) + 1],  # Pad the range slightly
    height=400
)

# Add an annotation for the capture rate
capture_rate = captured_count / num_intervals
fig_intervals.add_annotation(
    x=population_mean + 2, y=num_intervals * 0.95,  # Position the annotation
    text=f"Capture Rate: {capture_rate:.2f} ({captured_count}/{num_intervals})",
    showarrow=False,
    font=dict(size=12, color="black"),
    bgcolor="rgba(255,255,255,0.7)"
)
# fig_intervals.show()
Each horizontal line represents a 95% confidence interval calculated from a different random sample (n=30) from the population. The dot on each line is the sample mean. The vertical dashed line is the true population mean (10). Green intervals successfully capture the true mean; red intervals do not. We expect about 95% of the intervals to be green.
This visualization makes the interpretation concrete: any single confidence interval either captures the true population parameter or it doesn't, but the procedure guarantees that, over many repetitions, about 95% of the 95% confidence intervals will contain the true value. The observed capture rate in the simulation (e.g., 94/100 = 94%) is typically close to the nominal confidence level of 95%.
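As a quick sanity check on that capture rate: if each interval independently captures the true mean with probability 0.95, the number of captures out of 100 follows a Binomial(100, 0.95) distribution. The short snippet below is an illustrative aside showing that a count like 94 sits comfortably within the expected fluctuation:

# How much should the capture count fluctuate around 95 out of 100?
import numpy as np
from scipy import stats

n_trials, p = 100, 0.95
print(f"Expected captures: {n_trials * p:.1f}")                       # 95.0
print(f"Std dev of captures: {np.sqrt(n_trials * p * (1 - p)):.2f}")  # approx 2.18
low, high = stats.binom.interval(0.95, n_trials, p)
print(f"Central 95% range: {low:.0f} to {high:.0f}")                  # roughly 91 to 99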
These practical exercises demonstrate how simulation can help build intuition for abstract statistical concepts like the Central Limit Theorem and confidence intervals. They also show how easily these calculations can be performed using standard Python libraries, making inferential statistics accessible for data analysis and machine learning tasks.