Now that we've discussed the theoretical underpinnings of hypothesis testing, including null and alternative hypotheses, error types, and p-values, let's put this knowledge into practice. T-tests are workhorses for comparing the means of one or two groups, a common task in analyzing experiment results or model performance metrics. In this section, we'll use Python's SciPy library to perform both one-sample and two-sample t-tests on simulated data.
Imagine two common scenarios in machine learning or data analysis:

1. One-sample: You've developed a new image-processing algorithm and want to know whether its mean processing time differs from an established benchmark of 150 ms.
2. Two-sample (A/B test): You're comparing two website designs and want to know whether mean session duration differs between users shown the original design (Group A) and users shown the new design (Group B).

Let's simulate some data for these scenarios using NumPy. For the one-sample test, we'll simulate 30 processing times. For the two-sample test, we'll simulate session durations for 50 users in Group A and 55 users in Group B.
import numpy as np
from scipy import stats
# Seed for reproducibility
np.random.seed(42)
# --- One-Sample Scenario Data ---
# Standard processing time benchmark (population mean under H0)
benchmark_time = 150
# Sample data for our new algorithm (30 images)
# Let's assume our algorithm is slightly faster on average, e.g., around 145ms
sample_processing_times = np.random.normal(loc=145, scale=12, size=30)
print(f"Sample Mean Processing Time: {np.mean(sample_processing_times):.2f} ms")
# --- Two-Sample Scenario Data ---
# Group A (Original Design) - 50 users
# Assume average session duration is 5 minutes with some variance
group_a_durations = np.random.normal(loc=5.0, scale=1.5, size=50)
# Group B (New Design) - 55 users
# Assume average session duration is potentially higher, e.g., 5.8 minutes
group_b_durations = np.random.normal(loc=5.8, scale=1.7, size=55)
print(f"Group A Mean Duration: {np.mean(group_a_durations):.2f} mins")
print(f"Group B Mean Duration: {np.mean(group_b_durations):.2f} mins")
This setup gives us realistic-looking data to work with. np.random.normal generates data following a normal distribution with a specified mean (loc) and standard deviation (scale).
For the first scenario, we want to test if our sample mean is significantly different from the benchmark of 150ms.
We use the scipy.stats.ttest_1samp function. It takes the sample data and the population mean under the null hypothesis as arguments.
# Perform the one-sample t-test
t_statistic_1samp, p_value_1samp = stats.ttest_1samp(
    a=sample_processing_times,
    popmean=benchmark_time
)
print(f"One-Sample T-test Results:")
print(f" T-statistic: {t_statistic_1samp:.4f}")
print(f" P-value: {p_value_1samp:.4f}")
# Interpretation
alpha = 0.05 # Significance level
if p_value_1samp < alpha:
    print(f" Conclusion: Reject H0. The mean processing time ({np.mean(sample_processing_times):.2f} ms) is significantly different from {benchmark_time} ms (p={p_value_1samp:.4f}).")
else:
    print(f" Conclusion: Fail to reject H0. There is not enough evidence to say the mean processing time is different from {benchmark_time} ms (p={p_value_1samp:.4f}).")
Interpreting the Output: The function returns the calculated t-statistic and the corresponding p-value.
If the p-value is less than our chosen significance level α (commonly 0.05), we reject the null hypothesis (H0). This suggests the observed difference is statistically significant. Otherwise, we fail to reject H0. Based on our simulated data (which was centered around 145ms), we likely find a significant difference from 150ms.
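Since our research question could also be directional (is the new algorithm faster, not merely different?), a one-sided test is sometimes the better fit. As a sketch, assuming SciPy 1.6 or later, where ttest_1samp accepts an alternative parameter:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
benchmark_time = 150
sample_processing_times = np.random.normal(loc=145, scale=12, size=30)

# One-sided test: H1 states the mean is LESS than the benchmark
t_stat, p_one_sided = stats.ttest_1samp(
    a=sample_processing_times,
    popmean=benchmark_time,
    alternative="less",  # requires SciPy >= 1.6
)
print(f"One-sided p-value: {p_one_sided:.4f}")

# Compare with the default two-sided test: when the t-statistic falls in
# the hypothesized direction, the one-sided p-value is half the two-sided one
t_stat_2s, p_two_sided = stats.ttest_1samp(sample_processing_times, benchmark_time)
print(f"Two-sided p-value: {p_two_sided:.4f}")
```

Use a one-sided alternative only when the direction was specified before looking at the data; otherwise the two-sided default is the honest choice.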
Now, let's address the A/B test scenario. We want to compare the means of two independent groups (Group A and Group B).
We use the scipy.stats.ttest_ind function. It takes the data for both groups as input. An important parameter is equal_var. By default it is set to True, which performs the standard Student's t-test assuming equal variances. Setting equal_var=False instead performs Welch's t-test, which does not assume equal variances between the two groups. This is generally safer unless you have strong reasons to believe the variances are equal.
# Perform the two-sample independent t-test (Welch's t-test via equal_var=False)
t_statistic_ind, p_value_ind = stats.ttest_ind(
    a=group_a_durations,
    b=group_b_durations,
    equal_var=False  # Perform Welch's t-test
)
print(f"\nTwo-Sample Independent T-test Results:")
print(f" T-statistic: {t_statistic_ind:.4f}")
print(f" P-value: {p_value_ind:.4f}")
# Interpretation
alpha = 0.05 # Significance level
if p_value_ind < alpha:
    print(f" Conclusion: Reject H0. There is a statistically significant difference in mean session duration between Group A ({np.mean(group_a_durations):.2f} mins) and Group B ({np.mean(group_b_durations):.2f} mins) (p={p_value_ind:.4f}).")
else:
    print(f" Conclusion: Fail to reject H0. There is not enough evidence to claim a significant difference in mean session duration between the groups (p={p_value_ind:.4f}).")
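If you want an empirical check on the equal-variances question before choosing equal_var, a common preliminary step is Levene's test. A minimal sketch using scipy.stats.levene (the groups are regenerated here so the snippet stands alone; the exact numbers will differ slightly from a run of the full script above):

```python
import numpy as np
from scipy import stats

np.random.seed(42)
group_a_durations = np.random.normal(loc=5.0, scale=1.5, size=50)
group_b_durations = np.random.normal(loc=5.8, scale=1.7, size=55)

# Levene's test: H0 states the two groups have equal variances
levene_stat, levene_p = stats.levene(group_a_durations, group_b_durations)
print(f"Levene statistic: {levene_stat:.4f}, p-value: {levene_p:.4f}")

if levene_p < 0.05:
    print("Variances appear unequal: use Welch's t-test (equal_var=False).")
else:
    print("No strong evidence of unequal variances.")
```

Even when Levene's test does not reject, Welch's t-test remains a safe default, since it loses very little power when the variances happen to be equal.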
Visualizing the A/B Test Data
A quick visualization can help understand the data being compared. Box plots are effective for showing the distribution and central tendency of each group.
[Box plot: Session Duration Comparison (A/B Test); x-axis: Group (Group A, Group B), y-axis: Session Duration (minutes)]
Box plots showing the distribution of session durations for Group A (original design) and Group B (new design). The central line is the median, the box represents the interquartile range (IQR), and whiskers typically extend to 1.5 times the IQR.
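If you're working outside an interactive charting environment, a comparable figure can be drawn with matplotlib. A self-contained sketch (the group arrays are regenerated here, and the output filename is our choice):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for saving to a file
import matplotlib.pyplot as plt

np.random.seed(42)
group_a_durations = np.random.normal(loc=5.0, scale=1.5, size=50)
group_b_durations = np.random.normal(loc=5.8, scale=1.7, size=55)

fig, ax = plt.subplots(figsize=(6, 3.5))
ax.boxplot([group_a_durations, group_b_durations])
ax.set_xticklabels(["Group A", "Group B"])
ax.set_title("Session Duration Comparison (A/B Test)")
ax.set_xlabel("Group")
ax.set_ylabel("Session Duration (minutes)")
fig.tight_layout()
fig.savefig("ab_test_boxplot.png")
```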
Interpreting the Two-Sample Results: Similar to the one-sample test, we compare the p-value to our significance level α. If p<α, we conclude that the difference in means observed between Group A and Group B is statistically significant. Given how we generated the data (Group B mean set to 5.8 vs. Group A mean set to 5.0), we expect the test to detect this difference.
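Statistical significance alone says nothing about how large the difference is, especially with big samples where tiny differences become significant. A common complement is Cohen's d, a standardized effect size; a minimal sketch using the pooled standard deviation (the helper name cohens_d is ours, not part of SciPy):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    # Pooled variance from unbiased sample variances (ddof=1)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

np.random.seed(42)
group_a = np.random.normal(loc=5.0, scale=1.5, size=50)
group_b = np.random.normal(loc=5.8, scale=1.7, size=55)

d = cohens_d(group_b, group_a)
print(f"Cohen's d (B vs A): {d:.3f}")
# Rough conventions: |d| around 0.2 is small, 0.5 medium, 0.8 large
```

Reporting an effect size alongside the p-value gives stakeholders a sense of practical, not just statistical, significance.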
Note that both tests above assume independent observations. When the two sets of measurements are linked, such as before-and-after measurements on the same subjects, the appropriate tool is the paired t-test (scipy.stats.ttest_rel), which analyzes the differences between paired observations.

This practice section demonstrated how to apply one-sample and two-sample t-tests using SciPy. These tests are fundamental tools for comparing means, enabling data-driven decisions in contexts ranging from algorithm evaluation to A/B testing. Remember to interpret the p-value correctly within the context of your hypotheses and chosen significance level.
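As a sketch of the paired case, here is scipy.stats.ttest_rel applied to hypothetical before/after latencies for the same 25 requests (the data and variable names are invented for illustration):

```python
import numpy as np
from scipy import stats

np.random.seed(0)
# Hypothetical latencies (ms) for the same 25 requests, before and after an optimization
before = np.random.normal(loc=200, scale=20, size=25)
after = before - np.random.normal(loc=8, scale=5, size=25)  # correlated improvement

t_stat_rel, p_value_rel = stats.ttest_rel(before, after)
print(f"Paired T-statistic: {t_stat_rel:.4f}, P-value: {p_value_rel:.4f}")

# The paired test is equivalent to a one-sample t-test
# on the pairwise differences against a mean of 0
diffs = before - after
t_check, p_check = stats.ttest_1samp(diffs, popmean=0)
print(f"One-sample on differences: t={t_check:.4f}, p={p_check:.4f}")
```

The equivalence shown in the last lines is why pairing helps: by testing differences, the per-subject variability shared by both measurements cancels out.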
© 2025 ApX Machine Learning