Now that we've established the framework for hypothesis testing, let's see how to carry out these tests using Python. The scipy.stats module provides a comprehensive set of functions for performing common statistical tests, making the implementation straightforward. We'll focus on the tests introduced earlier: t-tests, Chi-squared tests, and a brief look at ANOVA.
T-tests are primarily used to compare means. SciPy offers functions for the main variants. Remember, these tests generally assume that the data follows a normal distribution, although they are somewhat robust to violations, especially with larger sample sizes.
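Before applying a t-test, it can be worth checking the normality assumption, particularly with small samples. The following is a minimal sketch (the data here are purely illustrative) using the Shapiro-Wilk test from scipy.stats:
import numpy as np
from scipy import stats
# Illustrative sample; in practice, use your own measurements
sample = np.array([145, 155, 162, 148, 153, 160, 149, 151, 158, 147])
# Shapiro-Wilk test: the null hypothesis is that the data come from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(sample)
print(f"Shapiro-Wilk statistic: {shapiro_stat:.4f}, p-value: {shapiro_p:.4f}")
# A small p-value (e.g., < 0.05) suggests the normality assumption is questionable.
# With small samples the test has limited power, so treat it as a rough check alongside plots.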
This test checks if the mean of a single sample is significantly different from a known or hypothesized population mean (μ0). We use the scipy.stats.ttest_1samp function.
Let's say we have collected data on the processing time (in ms) for a specific type of query on our new system. We want to test if the average processing time is significantly different from a target benchmark of 150 ms.
import numpy as np
from scipy import stats
# Sample processing times (e.g., from 25 queries)
processing_times = np.array([145, 155, 162, 148, 153, 160, 149, 151, 158, 147,
                             154, 163, 150, 146, 157, 152, 149, 161, 156, 150,
                             148, 159, 155, 164, 151])
# Hypothesized population mean
mu_0 = 150
# Perform the one-sample t-test
t_statistic, p_value = stats.ttest_1samp(a=processing_times, popmean=mu_0)
print(f"Sample Mean: {np.mean(processing_times):.2f}")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpretation
alpha = 0.05
if p_value < alpha:
    print(f"Reject the null hypothesis (p={p_value:.4f}). The mean processing time is significantly different from {mu_0} ms.")
else:
    print(f"Fail to reject the null hypothesis (p={p_value:.4f}). There is not enough evidence to say the mean processing time is different from {mu_0} ms.")
The output provides the t-statistic and the p-value. The t-statistic measures how many standard errors the sample mean is away from the hypothesized mean. The p-value tells us the probability of observing a sample mean as extreme as, or more extreme than, the one calculated, assuming the null hypothesis is true. We compare the p-value to our chosen significance level (α) to make a decision.
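To connect the output to the underlying formula, the t-statistic can be reproduced by hand: it is the distance between the sample mean and μ0 measured in standard errors. A short sketch using the processing_times array and mu_0 defined above:
import numpy as np
# t = (sample mean - hypothesized mean) / (sample standard deviation / sqrt(n))
n = len(processing_times)
sample_mean = np.mean(processing_times)
sample_std = np.std(processing_times, ddof=1)   # ddof=1 gives the sample standard deviation
standard_error = sample_std / np.sqrt(n)
t_manual = (sample_mean - mu_0) / standard_error
print(f"Manually computed t-statistic: {t_manual:.4f}")  # matches the value from stats.ttest_1samp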
This test compares the means of two independent groups to see if they are significantly different. For example, we might compare the performance scores of users using two different interfaces (Group A vs. Group B). We use scipy.stats.ttest_ind.
import numpy as np
from scipy import stats
import plotly.graph_objects as go
# Sample performance scores for two independent groups
scores_A = np.array([85, 90, 78, 88, 92, 81, 87, 89, 80, 84])
scores_B = np.array([75, 82, 70, 79, 85, 72, 77, 81, 74, 78])
# Optional: Visualize the distributions
fig = go.Figure()
fig.add_trace(go.Box(y=scores_A, name='Group A', marker_color='#1f77b4')) # blue
fig.add_trace(go.Box(y=scores_B, name='Group B', marker_color='#ff7f0e')) # orange
fig.update_layout(title='Performance Scores by Group', yaxis_title='Score', height=400, width=500)
# fig.show() # In a notebook environment
# Perform the independent two-sample t-test
# By default, it assumes equal variances (Welch's t-test is performed if equal_var=False)
t_statistic, p_value = stats.ttest_ind(a=scores_A, b=scores_B, equal_var=True)
print(f"Mean Score Group A: {np.mean(scores_A):.2f}")
print(f"Mean Score Group B: {np.mean(scores_B):.2f}")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpretation
alpha = 0.05
if p_value < alpha:
    print(f"Reject the null hypothesis (p={p_value:.4f}). The mean scores of the two groups are significantly different.")
else:
    print(f"Fail to reject the null hypothesis (p={p_value:.4f}). There is not enough evidence to say the mean scores are different.")
Box plot comparing the distribution of performance scores for Group A and Group B. Group A appears to have higher scores on average.
The ttest_ind function returns the t-statistic and p-value. The equal_var parameter controls whether to assume equal variances between the two groups: equal_var=True (the default) performs Student's t-test, while equal_var=False performs Welch's t-test. Welch's t-test is often preferred because it does not require the assumption of equal variances.
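If you are unsure whether the equal-variance assumption is reasonable, one common approach is to check it with Levene's test and, when in doubt, use Welch's test. A brief sketch using scores_A and scores_B from the example above:
from scipy import stats
# Levene's test: the null hypothesis is that the two groups have equal variances
levene_stat, levene_p = stats.levene(scores_A, scores_B)
print(f"Levene's test p-value: {levene_p:.4f}")
# Welch's t-test: set equal_var=False so equal variances are not assumed
t_welch, p_welch = stats.ttest_ind(scores_A, scores_B, equal_var=False)
print(f"Welch's t-test: t={t_welch:.4f}, p={p_welch:.4f}")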
This test is used when the two samples are dependent or paired. This often occurs when measuring the same subject under two different conditions (e.g., before and after a treatment, or performance on task A vs. task B by the same person). We use scipy.stats.ttest_rel.
Imagine we measure the response time of 10 participants before and after they complete a training module. We want to know if the training significantly reduced their response time.
import numpy as np
from scipy import stats
# Response times (in seconds) before and after training for 10 participants
response_before = np.array([5.2, 4.8, 6.0, 5.5, 5.1, 5.8, 4.9, 5.4, 5.6, 5.0])
response_after = np.array([4.7, 4.5, 5.5, 5.0, 4.6, 5.2, 4.4, 4.9, 5.0, 4.7])
# Calculate the differences
differences = response_before - response_after
# Perform the paired t-test
t_statistic, p_value = stats.ttest_rel(a=response_before, b=response_after)
print(f"Mean Difference (Before - After): {np.mean(differences):.2f}")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpretation: ttest_rel returns a two-tailed p-value. For a directional hypothesis
# (e.g., response time decreased, i.e., before > after), halve the p-value and check the
# sign of the t-statistic. Recent SciPy versions also accept alternative='greater' to do this directly.
p_value_one_tailed = p_value / 2
alpha = 0.05
if p_value_one_tailed < alpha and t_statistic > 0:  # check direction for the one-tailed test
    print(f"Reject the null hypothesis (p_one_tailed={p_value_one_tailed:.4f}). Response time significantly decreased after training.")
else:
    print(f"Fail to reject the null hypothesis (p_one_tailed={p_value_one_tailed:.4f}). No significant decrease in response time observed.")
The paired t-test essentially performs a one-sample t-test on the differences between the paired observations, testing if the mean difference is significantly different from zero.
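You can verify this equivalence directly. A quick sketch using the differences array computed above:
from scipy import stats
# One-sample t-test on the paired differences against a hypothesized mean of zero
t_diff, p_diff = stats.ttest_1samp(differences, popmean=0)
print(f"One-sample t-test on differences: t={t_diff:.4f}, p={p_diff:.4f}")
# These values match the t-statistic and p-value returned by stats.ttest_rel above.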
Chi-squared (χ2) tests are used for categorical data. They help determine if observed frequencies differ significantly from expected frequencies or if there's an association between two categorical variables.
This test determines if a sample distribution of categorical data matches an expected distribution. For instance, does the distribution of user choices for different website layouts match a hypothesized distribution (e.g., equal preference)? We use scipy.stats.chisquare.
Suppose we expect users to choose between three layouts (A, B, C) with equal preference (1/3 each). We observe the choices of 150 users.
import numpy as np
from scipy import stats
# Observed frequencies of choices for layouts A, B, C
observed_frequencies = np.array([60, 50, 40]) # Total = 150
# Expected frequencies based on equal preference (150 users / 3 layouts)
expected_frequencies = np.array([50, 50, 50])
# Perform the Chi-squared goodness-of-fit test
chi2_statistic, p_value = stats.chisquare(f_obs=observed_frequencies, f_exp=expected_frequencies)
print(f"Observed Frequencies: {observed_frequencies}")
print(f"Expected Frequencies: {expected_frequencies}")
print(f"Chi-squared Statistic: {chi2_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpretation
alpha = 0.05
if p_value < alpha:
    print(f"Reject the null hypothesis (p={p_value:.4f}). The observed distribution significantly differs from the expected distribution.")
else:
    print(f"Fail to reject the null hypothesis (p={p_value:.4f}). The observed distribution is consistent with the expected distribution.")
The function returns the χ2 statistic and the p-value. A significant result suggests the observed pattern of choices is unlikely if the preferences were truly equal.
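The statistic itself is simply the sum over categories of the squared difference between observed and expected counts, divided by the expected counts. A short sketch reproducing it from the arrays above:
import numpy as np
# Chi-squared statistic by hand: sum of (observed - expected)^2 / expected
chi2_manual = np.sum((observed_frequencies - expected_frequencies) ** 2 / expected_frequencies)
print(f"Manually computed chi-squared statistic: {chi2_manual:.4f}")  # matches stats.chisquare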
This test checks whether two categorical variables are associated or independent. For example, is there an association between user segment (e.g., 'New', 'Returning') and preferred product category (e.g., 'Electronics', 'Clothing', 'Home')? We use scipy.stats.chi2_contingency.
We need a contingency table (cross-tabulation) of the observed frequencies for the two variables.
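In practice, the contingency table is usually built from raw records rather than typed in by hand. A minimal sketch (the column names and values are purely illustrative) using pandas.crosstab:
import pandas as pd
# Illustrative raw data: one row per user
raw_data = pd.DataFrame({
    'segment': ['New', 'Returning', 'New', 'Returning', 'New', 'Returning'],
    'category': ['Electronics', 'Home', 'Clothing', 'Electronics', 'Clothing', 'Home']
})
# Cross-tabulate user segment against preferred product category
contingency = pd.crosstab(raw_data['segment'], raw_data['category'])
print(contingency)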
import numpy as np
from scipy import stats
import pandas as pd
# Contingency table: Observed frequencies (e.g., User Segment vs. Product Category)
# Rows: New, Returning
# Columns: Electronics, Clothing, Home
observed_table = pd.DataFrame({
    'Electronics': [50, 80],
    'Clothing': [70, 60],
    'Home': [30, 90]
}, index=['New', 'Returning'])
print("Contingency Table (Observed Frequencies):")
print(observed_table)
# Perform the Chi-squared test of independence
chi2_statistic, p_value, dof, expected_table = stats.chi2_contingency(observed=observed_table)
print(f"\nChi-squared Statistic: {chi2_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("\nExpected Frequencies Table:")
print(pd.DataFrame(expected_table, index=observed_table.index, columns=observed_table.columns).round(2))
# Interpretation
alpha = 0.05
if p_value < alpha:
    print(f"\nReject the null hypothesis (p={p_value:.4f}). There is a significant association between user segment and preferred product category.")
else:
    print(f"\nFail to reject the null hypothesis (p={p_value:.4f}). There is no significant association found between the variables.")
chi2_contingency returns the χ2 statistic, p-value, degrees of freedom (dof), and the table of expected frequencies under the null hypothesis of independence. A significant p-value indicates that the two variables are likely related.
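For intuition, each expected count under independence is (row total × column total) / grand total. A short sketch reproducing the expected table from the observed_table defined above:
import numpy as np
# Each expected count is (row total * column total) / grand total
counts = observed_table.values
row_totals = counts.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = counts.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = counts.sum()
expected_manual = row_totals * col_totals / grand_total   # broadcasts to shape (2, 3)
print(np.round(expected_manual, 2))   # matches the expected frequencies from chi2_contingency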
Analysis of Variance (ANOVA) is used to compare the means of three or more groups. The simplest form is one-way ANOVA, which analyzes the effect of one categorical factor on a continuous response variable. For example, comparing the mean effectiveness score across three different ad campaigns. We use scipy.stats.f_oneway.
import numpy as np
from scipy import stats
# Effectiveness scores for three different ad campaigns
scores_campaign1 = np.array([7, 8, 6, 9, 7, 8])
scores_campaign2 = np.array([5, 6, 4, 5, 6, 5])
scores_campaign3 = np.array([9, 10, 8, 11, 9, 10])
# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(scores_campaign1, scores_campaign2, scores_campaign3)
print(f"Effectiveness Scores - Campaign 1 Mean: {np.mean(scores_campaign1):.2f}")
print(f"Effectiveness Scores - Campaign 2 Mean: {np.mean(scores_campaign2):.2f}")
print(f"Effectiveness Scores - Campaign 3 Mean: {np.mean(scores_campaign3):.2f}")
print(f"\nF-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpretation
alpha = 0.05
if p_value < alpha:
    print(f"\nReject the null hypothesis (p={p_value:.4f}). There is a significant difference in mean effectiveness scores among the campaigns.")
else:
    print(f"\nFail to reject the null hypothesis (p={p_value:.4f}). No significant difference found in mean effectiveness scores.")
The f_oneway function returns the F-statistic (a ratio of variance between groups to variance within groups) and the p-value. A significant result suggests that at least one group mean is different from the others. Note that ANOVA tells you if there's a difference, but not which specific groups differ. Post-hoc tests (like Tukey's HSD) are needed for pairwise comparisons if the ANOVA result is significant, as sketched below.
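As a follow-up to a significant ANOVA, newer SciPy releases provide scipy.stats.tukey_hsd for pairwise comparisons (in older environments, statsmodels' pairwise_tukeyhsd is an alternative). A hedged sketch using the campaign scores above:
from scipy import stats
# Tukey's HSD: all pairwise comparisons of group means with adjusted p-values
# (requires a reasonably recent SciPy release)
tukey_result = stats.tukey_hsd(scores_campaign1, scores_campaign2, scores_campaign3)
print(tukey_result)            # formatted table of pairwise mean differences and p-values
print(tukey_result.pvalue)     # matrix of pairwise p-values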
Regardless of the test performed, the core interpretation relies on the p-value: if the p-value is less than the chosen significance level (α), reject the null hypothesis; otherwise, fail to reject it.
Remember that "failing to reject H0" does not mean H0 is true, only that the data does not provide strong enough evidence against it at the chosen significance level.
These SciPy functions provide a powerful toolkit for implementing hypothesis tests in Python. Selecting the correct test depends on the type of data (continuous, categorical), the number of groups being compared, and whether the samples are independent or paired. Always consider the assumptions underlying each test before applying it to your data.