T-tests are workhorses for comparing the means of one or two groups, a common task in analyzing experiment results or model performance metrics. Here we use Python's SciPy library to perform both one-sample and two-sample t-tests on simulated data.

## Scenario Setup: Processing Times and A/B Testing

Imagine two common scenarios in machine learning or data analysis:

- **One-Sample Scenario:** We have developed a new image processing algorithm. We know the standard algorithm takes an average of 150 milliseconds (ms) per image. We run our new algorithm on a sample of 30 images and want to know if its average processing time is significantly different from the standard 150 ms benchmark.
- **Two-Sample Scenario:** We are running an A/B test on a website. Group A sees the original design, and Group B sees a new design. We measure the session duration (in minutes) for users in each group. We want to determine if there's a statistically significant difference in the average session duration between the two designs.

## Generating Sample Data

Let's simulate some data for these scenarios using NumPy. For the one-sample test, we'll simulate 30 processing times. For the two-sample test, we'll simulate session durations for 50 users in Group A and 55 users in Group B.

```python
import numpy as np
from scipy import stats

# Seed for reproducibility
np.random.seed(42)

# --- One-Sample Scenario Data ---
# Standard processing time benchmark (population mean under H0)
benchmark_time = 150

# Sample data for our new algorithm (30 images)
# Let's assume our algorithm is slightly faster on average, e.g., around 145 ms
sample_processing_times = np.random.normal(loc=145, scale=12, size=30)

print(f"Sample Mean Processing Time: {np.mean(sample_processing_times):.2f} ms")

# --- Two-Sample Scenario Data ---
# Group A (Original Design) - 50 users
# Assume average session duration is 5 minutes with some variance
group_a_durations = np.random.normal(loc=5.0, scale=1.5, size=50)

# Group B (New Design) - 55 users
# Assume average session duration is potentially higher, e.g., 5.8 minutes
group_b_durations = np.random.normal(loc=5.8, scale=1.7, size=55)

print(f"Group A Mean Duration: {np.mean(group_a_durations):.2f} mins")
print(f"Group B Mean Duration: {np.mean(group_b_durations):.2f} mins")
```

This setup gives us realistic-looking data to work with. `np.random.normal` generates data following a normal distribution with a specified mean (`loc`) and standard deviation (`scale`).

## Applying the One-Sample T-test

For the first scenario, we want to test if our sample mean is significantly different from the benchmark of 150 ms.

- **Null Hypothesis ($H_0$):** The true mean processing time of the new algorithm is equal to 150 ms ($\mu = 150$).
- **Alternative Hypothesis ($H_1$):** The true mean processing time of the new algorithm is different from 150 ms ($\mu \neq 150$). This is a two-tailed test.

We use the `scipy.stats.ttest_1samp` function. It takes the sample data and the population mean under the null hypothesis as arguments.

```python
# Perform the one-sample t-test
t_statistic_1samp, p_value_1samp = stats.ttest_1samp(
    a=sample_processing_times,
    popmean=benchmark_time
)

print("One-Sample T-test Results:")
print(f"  T-statistic: {t_statistic_1samp:.4f}")
print(f"  P-value: {p_value_1samp:.4f}")

# Interpretation
alpha = 0.05  # Significance level
if p_value_1samp < alpha:
    print(f"  Conclusion: Reject H0. The mean processing time "
          f"({np.mean(sample_processing_times):.2f} ms) is significantly different "
          f"from {benchmark_time} ms (p={p_value_1samp:.4f}).")
else:
    print(f"  Conclusion: Fail to reject H0. There is not enough evidence to say "
          f"the mean processing time is different from {benchmark_time} ms "
          f"(p={p_value_1samp:.4f}).")
```
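Note that `ttest_1samp` performs a two-tailed test, matching $H_1: \mu \neq 150$. If the question were directional (for example, "is the new algorithm *faster* than the benchmark?"), newer SciPy releases (1.6 and later) accept an `alternative` argument. A minimal sketch, reusing the sample generated above:

```python
# Optional: a one-sided variant of the same test (requires SciPy >= 1.6).
# H1 is now directional: the true mean processing time is LESS than the benchmark.
t_stat_less, p_value_less = stats.ttest_1samp(
    a=sample_processing_times,
    popmean=benchmark_time,
    alternative="less"  # tests mu < 150 instead of mu != 150
)
print(f"One-sided P-value (mu < {benchmark_time}): {p_value_less:.4f}")
```

When the sample mean falls on the hypothesized side of the benchmark, the one-sided p-value is half the two-sided value.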
**Interpreting the Output:** The function returns the calculated t-statistic and the corresponding p-value.

- **T-statistic:** Measures how many standard errors the sample mean is away from the hypothesized population mean. A larger absolute value suggests a greater difference.
- **P-value:** The probability of observing a sample mean as extreme as (or more extreme than) the one calculated, assuming the null hypothesis is true.

If the p-value is less than our chosen significance level $\alpha$ (commonly 0.05), we reject the null hypothesis ($H_0$). This suggests the observed difference is statistically significant. Otherwise, we fail to reject $H_0$. Based on our simulated data (which was centered around 145 ms), we likely find a significant difference from 150 ms.

## Applying the Two-Sample Independent T-test

Now, let's address the A/B test scenario. We want to compare the means of two independent groups (Group A and Group B).

- **Null Hypothesis ($H_0$):** The true mean session duration for Group A is equal to the true mean session duration for Group B ($\mu_A = \mu_B$).
- **Alternative Hypothesis ($H_1$):** The true mean session durations are different ($\mu_A \neq \mu_B$). Again, a two-tailed test.

We use the `scipy.stats.ttest_ind` function. It takes the data for both groups as input. An important parameter is `equal_var`. By default it is set to `True`, which performs the standard Student's t-test under the assumption that the two groups have equal variances. Setting `equal_var=False` performs Welch's t-test, which does not assume equal variances between the two groups. Welch's test is generally the safer choice unless you have strong reasons to believe the variances are equal.

```python
# Perform the two-sample independent t-test (Welch's t-test)
t_statistic_ind, p_value_ind = stats.ttest_ind(
    a=group_a_durations,
    b=group_b_durations,
    equal_var=False  # Perform Welch's t-test (no equal-variance assumption)
)

print("\nTwo-Sample Independent T-test Results:")
print(f"  T-statistic: {t_statistic_ind:.4f}")
print(f"  P-value: {p_value_ind:.4f}")

# Interpretation
alpha = 0.05  # Significance level
if p_value_ind < alpha:
    print(f"  Conclusion: Reject H0. There is a statistically significant difference "
          f"in mean session duration between Group A ({np.mean(group_a_durations):.2f} mins) "
          f"and Group B ({np.mean(group_b_durations):.2f} mins) (p={p_value_ind:.4f}).")
else:
    print(f"  Conclusion: Fail to reject H0. There is not enough evidence to claim a "
          f"significant difference in mean session duration between the groups "
          f"(p={p_value_ind:.4f}).")
```

## Visualizing the A/B Test Data

A quick visualization can help understand the data being compared. Box plots are effective for showing the distribution and central tendency of each group.
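Such a plot can be drawn directly from the simulated arrays. A minimal sketch, assuming `matplotlib` is installed (it is not otherwise used in this section):

```python
# Minimal sketch: box plots of session durations for the two groups.
# Assumes matplotlib is available; group_a_durations and group_b_durations
# come from the data-generation step above.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3.5))
ax.boxplot([group_a_durations, group_b_durations])
ax.set_xticks([1, 2])
ax.set_xticklabels(["Group A", "Group B"])
ax.set_xlabel("Group")
ax.set_ylabel("Session Duration (minutes)")
ax.set_title("Session Duration Comparison (A/B Test)")
plt.tight_layout()
plt.show()
```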
In the resulting box plots comparing Group A (original design) and Group B (new design), the central line of each box is the median, the box represents the interquartile range (IQR), and the whiskers typically extend to 1.5 times the IQR.

**Interpreting the Two-Sample Results:** Similar to the one-sample test, we compare the p-value to our significance level $\alpha$. If $p < \alpha$, we conclude that the difference in means observed between Group A and Group B is statistically significant. Given how we generated the data (Group B mean set to 5.8 vs. Group A mean set to 5.0), we expect the test to detect this difference.

## Assumptions and Approaches

- **Assumptions:** T-tests rely on certain assumptions. For the one-sample t-test, the data should be approximately normally distributed, especially for small sample sizes ($n < 30$). For the two-sample independent t-test, both groups should be approximately normally distributed, and the samples should be independent. Welch's t-test relaxes the assumption of equal variances. While t-tests are relatively robust to moderate violations of normality with larger sample sizes (thanks to the Central Limit Theorem), it's good practice to check these assumptions in your analysis (e.g., using normality tests or visualizations like Q-Q plots).
- **Paired T-test:** If the two samples were related (e.g., measuring the same subject before and after an intervention), we would use a paired t-test (`scipy.stats.ttest_rel`), which analyzes the differences between paired observations. A minimal sketch appears at the end of this section.
- **Effect Size:** A significant p-value tells you there's likely a real difference, but not how large or practically important that difference is. Calculating effect size measures (like Cohen's d) complements the p-value by quantifying the magnitude of the difference; a second sketch at the end of this section shows one way to compute it.

This practice section demonstrated how to apply one-sample and two-sample t-tests using SciPy. These tests are fundamental tools for comparing means, enabling data-driven decisions in contexts ranging from algorithm evaluation to A/B testing. Remember to interpret the p-value correctly within the context of your hypotheses and chosen significance level.
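As promised above, here is a minimal sketch of the paired t-test. The before/after arrays are hypothetical, simulated only to demonstrate `scipy.stats.ttest_rel`; they are not part of the scenarios analyzed earlier.

```python
# Minimal sketch of a paired t-test (scipy.stats.ttest_rel).
# Hypothetical before/after processing times for the SAME 30 images.
rng = np.random.default_rng(0)
before_times = rng.normal(loc=150, scale=12, size=30)
after_times = before_times - rng.normal(loc=5, scale=3, size=30)  # paired improvement

t_stat_rel, p_value_rel = stats.ttest_rel(before_times, after_times)
print(f"Paired T-test: t={t_stat_rel:.4f}, p={p_value_rel:.4f}")
```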
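And a minimal sketch for the effect size. SciPy does not ship a Cohen's d function, so `cohens_d` below is our own illustrative helper, using the common pooled-standard-deviation formulation for two independent samples.

```python
# Minimal sketch: Cohen's d for two independent samples (pooled SD).
# cohens_d is an illustrative helper, not a SciPy function.
def cohens_d(x, y):
    nx, ny = len(x), len(y)
    # Pooled variance from the two sample variances (ddof=1)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

d = cohens_d(group_b_durations, group_a_durations)
print(f"Cohen's d (Group B vs. Group A): {d:.2f}")
# Common rule of thumb: ~0.2 small, ~0.5 medium, ~0.8 large
```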