After setting up the framework for hypothesis testing, understanding errors, and interpreting p-values, we can now look at specific statistical tests. One of the most common families of tests used for comparing means is the t-test.
T-tests are particularly useful when you're working with sample data and you don't know the standard deviation of the entire population. This is a very frequent scenario in real-world data analysis and machine learning. Instead of the population standard deviation, t-tests rely on the sample standard deviation. They are based on the Student's t-distribution, which is similar in shape to the normal distribution but has heavier tails. This means it accounts for the additional uncertainty introduced by estimating the population standard deviation from the sample, especially when sample sizes are small. As the sample size increases, the t-distribution approaches the normal distribution.
[Figure] Comparison of the standard normal distribution (Z) with Student's t-distributions for 2 and 10 degrees of freedom (df). Note the heavier tails of the t-distributions, especially at lower df, reflecting the greater uncertainty in small samples.
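To make the heavier tails concrete, here is a minimal sketch (assuming SciPy is installed) that compares the probability of landing more than 2 standard deviations out under t-distributions of increasing df against the standard normal. The t-tail probabilities shrink toward the normal value as df grows.

```python
# Minimal sketch: tail probability P(X > 2) under Student's t vs. the
# standard normal, illustrating the heavier tails at low df.
from scipy import stats

for df in (2, 10, 30):
    print(f"t, df={df:>2}: P(T > 2) = {stats.t.sf(2.0, df=df):.4f}")

print(f"normal:    P(Z > 2) = {stats.norm.sf(2.0):.4f}")
```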
There are three main types of t-tests, each suited for a different comparison scenario:
One-Sample T-test
This test compares the mean of a single sample ($\bar{x}$) to a known or hypothesized population mean ($\mu_0$). It helps answer questions like: "Is the average performance score of my model significantly different from the required benchmark of 0.85?"
Hypotheses:
Null (H0): The population mean is equal to the hypothesized value ($H_0: \mu = \mu_0$).
Alternative (H1): The population mean is different from ($H_1: \mu \neq \mu_0$), greater than ($H_1: \mu > \mu_0$), or less than ($H_1: \mu < \mu_0$) the hypothesized value.
Test Statistic: The t-statistic measures how many standard errors the sample mean is away from the hypothesized mean.
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
where $\bar{x}$ is the sample mean, $\mu_0$ is the hypothesized population mean, $s$ is the sample standard deviation, and $n$ is the sample size.
Degrees of Freedom (df): $df = n - 1$.
Assumptions: The data should be approximately normally distributed (especially important for small n), and observations should be independent.
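A minimal sketch of this test in Python, assuming SciPy is available; the scores array is hypothetical model-performance data tested against the 0.85 benchmark from the example above. The manual calculation and scipy.stats.ttest_1samp should agree.

```python
# Minimal sketch of a one-sample t-test; `scores` is hypothetical data.
import numpy as np
from scipy import stats

scores = np.array([0.82, 0.88, 0.84, 0.86, 0.83, 0.87, 0.85, 0.81])
mu_0 = 0.85  # hypothesized population mean (the required benchmark)

# Manual computation, mirroring t = (x_bar - mu_0) / (s / sqrt(n))
x_bar = scores.mean()
s = scores.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)
n = len(scores)
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))

# SciPy performs the same computation and also returns the two-sided p-value
t_stat, p_value = stats.ttest_1samp(scores, popmean=mu_0)
print(f"manual t = {t_manual:.4f}, scipy t = {t_stat:.4f}, p = {p_value:.4f}")
```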
Two-Sample T-test (Independent Samples)
This test compares the means of two independent groups (Group 1 and Group 2) to see if there is a statistically significant difference between them. This is very common in A/B testing, for example, comparing the conversion rates of two website versions (Group A vs. Group B).
Hypotheses:
Null (H0): The means of the two populations are equal ($H_0: \mu_1 = \mu_2$).
Alternative (H1): The means are different ($H_1: \mu_1 \neq \mu_2$), or one is greater/less than the other ($H_1: \mu_1 > \mu_2$ or $H_1: \mu_1 < \mu_2$).
Test Statistic: The calculation depends on whether you assume the variances of the two groups are equal.
Equal Variances (Student's t-test): Uses a pooled standard deviation. The degrees of freedom are $df = n_1 + n_2 - 2$.
Unequal Variances (Welch's t-test): Does not assume equal variances and adjusts the degrees of freedom using the Welch-Satterthwaite equation. This test is often preferred in practice as the assumption of equal variances can be hard to verify and Welch's test performs well even when variances are similar. Most software packages default to or provide an option for Welch's test.
Assumptions: Both samples should be approximately normally distributed, the samples must be independent, and data should be continuous. For Student's t-test, equal variances are also assumed.
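The following sketch runs both variants on two hypothetical, randomly generated groups using scipy.stats.ttest_ind; passing equal_var=False selects Welch's test.

```python
# Minimal sketch of an independent two-sample t-test on hypothetical
# A/B-test data with unequal group sizes and variances.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.10, scale=0.03, size=50)  # version A metric
group_b = rng.normal(loc=0.12, scale=0.05, size=60)  # version B metric

# Student's t-test: assumes equal variances, df = n1 + n2 - 2
t_student, p_student = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test: no equal-variance assumption; often the safer default
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.4f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.4f}")
```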
Paired Sample T-test (Dependent Samples)
This test is used when the measurements come in pairs, meaning each observation in one sample is directly related to an observation in the other sample. Common scenarios include measuring the same subjects before and after an intervention (e.g., model performance before and after a feature update) or comparing two different treatments applied to the same subject. It tests if the mean difference between the paired observations is significantly different from zero.
Hypotheses:
Null (H0): The mean difference between paired observations is zero ($H_0: \mu_d = 0$).
Alternative (H1): The mean difference is not zero ($H_1: \mu_d \neq 0$), positive ($H_1: \mu_d > 0$), or negative ($H_1: \mu_d < 0$).
Test Statistic: Calculated on the differences ($d_i$) between the paired observations.
$$t = \frac{\bar{d} - 0}{s_d / \sqrt{n}}$$
where $\bar{d}$ is the mean of the differences, $s_d$ is the standard deviation of the differences, and $n$ is the number of pairs.
Degrees of Freedom (df): $df = n - 1$.
Assumptions: The differences between the pairs should be approximately normally distributed, and the pairs themselves should be independent of one another (the two measurements within each pair are, by design, dependent).
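A minimal sketch with hypothetical before/after scores; note that a paired t-test is equivalent to a one-sample t-test on the per-pair differences, which the example verifies with scipy.stats.ttest_rel and scipy.stats.ttest_1samp.

```python
# Minimal sketch of a paired t-test; before/after are hypothetical
# performance scores for the same subjects before and after a change.
import numpy as np
from scipy import stats

before = np.array([0.81, 0.79, 0.84, 0.77, 0.82, 0.80, 0.78, 0.85])
after  = np.array([0.84, 0.80, 0.86, 0.79, 0.85, 0.81, 0.80, 0.88])

# Paired t-test on (after - before)
t_paired, p_paired = stats.ttest_rel(after, before)

# Equivalent: one-sample t-test of the differences against zero
diffs = after - before
t_diff, p_diff = stats.ttest_1samp(diffs, popmean=0.0)

print(f"paired:      t = {t_paired:.3f}, p = {p_paired:.4f}")
print(f"one-sample:  t = {t_diff:.3f}, p = {p_diff:.4f}")  # identical results
```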
Making Decisions with T-tests
Regardless of the type, the process is similar:
Calculate the t-statistic from your sample data.
Determine the degrees of freedom.
Using the t-statistic and degrees of freedom, find the corresponding p-value. This p-value represents the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated, assuming the null hypothesis is true.
Compare the p-value to your chosen significance level ($\alpha$). If $p \leq \alpha$, you reject the null hypothesis (H0) in favor of the alternative hypothesis (H1). If $p > \alpha$, you fail to reject the null hypothesis. The sketch below walks through this decision step.
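With a hypothetical t-statistic and degrees of freedom, the decision steps above translate into a few lines using SciPy's t-distribution:

```python
# Minimal sketch of the decision step: turn a t-statistic and its degrees
# of freedom into a two-sided p-value and compare it to alpha.
from scipy import stats

t_stat = 2.31  # hypothetical t-statistic from one of the tests above
df = 24        # hypothetical degrees of freedom
alpha = 0.05   # chosen significance level

# Two-sided p-value: probability of a result at least as extreme as |t_stat|
p_value = 2 * stats.t.sf(abs(t_stat), df=df)

if p_value <= alpha:
    print(f"p = {p_value:.4f} <= {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} > {alpha}: fail to reject H0")
```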
T-tests provide a robust way to compare means when population standard deviations are unknown. Understanding which type of t-test to apply based on your data structure (one sample, two independent samples, or paired samples) is essential for drawing valid conclusions from your experiments and data analyses. We will see how to perform these tests efficiently using Python libraries in a later section.