Okay, you've set up your null hypothesis (H0) and alternative hypothesis (H1). You understand that when making a decision based on your sample data, you might make a Type I error (rejecting a true H0) or a Type II error (failing to reject a false H0). But how do we actually decide whether the evidence from our data is strong enough to reject H0? This is where the p-value comes in.
The p-value is a probability that measures the strength of evidence against the null hypothesis. Formally:
The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis (H0) is correct.
Think of it this way: you perform your experiment or collect your data, calculate a test statistic (like a t-score or chi-squared value, which we'll cover soon), and then ask: "If the null hypothesis were actually true (e.g., the new model is not better than the old one), how likely would it be to see a result this extreme, or even more extreme, just by random chance?" That likelihood is the p-value.
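As a concrete sketch (the coin-flip scenario here is my own illustration, not an example from this series): suppose you flip a coin 10 times to test H0 "the coin is fair" and observe 8 heads. The p-value is the probability of a result at least that extreme under H0, which we can compute from the binomial distribution:

```python
from scipy import stats

n, heads = 10, 8
# Under H0 (fair coin), the number of heads follows Binomial(10, 0.5).
# One-tailed probability of seeing 8 or more heads: P(X >= 8)
p_upper = stats.binom.sf(heads - 1, n, 0.5)
# Two-tailed p-value: count results at least this extreme in either direction
p_value = 2 * p_upper
print(p_value)  # 0.109375
```

A p-value of about 0.11 means that even a fair coin would produce a result this lopsided roughly 11% of the time, so 8 heads out of 10 is not, by itself, strong evidence that the coin is biased.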
The interpretation hinges on how small the p-value is:

- A small p-value means the observed result would be unlikely if H0 were true, so the data provide evidence against the null hypothesis.
- A large p-value means the observed result is reasonably consistent with H0, so the data do not provide strong evidence against it.
To make a formal decision, we compare the p-value to a pre-determined threshold called the significance level, denoted by α (alpha). This α is the probability of making a Type I error that we are willing to tolerate. Common choices for α are 0.05 (5%), 0.01 (1%), or 0.10 (10%).
The decision rule is straightforward:

- If p-value ≤ α: reject H0. The result is called "statistically significant".
- If p-value > α: fail to reject H0. The evidence is insufficient to reject the null hypothesis.
Important: Notice we say "fail to reject H0", not "accept H0". Hypothesis testing is designed to determine whether there is enough evidence to reject the null hypothesis, not to prove it is true. A large p-value simply means the current data do not provide strong evidence against H0: perhaps there truly is no effect, or perhaps the test lacked the sensitivity (e.g., sample size) to detect one.
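In code, applying the decision rule is a simple comparison of the p-value against α. A minimal sketch, using hypothetical p-values:

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value to the significance level alpha."""
    if p_value <= alpha:
        return "reject H0"        # result is statistically significant
    return "fail to reject H0"    # insufficient evidence; H0 is NOT proven true

# hypothetical p-values from two different tests
print(decide(0.032))  # reject H0
print(decide(0.21))   # fail to reject H0
```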
Calculating the exact p-value involves comparing your calculated test statistic (derived from your sample data) to the known probability distribution of that statistic assuming the null hypothesis is true.
For example, in a t-test, you calculate a t-score. The p-value is the area under the t-distribution curve that is more extreme than your calculated t-score.
The shaded areas represent the p-value (split between the two tails for a two-tailed test). It is the probability of observing a test statistic as extreme as or more extreme than the calculated statistic (vertical dashed lines at the observed statistic), assuming H0 is true.
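For instance, with a hypothetical observed t-score of 2.3 and 18 degrees of freedom, the two-tailed p-value is the total area under the t-distribution beyond ±2.3. A sketch using scipy.stats:

```python
from scipy import stats

t_obs = 2.3   # hypothetical observed t-score
df = 18       # hypothetical degrees of freedom

# Area in the upper tail beyond t_obs, via the survival function (1 - CDF)
upper_tail = stats.t.sf(abs(t_obs), df)
# For a two-tailed test, double it to include the symmetric lower tail
p_value = 2 * upper_tail
print(round(p_value, 4))
```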
The good news is you rarely need to calculate these areas manually. Statistical software and libraries like Python's scipy.stats handle these calculations for you. Your job is to understand the input (your data, your hypotheses) and correctly interpret the output (the p-value).
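For example, here is a sketch of comparing two sets of model accuracy scores with a two-sample t-test via scipy.stats.ttest_ind. The scores are simulated for illustration, standing in for accuracies collected across cross-validation folds:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated accuracy scores for two models across 30 evaluation runs
old_model = rng.normal(loc=0.80, scale=0.02, size=30)
new_model = rng.normal(loc=0.83, scale=0.02, size=30)

# Two-sample t-test: H0 says the two models have the same mean accuracy
t_stat, p_value = stats.ttest_ind(new_model, old_model)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```

Here the simulated difference in means is large relative to the noise, so the test returns a very small p-value, and we would reject H0 at α = 0.05.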
P-values are powerful, but they are also frequently misunderstood. Keep these points in mind:

- A p-value is not the probability that H0 is true. It is computed under the assumption that H0 is true.
- A p-value says nothing about the size or practical importance of an effect. With a large enough sample, even a tiny, unimportant difference can produce a small p-value.
- The common threshold of α = 0.05 is a convention, not a law. Statistical significance is not the same as practical significance.
In machine learning, you'll encounter p-values when:

- Comparing the performance of two models to judge whether an observed difference is statistically significant or just noise.
- Running A/B tests to evaluate changes to a deployed system.
- Performing feature selection with statistical tests, such as Chi-squared tests for categorical features.
Understanding p-values allows you to move beyond just observing differences and make statistically grounded claims about whether those differences are likely real or just due to random chance. In the following sections, we'll look at specific tests (like t-tests and Chi-squared tests) that produce these p-values.
© 2025 ApX Machine Learning