While point estimation gives us a single best guess for a population parameter, and confidence intervals provide a range of plausible values, sometimes we need to make a more direct decision about a specific claim or assumption regarding the population. This is where hypothesis testing comes in. It provides a formal procedure for using sample data to decide between two competing statements about a population characteristic.
Think of it like a trial in a courtroom. There's an initial assumption (e.g., "innocent until proven guilty"), which we call the null hypothesis. Then, evidence (sample data) is presented. Based on the strength of this evidence, a decision is made: either stick with the initial assumption or reject it in favor of an alternative conclusion (the alternative hypothesis). Hypothesis testing in statistics works on a similar principle.
The core idea is to start with a specific claim about the population, often representing a default state, status quo, or "no effect." For example:

- A redesigned website has no effect on average user session duration.
- A new training method does not change a model's average accuracy.
- A coin is fair, landing heads 50% of the time.
This initial claim is the one we'll put "on trial." We then collect sample data relevant to this claim. The critical question becomes:
"If the initial claim (the null hypothesis) were actually true, how likely is it that we would observe sample data like ours (or even more extreme) just by random chance?"
If our sample data looks very typical or reasonably likely under the assumption that the initial claim is true, we don't have strong evidence to discard the claim. We "fail to reject" the null hypothesis (similar to a "not guilty" verdict; it doesn't necessarily prove innocence, but the evidence wasn't strong enough for a conviction).
However, if our sample data looks very unusual or highly unlikely assuming the initial claim is true, it casts doubt on that claim. It suggests that the initial assumption might be wrong. In this case, we have statistically significant evidence to "reject" the null hypothesis in favor of the alternative conclusion.
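To make the two verdicts concrete, here is a minimal Python sketch of this decision logic. The 0.05 cutoff is a common convention (formalized later as the significance level), and the probability values passed in are hypothetical placeholders; computing them properly is the subject of the following sections.

```python
def decide(prob_of_data_under_claim, cutoff=0.05):
    """Translate 'how likely is data this extreme, if the claim is true?'
    into a reject / fail-to-reject verdict."""
    if prob_of_data_under_claim < cutoff:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(decide(0.42))    # typical-looking data  -> fail to reject
print(decide(0.001))   # very surprising data  -> reject
```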
Let's revisit the website redesign example. Suppose the old website design had an average user session duration of 3 minutes, and we want to test whether a new design increases this duration. We show the new design to a sample of users and find that their average session duration is 4.5 minutes.
If calculations (which we'll learn more about soon) show that getting a sample average of 4.5 minutes is extremely unlikely if the true average were still 3 minutes, we would reject the initial "no effect" claim. We'd conclude that the evidence supports the idea that the new design increases session duration. If, however, a sample average of 4.5 minutes is reasonably plausible even if the true average is 3 minutes (perhaps due to high variability or a small sample size), we wouldn't have enough evidence to reject the initial claim.
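As a rough sketch of what those calculations can look like, the following Python simulation repeatedly draws samples under the assumption that the true average is still 3 minutes, then checks how often chance alone produces a sample mean of at least 4.5 minutes. The standard deviations, sample sizes, and the normal model itself are illustrative assumptions chosen to mirror the two scenarios described above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def prob_mean_at_least(observed, true_mean, sd, n, sims=100_000):
    """Estimate how often sampling under the null (true_mean) yields
    a sample mean at least as large as the observed one."""
    # Draw `sims` samples of size n, assuming the null hypothesis is true.
    means = rng.normal(true_mean, sd, size=(sims, n)).mean(axis=1)
    return (means >= observed).mean()

# Scenario 1: modest variability, decent sample size (assumed values).
# A mean of 4.5 is then extremely unlikely under the null -> reject.
print(prob_mean_at_least(4.5, true_mean=3.0, sd=1.5, n=40))   # ~0.0

# Scenario 2: high variability, small sample (assumed values).
# A mean of 4.5 is then reasonably plausible by chance -> fail to reject.
print(prob_mean_at_least(4.5, true_mean=3.0, sd=4.0, n=8))    # ~0.14
```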
Here's a simplified view of the decision process:

1. Formulate a claim about the population.
2. Gather relevant sample data.
3. Assess how surprising the data is if the claim is true.
4. Make a decision: reject the claim or fail to reject it.
Hypothesis testing doesn't prove anything with absolute certainty. It's about weighing evidence and making decisions based on probabilities. It provides a structured way to assess whether the patterns we see in our sample data are strong enough to conclude that a real effect or difference exists in the larger population, moving beyond mere description or estimation towards making informed judgments. In the upcoming sections, we'll formalize these ideas by defining the null and alternative hypotheses precisely and introducing the concept of the p-value to help make the decision.