Okay, we've looked at ways to summarize data using single numbers like the mean, median, and standard deviation. These statistics give us a snapshot of the data's center and spread. But often, we need a more complete picture. We want to understand how the data values are distributed. How often do different values appear? Are certain values very common while others are rare? This is where frequency distributions come in.
A frequency distribution is essentially a summary that shows how many times (the frequency) each different value appears in a dataset. Think of it as organizing your data by counting occurrences. This helps you see patterns that single summary statistics might hide.
The simplest way to represent a frequency distribution is with a frequency table. Let's imagine a small dataset representing the number of coffees purchased by 20 different customers in a week:
2, 3, 1, 0, 2, 4, 3, 2, 1, 0, 5, 2, 3, 1, 2, 0, 4, 2, 3, 2
To create a frequency table, we first list the unique values present in the data and then count how many times each value appears.
Coffees Purchased (Value) | Tally | Frequency (Count) |
---|---|---|
0 | III | 3 |
1 | III | 3 |
2 | IIIII II | 7 |
3 | IIII | 4 |
4 | II | 2 |
5 | I | 1 |
Total | | 20 |
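If you're working in Python, you don't have to tally by hand. Here's a minimal sketch using the standard library's `collections.Counter` (the variable names are just for illustration):

```python
from collections import Counter

# The weekly coffee purchases for the 20 customers
coffees = [2, 3, 1, 0, 2, 4, 3, 2, 1, 0, 5, 2, 3, 1, 2, 0, 4, 2, 3, 2]

# Counter tallies how many times each distinct value appears
freq = Counter(coffees)

# Print the frequency table in ascending order of value
for value in sorted(freq):
    print(f"{value} coffees: {freq[value]} customers")
```

Running this reproduces the counts in the table above: 3 customers bought 0 coffees, 7 bought exactly 2, and so on.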
This table immediately tells us more than just the average. We can see that buying 2 coffees per week is the most common behavior (the mode), while buying 5 is quite rare.
Sometimes, just knowing the count isn't enough. We might want to know the proportion or percentage of the total that each value represents. This is called the relative frequency. You calculate it by dividing the frequency of a value by the total number of observations.
We can also add cumulative frequency, which is the running total of frequencies as you go down the list of values. It tells you how many observations fall at or below a certain value.
Let's add these to our coffee example:
Coffees Purchased | Frequency | Relative Frequency (Frequency / 20) | Cumulative Frequency |
---|---|---|---|
0 | 3 | 3 / 20 = 0.15 (or 15%) | 3 |
1 | 3 | 3 / 20 = 0.15 (or 15%) | 3 + 3 = 6 |
2 | 7 | 7 / 20 = 0.35 (or 35%) | 6 + 7 = 13 |
3 | 4 | 4 / 20 = 0.20 (or 20%) | 13 + 4 = 17 |
4 | 2 | 2 / 20 = 0.10 (or 10%) | 17 + 2 = 19 |
5 | 1 | 1 / 20 = 0.05 (or 5%) | 19 + 1 = 20 |
Total | 20 | 1.00 (or 100%) | |
Now we can easily see that 35% of the customers bought exactly 2 coffees, and 13 out of 20 customers (or 65%, calculated from the cumulative relative frequency) bought 2 or fewer coffees.
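Both extra columns are straightforward to compute programmatically. Here's a short sketch using pandas, assuming the same `coffees` list as before; `value_counts` gives the frequencies, and dividing by the total and taking a running sum gives the relative and cumulative columns:

```python
import pandas as pd

coffees = [2, 3, 1, 0, 2, 4, 3, 2, 1, 0, 5, 2, 3, 1, 2, 0, 4, 2, 3, 2]

s = pd.Series(coffees)
freq = s.value_counts().sort_index()  # frequency of each distinct value

table = pd.DataFrame({
    "frequency": freq,
    "relative_frequency": freq / len(s),    # proportion of all observations
    "cumulative_frequency": freq.cumsum(),  # running total of frequencies
})
print(table)
```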
What if you have data with many unique values, or continuous data (like height or temperature)? Listing every single value might make the table too long and not very useful. In these cases, we create a grouped frequency distribution.
We group the data into ranges or intervals, often called bins or classes. Then, we count how many data points fall into each bin.
For example, if we had test scores for 30 students ranging from 55 to 98, we might group them like this:
Score Range (Bin) | Frequency |
---|---|
50-59 | 2 |
60-69 | 5 |
70-79 | 11 |
80-89 | 8 |
90-99 | 4 |
Total | 30 |
Choosing the right bin size is important. Too few bins might hide important details, while too many might make the pattern hard to see. There's no single perfect rule, and it often involves some judgment based on the data.
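In Python, NumPy can do the grouping for you. The sketch below uses a hypothetical list of 30 raw scores (invented to be consistent with the table above, not real data) and explicit bin edges; it also shows `numpy.histogram_bin_edges`, which implements common bin-count heuristics such as Sturges' rule if you'd rather not pick edges by hand:

```python
import numpy as np

# Hypothetical raw scores, consistent with the grouped table above
scores = [55, 58, 61, 63, 65, 66, 68, 71, 72, 73, 74, 75, 75, 76, 77,
          78, 78, 79, 81, 82, 83, 84, 85, 87, 88, 89, 91, 93, 95, 98]

# Explicit bin edges: [50, 60) becomes the 50-59 bin, and so on
edges = [50, 60, 70, 80, 90, 100]
counts, _ = np.histogram(scores, bins=edges)

for lo, hi, count in zip(edges[:-1], edges[1:], counts):
    print(f"{lo}-{hi - 1}: {count}")

# Let NumPy suggest bin edges instead, using Sturges' rule
print(np.histogram_bin_edges(scores, bins="sturges"))
```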
Frequency tables are useful, but visualizing the distribution often makes patterns even clearer.
For numerical data (especially continuous data or discrete data grouped into bins), the standard visualization is a histogram.
A histogram looks like a bar chart, but there are important differences:

- The bars touch each other, with no gaps, because each bar covers a numerical interval and adjacent bins share an edge.
- The horizontal axis is a numerical scale, so the order of the bars is fixed by the values themselves.
- The height of each bar shows how many observations fall within that bin.
Let's visualize the grouped test score data:
*Histogram showing the frequency of student scores within defined ranges. The tallest bar indicates the most frequent score range (70-79).*
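To draw a histogram like this yourself, matplotlib's `hist` function handles both the binning and the plotting. A minimal sketch, reusing the hypothetical `scores` list from the grouping example above:

```python
import matplotlib.pyplot as plt

# Same hypothetical scores as in the earlier grouping sketch
scores = [55, 58, 61, 63, 65, 66, 68, 71, 72, 73, 74, 75, 75, 76, 77,
          78, 78, 79, 81, 82, 83, 84, 85, 87, 88, 89, 91, 93, 95, 98]

# edgecolor makes the adjacent, touching bars easier to distinguish
plt.hist(scores, bins=[50, 60, 70, 80, 90, 100], edgecolor="black")
plt.xlabel("Test score")
plt.ylabel("Frequency")
plt.title("Distribution of Test Scores")
plt.show()
```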
Looking at a histogram allows us to quickly assess the shape of the distribution. Is it roughly symmetrical (like a bell curve)? Is it skewed, with a long tail extending to the right (positively skewed) or left (negatively skewed)? Does it have one peak (unimodal) or multiple peaks (multimodal)?
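If you want a number to back up the visual impression, SciPy provides a sample skewness statistic. A quick sketch on the coffee data (positive values indicate a right tail, negative values a left tail):

```python
from scipy.stats import skew

coffees = [2, 3, 1, 0, 2, 4, 3, 2, 1, 0, 5, 2, 3, 1, 2, 0, 4, 2, 3, 2]

# Prints a small positive value, reflecting the slight right tail
# created by the lone customer who bought 5 coffees
print(skew(coffees))
```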
For categorical data (like types of pets, favorite colors, or product categories), we use a bar chart. It looks similar to a histogram, but:

- Each bar represents a distinct category rather than a numerical interval.
- There are gaps between the bars, emphasizing that the categories are separate.
- The bars can be placed in any order (alphabetical, by frequency, and so on) because categories have no inherent numerical order.
Let's imagine data on favorite types of programming projects among a group of learners:
*Bar chart displaying the popularity of different programming project types among learners. Data Analysis is the most favored category.*
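Producing a bar chart with matplotlib looks almost the same as the histogram, except you pass categories and their counts directly. A sketch with hypothetical counts, invented to be consistent with the description above:

```python
import matplotlib.pyplot as plt

# Hypothetical survey results (illustrative only)
categories = ["Data Analysis", "Web Apps", "Games", "Automation"]
counts = [12, 8, 5, 7]

# plt.bar leaves gaps between bars, unlike a histogram's touching bins
plt.bar(categories, counts)
plt.xlabel("Project type")
plt.ylabel("Number of learners")
plt.title("Favorite Programming Project Types")
plt.show()
```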
Understanding frequency distributions is a fundamental part of Exploratory Data Analysis (EDA). It moves beyond single summary numbers to show the underlying structure of your data. By examining frequency tables and visualizations like histograms or bar charts, you can:

- Identify the most common and rarest values in your data.
- Assess the overall shape of the distribution: symmetric, skewed, unimodal, or multimodal.
- Spot gaps, clusters, or outliers that summary statistics alone would hide.
This understanding is essential before applying more complex statistical methods or building machine learning models. It helps you check assumptions, choose appropriate techniques, and ultimately, draw more accurate conclusions from your data.