While measures like the mean and standard deviation tell us about the center and overall spread of our data, they don't give us a sense of where a particular data point stands relative to others. Sometimes, we want to know what value marks the bottom 25% of the data, or what constitutes the top 10%. This is where percentiles and quartiles become useful tools for describing the position of values within a distribution.
Imagine you have a list of exam scores for a class, sorted from lowest to highest. A percentile is a measure that tells you what percentage of the total data points fall below a specific value.
For example, if your score is at the 80th percentile, it means that 80% of the scores in the class are lower than yours, and 20% are higher. The 50th percentile is the value where 50% of the data falls below it – you might recognize this as the median, which we discussed earlier as a measure of central tendency.
Percentiles give context to individual data points and help us understand the shape of the distribution. For instance, only 10% of the values fall below the 10th percentile, so it marks the lower end of the data, while the 90th percentile, with only 10% of the values above it, marks the upper end.
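To make the definition concrete, here is a minimal sketch that computes the percentage of values falling below a given score. The exam scores and the helper name `percent_below` are made up for illustration; they are not part of any library.

```python
import numpy as np

# Hypothetical exam scores for a class of ten students
scores = np.array([52, 61, 67, 70, 74, 78, 81, 85, 90, 96])

def percent_below(data, value):
    """Return the percentage of data points strictly below `value`."""
    # (data < value) is a boolean array; its mean is the fraction of True entries
    return 100.0 * np.mean(data < value)

my_score = 90
rank = percent_below(scores, my_score)
print(f"{rank:.0f}% of scores fall below {my_score}")
```

Eight of the ten scores are lower than 90, so under this "strictly below" definition a score of 90 sits at the 80th percentile of this small class.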
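Quartiles are specific, commonly used percentiles that divide the dataset into four equal parts. Think of them as cutting your sorted data at the 25%, 50%, and 75% marks: the first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the 50th percentile (the median), and the third quartile (Q3) is the 75th percentile.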
Together, Q1, Q2 (the median), and Q3 give us a concise summary of the data's distribution, indicating where the bulk of the values lie.
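As a quick check of the "four equal parts" idea, the sketch below counts how many points land in each section between the minimum, Q1, Q2, Q3, and the maximum. It uses the sample response times from this section; using `np.histogram` with the quartiles as bin edges is just one convenient way to do the counting.

```python
import numpy as np

# Sample response times in milliseconds (already sorted)
data = np.array([120, 150, 155, 160, 175, 180, 190, 210, 230, 250, 300, 500])

# Compute the three quartiles in one call
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# Use the quartiles as bin edges; the last bin includes the maximum
edges = [data.min(), q1, q2, q3, data.max()]
counts, _ = np.histogram(data, bins=edges)

print(f"Quartiles: Q1={q1}, Q2={q2}, Q3={q3}")
print(f"Points per section: {counts.tolist()}")
```

With 12 data points, each of the four sections holds 3 values. On datasets with ties or a size that is not a multiple of four, the split will only be approximately even.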
Using the quartiles, we can calculate another measure of spread called the Interquartile Range (IQR). It's simply the difference between the third quartile (Q3) and the first quartile (Q1):
IQR = Q3 − Q1

The IQR represents the range spanned by the middle 50% of the data. Why is this useful? Unlike the overall range (maximum − minimum), the IQR is largely unaffected by extreme outliers. A single very high or very low value changes Q1 and Q3 little, if at all, which makes the IQR a more robust measure of spread for skewed datasets or data with outliers.
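The robustness claim can be checked with a small experiment: inflate the largest value and compare how the range and the IQR respond. This is a sketch using the sample response times from this section; the helper names `value_range` and `iqr` and the inflated value of 5000 are illustrative assumptions.

```python
import numpy as np

def value_range(data):
    """Overall range: maximum minus minimum."""
    return data.max() - data.min()

def iqr(data):
    """Interquartile range: Q3 minus Q1."""
    q1, q3 = np.percentile(data, [25, 75])
    return q3 - q1

clean = np.array([120, 150, 155, 160, 175, 180, 190, 210, 230, 250, 300, 500])

# Same data, but with the largest value replaced by a far more extreme one
with_outlier = clean.copy()
with_outlier[-1] = 5000

print(f"Range: {value_range(clean)} -> {value_range(with_outlier)}")
print(f"IQR:   {iqr(clean)} -> {iqr(with_outlier)}")
```

The range jumps from 380 to 4880, while the IQR is the same for both arrays because neither Q1 nor Q3 depends on the largest value.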
Manually sorting data and finding percentiles is tedious. Python libraries like NumPy and Pandas make this straightforward.
Using NumPy:
The numpy.percentile() function is commonly used. You provide the data and the percentile(s) you want (as values between 0 and 100).
```python
import numpy as np

# Sample data (e.g., response times in milliseconds)
response_times = np.array([120, 150, 155, 160, 175, 180, 190, 210, 230, 250, 300, 500])

# Calculate the 25th percentile (Q1)
q1 = np.percentile(response_times, 25)

# Calculate the 50th percentile (Median, Q2)
median = np.percentile(response_times, 50)  # Or np.median(response_times)

# Calculate the 75th percentile (Q3)
q3 = np.percentile(response_times, 75)

# Calculate the IQR
iqr = q3 - q1

print(f"Data: {response_times}")
print(f"Q1 (25th Percentile): {q1}")
print(f"Median (50th Percentile): {median}")
print(f"Q3 (75th Percentile): {q3}")
print(f"IQR (Q3 - Q1): {iqr}")
```
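As a small variation, `np.percentile` also accepts a sequence of percentiles, so all three quartiles can be requested in a single call. The sketch below also shows `np.quantile`, which does the same job but takes fractions between 0 and 1 instead of values between 0 and 100.

```python
import numpy as np

response_times = np.array([120, 150, 155, 160, 175, 180, 190, 210, 230, 250, 300, 500])

# Passing a list returns one result per requested percentile
q1, median, q3 = np.percentile(response_times, [25, 50, 75])
print(f"Q1={q1}, Median={median}, Q3={q3}")

# np.quantile uses the 0-1 scale: 0.90 here is the 90th percentile
p90 = np.quantile(response_times, 0.90)
print(f"90th percentile: {p90}")
```

Which function to use is mostly a matter of which scale you prefer; mixing the two scales (for example, passing 0.25 to `np.percentile`) is a common source of silent errors.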
Using Pandas:
If your data is in a Pandas Series or DataFrame, you can use the .quantile() method. Note that it expects quantiles as values between 0 and 1. The .describe() method also conveniently includes Q1 (25%), median (50%), and Q3 (75%).
```python
import pandas as pd

# Sample data in a Pandas Series
response_times_pd = pd.Series([120, 150, 155, 160, 175, 180, 190, 210, 230, 250, 300, 500])

# Calculate specific quantiles
q1_pd = response_times_pd.quantile(0.25)
median_pd = response_times_pd.quantile(0.50)  # Or response_times_pd.median()
q3_pd = response_times_pd.quantile(0.75)
iqr_pd = q3_pd - q1_pd

print("\nUsing Pandas:")
print(f"Q1 (25th Percentile): {q1_pd}")
print(f"Median (50th Percentile): {median_pd}")
print(f"Q3 (75th Percentile): {q3_pd}")
print(f"IQR (Q3 - Q1): {iqr_pd}")

# Using .describe() for a summary including quartiles
print("\nPandas describe() output:")
print(response_times_pd.describe())
```
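Two related conveniences are worth a quick sketch: `.quantile()` accepts a list of quantiles, and `.describe()` takes a `percentiles` argument if you want cut points other than the default quartiles. The choice of the 10th and 90th percentiles below is only an example.

```python
import pandas as pd

response_times_pd = pd.Series([120, 150, 155, 160, 175, 180, 190, 210, 230, 250, 300, 500])

# A list of quantiles returns a Series indexed by the requested quantile
quartiles = response_times_pd.quantile([0.25, 0.50, 0.75])
print(quartiles)

# Ask describe() for the 10th and 90th percentiles (the median is still included)
summary = response_times_pd.describe(percentiles=[0.10, 0.90])
print(summary)
```

In the `describe()` output, the requested percentiles appear as rows labeled `10%` and `90%`, alongside the count, mean, standard deviation, minimum, median, and maximum.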
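Let's look at the results for the response_times example. NumPy and Pandas both use linear interpolation by default, so they return the same values here: Q1 = 158.75, Median = 185.0, Q3 = 235.0, which gives an IQR of 76.25. In other words, about a quarter of the response times are at or below 158.75 ms, half are at or below 185 ms, and three quarters are at or below 235 ms. The middle 50% of the data spans 76.25 ms.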
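Interpolation only matters when the requested quantile falls between two data points, as Q1 does here (between 155 and 160). The sketch below uses the `interpolation` argument of the Pandas `Series.quantile()` method to show how the choice changes the answer. The `results` dictionary is just a local helper for collecting the output.

```python
import pandas as pd

response_times_pd = pd.Series([120, 150, 155, 160, 175, 180, 190, 210, 230, 250, 300, 500])

# Compute Q1 under each interpolation rule that Series.quantile() supports
results = {}
for method in ["linear", "lower", "higher", "midpoint", "nearest"]:
    results[method] = response_times_pd.quantile(0.25, interpolation=method)

for method, value in results.items():
    print(f"{method:>8}: Q1 = {value}")
```

The default, `linear`, gives 158.75, while `lower` and `higher` snap to the neighboring data points 155 and 160, and `midpoint` averages them to 157.5. Differences like these are why quartiles reported by different tools or textbooks do not always match exactly, especially on small datasets.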
This gives us a much better picture than just the mean or standard deviation alone. We can see the central tendency (median) and the spread of the bulk of the data (IQR), helping us identify potential skewness or unusual values (like the 500 ms response time).
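One way to turn "unusual values" into something checkable is the widely used 1.5 × IQR rule of thumb, which flags points lying more than 1.5 IQRs below Q1 or above Q3. This rule is a common convention rather than something defined in this section, and the multiplier of 1.5 is an assumption you may want to adjust. A minimal sketch:

```python
import numpy as np

response_times = np.array([120, 150, 155, 160, 175, 180, 190, 210, 230, 250, 300, 500])

q1, q3 = np.percentile(response_times, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower_fence, upper_fence] is flagged as a potential outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_flagged = (response_times < lower_fence) | (response_times > upper_fence)
flagged = response_times[is_flagged]

print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Potential outliers: {flagged.tolist()}")
```

For this sample the fences work out to 44.375 and 349.375, so only the 500 ms response time is flagged. A flagged value is a candidate for closer inspection, not automatically an error.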
Here's a simple visualization showing how quartiles divide the data points:
Figure: Histogram of sample response times with vertical lines indicating the first quartile (Q1), median (Q2), and third quartile (Q3). Each section between the start, Q1, median, Q3, and end ideally contains 25% of the data points (though binning effects may alter exact counts per bar).
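A plot along these lines can be sketched with Matplotlib. This is one possible way to draw it, not necessarily how the original figure was produced; the bin count, colors, and the output file name `response_time_quartiles.png` are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

response_times = np.array([120, 150, 155, 160, 175, 180, 190, 210, 230, 250, 300, 500])
q1, median, q3 = np.percentile(response_times, [25, 50, 75])

fig, ax = plt.subplots(figsize=(8, 4))

# Histogram of the raw values
ax.hist(response_times, bins=10, color="lightgray", edgecolor="black")

# One dashed vertical line per quartile
markers = [(q1, "Q1", "tab:blue"), (median, "Median (Q2)", "tab:green"), (q3, "Q3", "tab:red")]
for value, label, color in markers:
    ax.axvline(value, color=color, linestyle="--", label=f"{label} = {value}")

ax.set_xlabel("Response time (ms)")
ax.set_ylabel("Count")
ax.set_title("Sample response times with quartiles")
ax.legend()

# Write the figure to disk; in a notebook you could display it instead
fig.savefig("response_time_quartiles.png")
```

The long gap between the Q3 line and the single bar near 500 ms makes the right skew of this sample easy to see.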
Understanding percentiles and quartiles provides a more granular view of your data's distribution. They are fundamental concepts that feed directly into creating and interpreting visualizations like box plots, which we will cover next.
© 2025 ApX Machine Learning