While measures like the mean and standard deviation tell us about the center and spread of our data, they don't fully describe the position of specific values within the distribution. Percentiles and quartiles provide this valuable context, helping us understand relative standing and identify potential unusual observations.
A percentile is a measure indicating the value below which a given percentage of observations in a group falls. For instance, the k-th percentile, written Pk, is the value v such that k percent of the data points are less than or equal to v.
If you scored in the 85th percentile on a standardized test, it means 85% of the test-takers scored at or below your score. The 50th percentile, P50, is a familiar concept. It's the value that splits the data in half, which is precisely the definition of the median, a measure of central tendency we discussed earlier.
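As a minimal sketch of this definition, the snippet below computes the percentage of observations at or below a given score and checks that the 50th percentile matches the median. The scores and the helper name `percentile_rank` are invented for illustration and are not part of any library.

```python
import pandas as pd

# Hypothetical test scores, invented for illustration
scores = pd.Series([55, 62, 68, 71, 74, 78, 81, 85, 90, 96])


def percentile_rank(values: pd.Series, score: float) -> float:
    """Percentage of observations that are less than or equal to `score`."""
    return (values <= score).mean() * 100


# Share of test-takers who scored at or below 85
rank_85 = percentile_rank(scores, 85)
print(f"A score of 85 has a percentile rank of about {rank_85:.0f}")

# The 50th percentile and the median are the same quantity
p50 = scores.quantile(0.50)
print(f"P50: {p50}, median: {scores.median()}")
```

Here 8 of the 10 hypothetical scores are at or below 85, so the score sits at roughly the 80th percentile of this small sample.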
Calculating the exact percentile value can involve different interpolation methods when the desired percentile falls between two data points. However, the core concept remains the same: it ranks a value relative to the rest of the dataset.
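To illustrate how the choice of interpolation method can change the result, the sketch below asks Pandas for the 25th percentile of a tiny, made-up sample where that percentile lands between two data points. It relies on the `interpolation` parameter of `Series.quantile()`; the sample values are an assumption chosen only to make the differences visible.

```python
import pandas as pd

# Tiny hypothetical sample: the 25th percentile falls between 10 and 20
values = pd.Series([10, 20, 30, 40])

methods = ["linear", "lower", "higher", "midpoint", "nearest"]

# Compute the same percentile under each interpolation method
results = {m: values.quantile(0.25, interpolation=m) for m in methods}

for method, result in results.items():
    print(f"{method:>8}: {result}")
```

With only four observations the methods disagree noticeably (anywhere from 10 to 20 here). On large datasets the gap between neighboring data points is usually small, so the methods tend to give similar answers.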
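Quartiles are specific, commonly used percentiles that divide the sorted dataset into four equal parts. The first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the 50th percentile, and the third quartile (Q3) is the 75th percentile.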
Together, Q1, Q2 (the median), and Q3 split the data into four segments, each containing approximately 25% of the observations.
Data: |--- 25% ---|--- 25% ---|--- 25% ---|--- 25% ---|
      ^           ^           ^           ^           ^
     Min          Q1      Q2 (Median)     Q3         Max
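The four-way split can be checked numerically. The following sketch uses a hypothetical dataset (the integers 1 through 100), computes the three quartiles, and counts what share of the observations falls into each segment. The segment labels are invented for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the integers 1 through 100
values = pd.Series(np.arange(1, 101))

# Q1, Q2 (median), and Q3 in a single call
q1, q2, q3 = values.quantile([0.25, 0.50, 0.75])

# Assign each observation to the segment between consecutive quartiles
segments = pd.cut(
    values,
    bins=[-np.inf, q1, q2, q3, np.inf],
    labels=["min to Q1", "Q1 to Q2", "Q2 to Q3", "Q3 to max"],
)

# Percentage of observations in each segment
shares = segments.value_counts(normalize=True, sort=False) * 100
print(shares)
```

For this evenly spaced sample each segment holds 25% of the values. With real data, ties and small sample sizes mean the shares are only approximately equal.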
The distance between the first and third quartiles is known as the Interquartile Range (IQR).
IQR=Q3−Q1
The IQR measures the spread of the middle 50% of the data. It's a particularly useful measure of dispersion because, like the median, it is robust to outliers. A few extreme values at the high or low ends of the dataset have little or no effect on the IQR, unlike the range or standard deviation, which can be heavily influenced by a single anomalous point. A smaller IQR indicates that the central half of the data is tightly clustered, while a larger IQR suggests more variability in the middle portion of the distribution.
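The robustness claim is easy to see with a small experiment. The sketch below compares the IQR, range, and standard deviation of a sample of scores with a copy in which the maximum has been replaced by an extreme value. The value 1050 and the helper `iqr` are assumptions made for illustration.

```python
import pandas as pd


def iqr(values: pd.Series) -> float:
    """Interquartile range: Q3 minus Q1."""
    return values.quantile(0.75) - values.quantile(0.25)


# Sample scores, plus a copy whose maximum is replaced by an extreme value
clean = pd.Series([68, 75, 77, 82, 85, 88, 90, 91, 93, 95, 99, 105])
contaminated = clean.copy()
contaminated.iloc[-1] = 1050  # hypothetical data-entry error

for name, s in [("clean", clean), ("contaminated", contaminated)]:
    value_range = s.max() - s.min()
    print(f"{name:>12}: IQR={iqr(s):.2f}  range={value_range}  std={s.std():.2f}")
```

The IQR is identical for both series because the quartiles depend only on the values near the 25% and 75% positions, while the range and standard deviation grow sharply once the extreme value is introduced.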
The IQR provides a common statistical rule for identifying potential outliers. Data points that fall significantly outside the range defined by Q1 and Q3 might warrant further investigation. The standard guideline defines potential outliers as observations falling below Q1−1.5×IQR or above Q3+1.5×IQR.
Any data point outside these bounds is often flagged as a potential outlier. This doesn't automatically mean the data point is erroneous or should be removed; it simply highlights it as unusual compared to the bulk of the data. Context is always important when deciding how to handle outliers.
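A minimal sketch of the 1.5×IQR rule follows. The scores are hypothetical, with one deliberately high value (150), and the names `lower_fence` and `upper_fence` are illustrative rather than standard API names.

```python
import pandas as pd

# Hypothetical scores with one unusually high value (150)
scores = pd.Series([68, 75, 77, 82, 85, 88, 90, 91, 93, 95, 99, 105, 150])

q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Bounds from the 1.5 x IQR guideline
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag observations outside the bounds as potential outliers
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower_fence}, {upper_fence}]")
print(f"Potential outliers: {outliers.tolist()}")
```

Only the value 150 falls outside the bounds in this sample. Whether it should be kept, corrected, or removed still depends on what the data represents.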
These calculations form the basis for box plots (or box-and-whisker plots), a powerful visualization tool we'll encounter later. Box plots graphically represent the median, Q1, Q3, the IQR, and potential outliers.
Example box plot showing the median (center line), Q1 and Q3 (edges of the box), whiskers (extending to data within 1.5×IQR from the box), and an outlier (individual point).
Manually calculating percentiles and quartiles for large datasets is tedious. Python libraries like Pandas provide efficient functions. The quantile() method of a Pandas Series or DataFrame column is commonly used.
import pandas as pd
# Sample data representing scores
data = pd.Series([68, 75, 77, 82, 85, 88, 90, 91, 93, 95, 99, 105])
# Calculate the 75th percentile (Q3)
q3 = data.quantile(0.75)
print(f"Q3 (75th percentile): {q3}")
# Calculate the 25th percentile (Q1)
q1 = data.quantile(0.25)
print(f"Q1 (25th percentile): {q1}")
# Calculate the Interquartile Range (IQR)
iqr = q3 - q1
print(f"IQR: {iqr}")
# Calculate specific percentiles, e.g., 90th percentile
p90 = data.quantile(0.90)
print(f"90th percentile: {p90}")
# Using describe() provides Q1, median (50%), and Q3
print("\nSummary Statistics including Quartiles:")
print(data.describe())
Output:
Q3 (75th percentile): 93.5
Q1 (25th percentile): 80.75
IQR: 12.75
90th percentile: 98.6

Summary Statistics including Quartiles:
count     12.000000
mean      87.333333
std       10.560073
min       68.000000
25%       80.750000
50%       89.000000
75%       93.500000
max      105.000000
dtype: float64
As seen in the output, data.quantile(0.75) returns 93.5, indicating 75% of the scores are at or below this value. The describe() method conveniently outputs the 25th (Q1), 50th (median/Q2), and 75th (Q3) percentiles along with other summary statistics.
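Both methods also accept more than one percentile at a time. As a short sketch, assuming the same sample scores as above, quantile() can take a list of quantiles and describe() can take a percentiles argument to report values other than the default quartiles. The choice of the 10th and 90th percentiles here is arbitrary.

```python
import pandas as pd

# Same sample scores as in the example above
data = pd.Series([68, 75, 77, 82, 85, 88, 90, 91, 93, 95, 99, 105])

# Passing a list returns a Series indexed by the requested quantiles
quartiles = data.quantile([0.25, 0.50, 0.75])
print(quartiles)

# Request custom percentiles in the summary instead of the default quartiles
summary = data.describe(percentiles=[0.10, 0.90])
print(summary)
```

Requesting several quantiles in one call keeps the results together in a single Series, which is convenient when they feed into later steps such as computing the IQR.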
Percentiles and quartiles enrich our understanding of data distributions by providing information about relative positions and the spread of the central data portion. They are fundamental tools for exploratory data analysis and serve as building blocks for outlier detection and visualization techniques like box plots.
© 2025 ApX Machine Learning