While measures like the mean, median, and mode tell us about the typical value or center of our data, they don't paint the full picture. Consider two sets of exam scores: {70, 75, 80, 85, 90} and {60, 70, 80, 90, 100}. Both have a mean and median of 80, but the scores in the second set are clearly more spread out. To understand a dataset fully, we also need to quantify this spread, often referred to as dispersion or variability. This section introduces the standard ways to measure how tightly or loosely data points cluster around the center.
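A quick computation confirms that the two sets share the same center despite their different spreads; for instance, using Python's standard library:

```python
import statistics

set1 = [70, 75, 80, 85, 90]
set2 = [60, 70, 80, 90, 100]

# Both sets have identical measures of center...
print(statistics.mean(set1), statistics.median(set1))  # 80 80
print(statistics.mean(set2), statistics.median(set2))  # 80 80
# ...yet the values in set2 are clearly more spread out.
```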
The simplest measure of dispersion is the range. It's calculated by subtracting the minimum value from the maximum value in the dataset.
Range = Maximum Value - Minimum Value
For our example score sets:
Range of Set 1 = 90 - 70 = 20
Range of Set 2 = 100 - 60 = 40
The range gives a quick sense of the total span covered by the data. However, its major drawback is its sensitivity to extreme values, or outliers. A single very high or very low value can drastically inflate the range, potentially giving a misleading impression of the overall spread. Because it only uses two data points, it ignores how the rest of the data is distributed.
In Pandas, you can calculate the range easily:
import pandas as pd
data = {'scores_set1': [70, 75, 80, 85, 90],
        'scores_set2': [60, 70, 80, 90, 100]}
df = pd.DataFrame(data)
range_set1 = df['scores_set1'].max() - df['scores_set1'].min()
range_set2 = df['scores_set2'].max() - df['scores_set2'].min()
print(f"Range for Set 1: {range_set1}")
print(f"Range for Set 2: {range_set2}")
# Output:
# Range for Set 1: 20
# Range for Set 2: 40
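To see how fragile the range is, consider what happens if the top score in Set 1 were mis-entered as 200 (a hypothetical data-entry error):

```python
import pandas as pd

clean = pd.Series([70, 75, 80, 85, 90])
with_outlier = pd.Series([70, 75, 80, 85, 200])  # hypothetical data-entry error

# A single bad value more than sextuples the range.
print(clean.max() - clean.min())                # 20
print(with_outlier.max() - with_outlier.min())  # 130
```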
A more robust measure of spread, less influenced by outliers, is the Interquartile Range (IQR). As the name suggests, it involves quartiles, which divide the sorted data into four equal parts: the first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the median, and the third quartile (Q3) is the 75th percentile.
The IQR is the difference between the third and first quartiles:
IQR = Q3 - Q1
It represents the range spanned by the middle 50% of the data. Because it discards the lowest 25% and the highest 25% of values, it's not affected by extreme outliers in the tails of the distribution. This makes it a particularly useful measure for skewed datasets or data where outliers are present. Box plots, which we'll encounter in visualization sections, graphically represent the IQR.
Using Pandas, we calculate the IQR using the quantile() method:
q1_set1 = df['scores_set1'].quantile(0.25)
q3_set1 = df['scores_set1'].quantile(0.75)
iqr_set1 = q3_set1 - q1_set1
q1_set2 = df['scores_set2'].quantile(0.25)
q3_set2 = df['scores_set2'].quantile(0.75)
iqr_set2 = q3_set2 - q1_set2
print(f"IQR for Set 1: {iqr_set1}")
print(f"IQR for Set 2: {iqr_set2}")
# Output:
# IQR for Set 1: 10.0
# IQR for Set 2: 20.0
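The robustness of the IQR is easy to demonstrate: replacing the top score in Set 1 with a hypothetical outlier of 200 leaves the IQR unchanged, because quartiles depend on rank positions rather than extreme magnitudes:

```python
import pandas as pd

def iqr(s: pd.Series) -> float:
    """Interquartile range: 75th percentile minus 25th percentile."""
    return s.quantile(0.75) - s.quantile(0.25)

clean = pd.Series([70, 75, 80, 85, 90])
with_outlier = pd.Series([70, 75, 80, 85, 200])  # hypothetical outlier

print(iqr(clean))         # 10.0
print(iqr(with_outlier))  # 10.0 -- the middle 50% of the data is untouched
```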
Notice the IQR also reflects that Set 2 is more spread out than Set 1, similar to the range, but it focuses on the central bulk of the data.
While the range and IQR provide useful summaries, they don't incorporate information from all data points to describe spread relative to the center. Variance does exactly this. It measures the average squared difference of each data point from the mean.
Why squared differences? If we simply averaged the differences (xᵢ − mean), the positive and negative differences would cancel each other out, often resulting in a sum close to zero, regardless of the actual spread. Squaring the differences makes all contributions positive and emphasizes larger deviations.
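This cancellation is easy to verify directly; for the second score set:

```python
scores = [60, 70, 80, 90, 100]
mean = sum(scores) / len(scores)  # 80.0

# Raw deviations sum to exactly zero, hiding the spread entirely.
deviations = [x - mean for x in scores]
print(sum(deviations))  # 0.0

# Squared deviations are all non-negative, so the spread is preserved.
squared = [d ** 2 for d in deviations]
print(sum(squared))     # 1000.0
```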
There are two common formulas for variance, depending on whether you're working with an entire population or a sample drawn from a population:
Population Variance (σ²): used if your dataset represents the entire population of interest.

σ² = (1/N) Σ (xᵢ − μ)²,  summing over i = 1, …, N

Here, N is the total number of data points in the population, xᵢ represents each individual data point, and μ is the population mean.
Sample Variance (s²): used if your dataset is a sample and you want to estimate the variance of the larger population from which the sample was drawn.

s² = (1/(n − 1)) Σ (xᵢ − x̄)²,  summing over i = 1, …, n

Here, n is the sample size, xᵢ represents each data point in the sample, and x̄ is the sample mean.
The crucial difference is the denominator: N for the population variance and n − 1 for the sample variance. Dividing by n − 1 (known as Bessel's correction) makes s² an unbiased estimator of the true population variance σ². In practice, especially in data analysis and machine learning, you're almost always working with samples, so the sample variance formula is the one typically used.
The main drawback of variance is its units. If your data represents scores (points), the variance is in units of points-squared, which isn't directly interpretable in the context of the original data scale.
Pandas calculates the sample variance by default using the .var() method:
variance_set1 = df['scores_set1'].var() # ddof=1 by default (sample variance)
variance_set2 = df['scores_set2'].var()
print(f"Sample Variance for Set 1: {variance_set1}")
print(f"Sample Variance for Set 2: {variance_set2}")
# Output:
# Sample Variance for Set 1: 62.5
# Sample Variance for Set 2: 250.0
# To calculate population variance (if the data was the entire population)
pop_variance_set1 = df['scores_set1'].var(ddof=0)
print(f"Population Variance for Set 1: {pop_variance_set1}")
# Output:
# Population Variance for Set 1: 50.0
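One caveat worth noting: NumPy's np.var defaults to the population formula (ddof=0), while pandas defaults to the sample formula (ddof=1), so the two libraries can disagree on the same data unless ddof is set explicitly:

```python
import numpy as np
import pandas as pd

scores = pd.Series([70, 75, 80, 85, 90])

print(scores.var())                       # 62.5 (pandas default: ddof=1, sample)
print(np.var(scores.to_numpy()))          # 50.0 (NumPy default: ddof=0, population)
print(np.var(scores.to_numpy(), ddof=1))  # 62.5, matching pandas
```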
To overcome the interpretability issue of variance's squared units, we use the standard deviation. It's simply the square root of the variance.
The standard deviation represents the typical or average distance of data points from the mean, measured in the original units of the data. This makes it much more intuitive than variance. A smaller standard deviation indicates that data points tend to be close to the mean, while a larger standard deviation indicates that data points are spread out over a wider range.
For our score sets (sample standard deviations):
Set 1: √62.5 ≈ 7.91 points
Set 2: √250 ≈ 15.81 points
This clearly shows that, on average, scores in Set 2 deviate further from the mean (80) than scores in Set 1.
Like variance, the standard deviation uses all data points but is also sensitive to outliers, although the square root lessens the impact compared to variance. It's a cornerstone statistic, often used in conjunction with the mean, especially when data approximates a Normal (Gaussian) distribution.
Pandas calculates the sample standard deviation by default using .std():
std_dev_set1 = df['scores_set1'].std() # ddof=1 by default (sample std dev)
std_dev_set2 = df['scores_set2'].std()
print(f"Sample Standard Deviation for Set 1: {std_dev_set1:.2f}")
print(f"Sample Standard Deviation for Set 2: {std_dev_set2:.2f}")
# Output:
# Sample Standard Deviation for Set 1: 7.91
# Sample Standard Deviation for Set 2: 15.81
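Since the standard deviation is simply the square root of the variance, these results are consistent with the variance values computed earlier:

```python
import math
import pandas as pd

scores_set2 = pd.Series([60, 70, 80, 90, 100])

print(scores_set2.var())             # 250.0 (sample variance from before)
print(math.sqrt(scores_set2.var()))  # square root of the variance...
print(scores_set2.std())             # ...is exactly what .std() returns
```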
Here's a visual comparison using box plots, which clearly show the difference in spread (represented by the box length/IQR and whisker length/range) between the two sets, even though they share the same median (the line inside the box).
Box plot comparing the two score sets. Set 2 exhibits a larger range and IQR (box length), indicating greater dispersion around the common median of 80.
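A box plot like the one described can be produced with pandas' built-in plotting (a minimal sketch, assuming matplotlib is installed; the Agg backend and the output filename are illustrative choices, not requirements):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for saving to a file
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'scores_set1': [70, 75, 80, 85, 90],
                   'scores_set2': [60, 70, 80, 90, 100]})

# DataFrame.boxplot draws one box per column on shared axes.
ax = df.boxplot(column=['scores_set1', 'scores_set2'])
ax.set_ylabel('Score')
plt.savefig('score_spread_boxplot.png')  # illustrative filename
```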
In summary, measures of dispersion quantify the spread of data:
- Range: maximum minus minimum; quick to compute but highly sensitive to outliers.
- Interquartile Range (IQR): Q3 − Q1, the span of the middle 50% of the data; robust to outliers.
- Variance: the average squared deviation from the mean; uses all data points but is expressed in squared units.
- Standard Deviation: the square root of the variance; expressed in the original units and commonly reported alongside the mean.
Choosing the appropriate measure depends on the data characteristics (especially the presence of outliers) and the goals of your analysis. Understanding both central tendency and dispersion is fundamental to effectively summarizing and interpreting datasets.
© 2025 ApX Machine Learning