Knowing the center of your data with the mean, median, or mode gives you a valuable reference point. However, it doesn't tell the full story. Consider two sets of exam scores: both might have a mean score of 75, but in one set, scores might cluster tightly between 70 and 80, while in the other, they might range wildly from 40 to 100. Measures of central tendency alone don't capture this difference in spread or variability.
To quantify how spread out the data points are, we use measures of dispersion. The most common and important measures are variance and standard deviation. They tell us how much, on average, the data points deviate from their mean.
Imagine you have a dataset and you've calculated its mean ($\mu$ for a population, $\bar{x}$ for a sample). A natural first step to measure spread might be to calculate each data point's deviation from the mean ($x_i - \mu$ or $x_i - \bar{x}$) and average these deviations. However, there's a problem: the positive deviations (points above the mean) and the negative deviations (points below the mean) always cancel each other out, resulting in an average deviation of zero. This isn't very informative!
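You can see the cancellation concretely with a small sketch in Python (the five exam scores below are made up for illustration):

```python
# Hypothetical exam scores, chosen so the arithmetic is easy to follow
scores = [70, 72, 75, 78, 80]
mean = sum(scores) / len(scores)  # 75.0

# Deviations from the mean: positives and negatives cancel exactly
deviations = [x - mean for x in scores]  # [-5.0, -3.0, 0.0, 3.0, 5.0]
total = sum(deviations)
print(total)  # 0.0
```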
To overcome this cancellation, we square each deviation before averaging them. Squaring makes all deviations positive, ensuring they don't cancel out. This average of the squared deviations is called the variance.
If your dataset represents the entire population of interest, the variance (denoted by $\sigma^2$, the Greek letter sigma squared) is calculated as:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

where $N$ is the number of data points in the population, $x_i$ is the $i$-th data point, and $\mu$ is the population mean.
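The population formula translates directly into Python. This sketch reuses the same made-up scores, treated here as if they were the entire population:

```python
scores = [70, 72, 75, 78, 80]  # hypothetical: the whole population
N = len(scores)
mu = sum(scores) / N  # population mean

# Average of the squared deviations: sum((x_i - mu)^2) / N
sigma_sq = sum((x - mu) ** 2 for x in scores) / N
print(sigma_sq)  # 13.6
```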
More often, we work with a sample, a subset of the population. When estimating the population variance from a sample, we use a slightly different formula for the sample variance (denoted by $s^2$):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

where $n$ is the number of data points in the sample, $x_i$ is the $i$-th data point, and $\bar{x}$ is the sample mean.
Notice the denominator is $n-1$ instead of $n$. This is known as Bessel's correction. Using $n-1$ makes the sample variance an unbiased estimator of the population variance: if you took many samples and calculated their variances using $n-1$, the average of those sample variances would match the true population variance. Intuitively, the sample mean $\bar{x}$ is calculated from the sample itself, so the sample data points sit slightly closer to $\bar{x}$ on average than they do to the true population mean $\mu$. Dividing by the smaller number $n-1$ slightly inflates the variance estimate, counteracting this effect. For a beginner level, the main takeaway is that standard statistical software and libraries use $n-1$ when calculating variance from sample data.
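A quick check of Bessel's correction, assuming NumPy is available. Note that `np.var` defaults to the population formula (`ddof=0`), so you must pass `ddof=1` to get the sample formula:

```python
import numpy as np

scores = [70, 72, 75, 78, 80]  # now treated as a sample, not the population
n = len(scores)
xbar = sum(scores) / n

# Sample variance: divide by n - 1, not n
s_sq = sum((x - xbar) ** 2 for x in scores) / (n - 1)
print(s_sq)  # 17.0

# NumPy's ddof ("delta degrees of freedom") sets the denominator to n - ddof
print(np.var(scores, ddof=1))  # 17.0
```

Forgetting `ddof=1` is a common source of small discrepancies between NumPy results and hand calculations on sample data.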
A larger variance indicates that the data points are, on average, farther away from the mean (more spread out). A smaller variance indicates they are clustered more closely around the mean.
One drawback of variance is its units. If your data represents heights in centimeters (cm), the variance will be in square centimeters (cm²), which is difficult to interpret directly in the context of the original data. This leads us to the standard deviation.
The standard deviation is simply the square root of the variance. It's often preferred over variance because it brings the measure of spread back into the same units as the original data, making it much more interpretable.
For a population, the standard deviation (denoted by $\sigma$) is:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

For a sample, the standard deviation (denoted by $s$) is:

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

The standard deviation provides a measure of the typical or average distance of data points from the mean.
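Taking the square root brings the measure back to the original units. A short sketch with the same hypothetical scores, treated as a sample:

```python
import math

scores = [70, 72, 75, 78, 80]  # hypothetical exam scores (a sample)
n = len(scores)
xbar = sum(scores) / n

s_sq = sum((x - xbar) ** 2 for x in scores) / (n - 1)  # sample variance, 17.0
s = math.sqrt(s_sq)  # sample standard deviation, back in "points" units
print(round(s, 3))  # 4.123
```

A standard deviation of about 4.1 points is far easier to interpret than a variance of 17.0 squared points.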
Let's look at two datasets with the same mean but different standard deviations.
Both datasets might have a mean around 75. Dataset A has values clustered tightly around the mean, resulting in a low standard deviation. Dataset B has values spread much wider, resulting in a high standard deviation.
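This contrast is easy to reproduce numerically. The values below are invented to match the description: both datasets average exactly 75, but one is far more spread out:

```python
import numpy as np

a = np.array([72, 74, 75, 76, 78])   # Dataset A: tightly clustered
b = np.array([50, 65, 75, 85, 100])  # Dataset B: widely spread

print(a.mean(), b.mean())            # 75.0 75.0 -- same center
print(a.std(ddof=1))                 # ~2.24
print(b.std(ddof=1))                 # ~19.04
```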
Understanding variance and standard deviation is fundamental in statistics and machine learning, from exploring and summarizing datasets to preparing features and evaluating models.
While the range gives a quick sense of the total spread, variance and standard deviation provide a more nuanced and widely used measure of how data clusters around the central value. You'll learn how to calculate these efficiently using Python in the "Calculating Descriptive Statistics with Python" section later in this chapter.
© 2025 ApX Machine Learning