Okay, you've loaded your data and are ready to start exploring it. The first step in understanding any dataset is often calculating basic descriptive statistics. These numbers summarize the data's main characteristics, giving you a quick snapshot of its central value and how spread out the values are. Think of them as the vital signs for your dataset.
We'll focus on two primary types of summary statistics: measures of central tendency and measures of spread (or variability).
Measures of central tendency aim to describe the "typical" or "central" value in your dataset. What single number best represents the entire group? The three most common measures are the mean, median, and mode.
The mean is probably the most familiar measure. It's simply the sum of all the values divided by the number of values. If you have a dataset with n observations represented as $x_1, x_2, \dots, x_n$, the sample mean (often denoted as $\bar{x}$) is calculated as:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$$
Example: Let's say we have the ages of 5 employees: [25, 30, 28, 45, 28].
The sum is 25+30+28+45+28=156.
The number of values is n=5.
The mean age is $\bar{x} = 156/5 = 31.2$ years.
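The calculation above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the variable names (`ages`, `manual_mean`, `library_mean`) are illustrative choices, not part of the original text.

```python
from statistics import mean

# Ages of the 5 employees from the example
ages = [25, 30, 28, 45, 28]

# Manual calculation: sum of the values divided by the number of values
manual_mean = sum(ages) / len(ages)

# The standard library's mean() should agree with the manual result
library_mean = mean(ages)

print(manual_mean)   # 31.2
print(library_mean)  # 31.2
```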
The mean uses every value in the dataset, which is good, but it also makes it sensitive to outliers (extremely high or low values). That 45-year-old pulls the average age up. If that value were 85 instead, the mean would jump significantly, even though most employees are much younger.
The median is the middle value when the data is sorted in ascending order. It splits the dataset exactly in half: 50% of the values are below the median, and 50% are above it.
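To find the median:
Sort the data in ascending order.
If n is odd, the median is the value at position (n+1)/2.
If n is even, the median is the average of the values at positions n/2 and (n/2)+1.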
Example (Odd n): Using the sorted ages [25, 28, 28, 30, 45].
The number of values is n=5 (odd).
The middle value is the (n+1)/2=(5+1)/2=3rd value.
The median (Me) is 28.
Example (Even n): Let's add another age, 22: [22, 25, 28, 28, 30, 45].
The number of values is n=6 (even).
The middle two values are the n/2=6/2=3rd and the (n/2)+1=4th values. These are 28 and 28.
The median (Me) is the average of these two: (28+28)/2=28.
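Both cases can be handled by one small function. The sketch below is one possible way to write it; `manual_median` is a hypothetical helper name, and the result is checked against `statistics.median` from the standard library. Note that Python lists are 0-indexed, so the 3rd value sits at index 2.

```python
from statistics import median

def manual_median(values):
    """Return the median by sorting and picking the middle value(s)."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value (1-based position (n+1)/2)
        return ordered[mid]
    # Even n: average of the two middle values (1-based positions n/2 and n/2 + 1)
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_ages = [25, 30, 28, 45, 28]
even_ages = [22, 25, 30, 28, 45, 28]

print(manual_median(odd_ages))   # 28
print(manual_median(even_ages))  # 28.0
print(median(odd_ages))          # 28
print(median(even_ages))         # 28.0
```

The input does not need to be pre-sorted, since the function sorts a copy of the data itself.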
The median is much less affected by outliers than the mean. If our oldest employee was 85 instead of 45, the sorted list would be [25, 28, 28, 30, 85], and the median would still be 28. This makes the median a better indicator of the "typical" value in datasets with skewed distributions or extreme values.
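To see the difference in outlier sensitivity directly, the following sketch compares the mean and median of the original ages with a copy in which 45 is replaced by 85. The list name `ages_with_outlier` is an illustrative assumption.

```python
from statistics import mean, median

ages = [25, 30, 28, 45, 28]
ages_with_outlier = [25, 30, 28, 85, 28]  # oldest employee is 85 instead of 45

# Original data: mean and median
print(mean(ages), median(ages))  # 31.2 28

# With the outlier: the mean rises by 8 years, the median does not move
print(mean(ages_with_outlier), median(ages_with_outlier))  # 39.2 28
```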
The mode is simply the value that appears most often in the dataset.
Example: In our original age dataset [25, 30, 28, 45, 28], the value 28 appears twice, more than any other value.
The mode is 28.
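A dataset can have:
One mode (unimodal): a single value appears most often, as 28 does in the age example.
More than one mode (bimodal or multimodal): [2, 3, 3, 4, 5, 5, 6] has modes 3 and 5.
No mode: every value appears equally often, as in [10, 20, 30, 40].
The mode is especially useful for categorical data (non-numerical data like "color" or "product type"), where mean and median don't make sense. For numerical data, it tells you the most common specific value.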
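A short sketch of finding modes in Python, assuming Python 3.8 or later for `statistics.multimode`. The `colors` list is made-up illustrative data, not taken from the text. One caveat: when every value appears equally often, `multimode` returns all of the values, whereas the convention used here is to say such a dataset has no mode.

```python
from collections import Counter
from statistics import multimode

ages = [25, 30, 28, 45, 28]
print(multimode(ages))                   # [28]

# Two values tie for most frequent, so both are returned
print(multimode([2, 3, 3, 4, 5, 5, 6]))  # [3, 5]

# All values appear once: multimode returns every value,
# which corresponds to the "no mode" case described in the text
print(multimode([10, 20, 30, 40]))       # [10, 20, 30, 40]

# Categorical data (hypothetical example values): count each category
colors = ["red", "blue", "red", "green"]
print(Counter(colors).most_common(1))    # [('red', 2)]
```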
Often, reporting both the mean and median gives a more complete picture, especially if they differ significantly, suggesting skewness or outliers.
Figure: Histogram of the sample ages [25, 30, 28, 45, 28], showing the calculated mean, median, and mode. Notice how the single higher value (45) pulls the mean slightly higher than the median and mode.
Knowing the center of your data is only part of the story. You also need to know how spread out the data points are. Are they all clustered tightly around the mean, or are they widely dispersed? Measures of spread, or variability, answer this question.
The range is the simplest measure of spread. It's the difference between the maximum and minimum values in the dataset.
Range = Maximum Value - Minimum Value
Example: For our ages [25, 28, 28, 30, 45]:
Maximum = 45
Minimum = 25
Range = 45 - 25 = 20 years.
The range gives a quick sense of the total span of the data, but like the mean, it's highly sensitive to outliers. Just one very high or very low value dramatically affects the range. It also doesn't tell you anything about how the data is distributed between the extremes.
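The range is a one-line calculation with Python's built-in `max` and `min`. The sketch below also shows how a single extreme value changes it; the variable names are illustrative assumptions.

```python
ages = [25, 28, 28, 30, 45]

# Range = maximum value - minimum value
age_range = max(ages) - min(ages)
print(age_range)  # 20

# Replacing 45 with 85 triples the range, even though four of the five values are unchanged
ages_with_outlier = [25, 28, 28, 30, 85]
range_with_outlier = max(ages_with_outlier) - min(ages_with_outlier)
print(range_with_outlier)  # 60
```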
Variance measures the average squared difference of each data point from the mean. It gives you a sense of the overall spread. A larger variance means data points tend to be further from the mean; a smaller variance means they tend to be closer.
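The formula for sample variance ($s^2$) looks a bit complex, but the idea is straightforward:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Let's break it down:
$x_i - \bar{x}$ is the deviation of each value from the sample mean.
$(x_i - \bar{x})^2$ squares each deviation, so negative and positive deviations don't cancel out.
$\sum_{i=1}^{n}$ adds up the squared deviations for all n values.
Dividing by $n-1$ (rather than n) gives the sample variance.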
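Example: Using ages [25, 28, 28, 30, 45] and mean $\bar{x} = 31.2$:
Deviations from the mean: -6.2, -3.2, -3.2, -1.2, 13.8.
Squared deviations: 38.44, 10.24, 10.24, 1.44, 190.44.
Sum of squared deviations: 250.8.
Divide by n-1=4: 250.8/4=62.7.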
The variance is 62.7. What does this number mean? It's in "squared years," which isn't very intuitive. That's where standard deviation comes in.
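The variance calculation can be written out step by step and compared with `statistics.variance`, which also computes the sample variance (dividing by n-1). This is a sketch with illustrative variable names; results are rounded because floating-point arithmetic can leave tiny errors in the last digits.

```python
from statistics import variance

ages = [25, 28, 28, 30, 45]
n = len(ages)

# Step 1: the sample mean
mean_age = sum(ages) / n

# Step 2: squared deviation of each value from the mean
squared_deviations = [(x - mean_age) ** 2 for x in ages]

# Step 3: sum the squared deviations and divide by n - 1
sample_variance = sum(squared_deviations) / (n - 1)

print(round(sample_variance, 2))  # 62.7
print(round(variance(ages), 2))   # 62.7
```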
The standard deviation is simply the square root of the variance. It's often preferred because it brings the measure of spread back into the original units of the data.
Standard Deviation: $s = \sqrt{\text{Variance}} = \sqrt{s^2}$
Example: For our ages, the variance is $s^2 = 62.7$. The standard deviation is $s = \sqrt{62.7} \approx 7.92$ years.
The standard deviation gives you a measure of the typical distance of the data points from the mean. A standard deviation of 7.92 years suggests that the employees' ages typically fall roughly 8 years away from the mean age of 31.2. It is not exactly the average distance, because squaring gives large deviations extra weight.
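In code, the standard deviation is just the square root of the sample variance. The sketch below computes it manually and with `statistics.stdev`, which is the sample (n-1) version; variable names are illustrative.

```python
import math
from statistics import stdev

ages = [25, 28, 28, 30, 45]
mean_age = sum(ages) / len(ages)

# Sample variance: sum of squared deviations divided by n - 1
sample_variance = sum((x - mean_age) ** 2 for x in ages) / (len(ages) - 1)

# Standard deviation: square root of the variance, back in the original units (years)
sample_std = math.sqrt(sample_variance)

print(round(sample_std, 2))   # 7.92
print(round(stdev(ages), 2))  # 7.92
```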
Like the mean, variance and standard deviation are sensitive to outliers because they are based on the mean and involve squared deviations, which heavily weight extreme values.
These summary statistics (mean, median, mode, range, variance, standard deviation) are fundamental building blocks for understanding your data. Calculating them is often the very first step in any Exploratory Data Analysis, providing a concise quantitative description of the dataset's characteristics before you proceed to visualization or more complex modeling.
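As a first-pass check on a new dataset, all six statistics can be gathered in one place. The helper below is a hypothetical convenience function (`summarize` is not from any library), built only on the standard library and using the sample (n-1) variance.

```python
import math
from statistics import mean, median, multimode, variance

def summarize(values):
    """Return the summary statistics from this section as a dict."""
    if len(values) < 2:
        raise ValueError("need at least two values for the sample variance")
    sample_variance = variance(values)
    return {
        "mean": mean(values),
        "median": median(values),
        "modes": multimode(values),
        "range": max(values) - min(values),
        "variance": sample_variance,
        "std_dev": math.sqrt(sample_variance),
    }

ages = [25, 30, 28, 45, 28]
for name, value in summarize(ages).items():
    print(f"{name}: {value}")
```

Returning a dict keeps the results easy to print, compare across columns, or feed into a report.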
© 2025 ApX Machine Learning