When analyzing a dataset, one of the first things we want to understand is its "center" or a typical value that represents the data. Where do the data points tend to cluster? Measures of central tendency provide this information. The three most common measures are the mean, median, and mode. Each offers a different perspective on the center of the data, and understanding their differences is important for accurate interpretation.
The mean, often called the average, is the most widely used measure of central tendency. It's calculated by summing all the values in a dataset and dividing by the number of values.
For a dataset with $n$ values $x_1, x_2, \dots, x_n$, the sample mean, usually denoted by $\bar{x}$ (pronounced "x-bar"), is calculated as:

$$
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
$$

Example: Consider the ages of 5 people: 22, 25, 21, 30, 22. To find the mean age:

$$
\bar{x} = \frac{22 + 25 + 21 + 30 + 22}{5} = \frac{120}{5} = 24
$$

The mean age is 24 years.
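The same calculation can be sketched in Python. This is a minimal illustration using only the standard library; the variable names are chosen here for the example and are not part of the original text.

```python
from statistics import mean

ages = [22, 25, 21, 30, 22]

# Manual calculation: sum of the values divided by the number of values.
manual_mean = sum(ages) / len(ages)

# The standard library's statistics.mean performs the same computation.
library_mean = mean(ages)

print(manual_mean)
print(library_mean)
```

Both approaches give a mean age of 24 for this dataset.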
When to use the Mean: The mean is a good measure of the center when the data distribution is roughly symmetrical and doesn't have extreme values (outliers).
Sensitivity to Outliers: A significant drawback of the mean is its sensitivity to outliers. An outlier is a data point that is significantly different from other observations. Let's change the age 30 in our example to 90 (perhaps a data entry error): 22, 25, 21, 90, 22. The new mean is:
$$
\bar{x} = \frac{22 + 25 + 21 + 90 + 22}{5} = \frac{180}{5} = 36
$$

The mean jumps from 24 to 36, which is higher than most of the ages in the group. The single outlier heavily influenced the mean, making it potentially less representative of the typical age.
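The size of the jump can be worked out directly. Replacing one value changes the sum by the difference between the new and old value, so the mean shifts by that difference divided by $n$. The symbols $\Delta \bar{x}$, $x_{\text{old}}$, and $x_{\text{new}}$ are shorthand introduced only for this illustration.

$$
\Delta \bar{x} = \frac{x_{\text{new}} - x_{\text{old}}}{n} = \frac{90 - 30}{5} = 12, \qquad 24 + 12 = 36
$$

In a larger dataset the same erroneous value would move the mean less, because the difference is divided by a larger $n$.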
The median is the middle value in a dataset that has been ordered from smallest to largest. It divides the data into two halves: at least half of the data points are less than or equal to the median, and at least half are greater than or equal to it.
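Calculation: Order the values from smallest to largest. If n is odd, the median is the single middle value, at position (n + 1) / 2. If n is even, the median is the average of the two middle values, at positions n / 2 and n / 2 + 1.
Example (Odd n): Using the original ages: 22, 25, 21, 30, 22. In order, they are 21, 22, 22, 25, 30. With n = 5, the median is the 3rd value, which is 22.
Example (Even n): Consider four ages: 21, 22, 25, 30. They are already in order. With n = 4, the median is the average of the 2nd and 3rd values: (22 + 25) / 2 = 23.5.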
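The odd and even cases follow the standard rule: sort the values, then take the middle one if the count is odd, or the average of the two middle ones if it is even. Below is a minimal sketch of that rule; `median_by_hand` is a hypothetical helper name used only for illustration, and it is compared against the standard library's `statistics.median`.

```python
from statistics import median

def median_by_hand(values):
    """Illustrative median: sort, then pick or average the middle value(s)."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_ages = [22, 25, 21, 30, 22]
even_ages = [21, 22, 25, 30]

print(median_by_hand(odd_ages))   # 22
print(median_by_hand(even_ages))  # 23.5
```

In practice `statistics.median` would normally be used; the hand-written version is only there to make the two cases visible.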
When to use the Median: The median is particularly useful when dealing with skewed distributions or datasets containing outliers. Because it only depends on the middle value(s), it's not affected by extreme values at the ends of the distribution.
Robustness to Outliers: Let's revisit the outlier example: 21, 22, 22, 25, 90 (ordered ages). The median is still the 3rd value, which is 22. The outlier (90) did not change the median, making it a more robust measure of central tendency in this case compared to the mean (which was 36).
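The contrast between the two measures can be checked side by side. The sketch below uses the two age lists from the text; the names `original`, `with_outlier`, and `summary` are chosen here for the example.

```python
from statistics import mean, median

original = [22, 25, 21, 30, 22]
with_outlier = [22, 25, 21, 90, 22]  # 30 replaced by 90

# Compute (mean, median) for each version of the dataset.
summary = {
    label: (mean(ages), median(ages))
    for label, ages in [("original", original), ("with outlier", with_outlier)]
}

for label, (avg, mid) in summary.items():
    print(f"{label}: mean={avg}, median={mid}")
```

The mean moves from 24 to 36 while the median stays at 22 in both cases.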
The mode is the value that appears most frequently in a dataset.
Calculation: Count the occurrences of each value. The value with the highest count is the mode; if several values tie for the highest count, each of them is a mode.
Example: Using the original ages: 22, 25, 21, 30, 22. The value 22 appears twice, more than any other age. The mode is 22.
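A short sketch of the counting approach, using the standard library (Python 3.8 or later for `multimode`). The `tied` list is made-up data added here only to show what happens when two values share the highest count.

```python
from statistics import mode, multimode

ages = [22, 25, 21, 30, 22]

print(mode(ages))       # 22
print(multimode(ages))  # [22]

# Hypothetical data with a tie: 21 and 22 both appear twice.
tied = [21, 21, 22, 22, 30]
print(multimode(tied))  # [21, 22]
```

`multimode` returns every value that shares the highest count, which makes it the safer choice when ties are possible.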
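Characteristics: A dataset can have more than one mode when several values tie for the highest count, and no meaningful mode when every value occurs equally often. The mode can be found for categorical data as well as numerical data, and it is not pulled toward extreme values.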
When to use the Mode: The mode is useful for identifying the most common category or value in a dataset, especially with categorical data or discrete numerical data with a limited number of values. It's less commonly used as the primary measure of center for continuous numerical data compared to the mean or median.
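For categorical data, where a mean or median is not defined, the mode can be found by counting labels. The sketch below uses `collections.Counter`; the `colors` list is invented for illustration and does not come from the text.

```python
from collections import Counter

# Made-up categorical data for illustration.
colors = ["blue", "red", "blue", "green", "blue", "red"]

counts = Counter(colors)

# most_common(1) returns a list with the single most frequent (value, count) pair.
top_value, top_count = counts.most_common(1)[0]

print(top_value, top_count)  # blue 3
```

The full `counts` mapping is often worth inspecting too, since it shows how far ahead the most common category actually is.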
| Measure | Calculation | Use Case | Sensitivity to Outliers | Applicable Data Types |
|---|---|---|---|---|
| Mean | Sum / Count | Symmetrical data, no outliers | High | Numerical |
| Median | Middle value (ordered data) | Skewed data, data with outliers | Low | Numerical, Ordinal |
| Mode | Most frequent value | Categorical data, finding most common value | Low | Numerical, Categorical |
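The relationship between the mean and median can also give clues about the shape of the data distribution: when the two are close, the distribution is likely to be roughly symmetrical; a mean noticeably larger than the median suggests a right (positive) skew, with a tail of larger values; a mean noticeably smaller than the median suggests a left (negative) skew, with a tail of smaller values.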
Distribution of the original ages dataset [21, 22, 22, 25, 30]. The median and mode are both 22, while the mean is slightly higher at 24, pulled towards the larger value of 30.
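The comparison in this caption can be turned into a rough check. The sketch below is only a rule of thumb, assuming that a mean above the median hints at a right skew and a mean below it hints at a left skew; `skew_hint` is a hypothetical helper name, and the heuristic can mislead on small or unusual datasets.

```python
from statistics import mean, median

def skew_hint(values):
    """Rough heuristic: compare mean and median to guess the skew direction."""
    avg, mid = mean(values), median(values)
    if avg > mid:
        return "mean > median: possible right (positive) skew"
    if avg < mid:
        return "mean < median: possible left (negative) skew"
    return "mean == median: consistent with a symmetrical distribution"

ages = [21, 22, 22, 25, 30]
print(skew_hint(ages))
```

For the ages dataset the mean (24) exceeds the median (22), so the hint points to a right skew, matching the pull toward 30 described above. With real-valued data, an exact equality test is rarely useful, so a tolerance would be a sensible refinement.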
Choosing the right measure depends on the nature of your data and the question you are trying to answer. Understanding all three provides a more complete picture of the data's central tendency. In the following sections, we'll explore how to measure the spread or variability around this center.
© 2025 ApX Machine Learning