After getting a feel for the overall goal of descriptive statistics, let's start by pinpointing the "center" of our data. When someone asks what a typical value in a dataset looks like, they're usually asking about its central tendency. We have three primary ways to measure this: the mean, the median, and the mode. Each offers a different perspective on what constitutes the "middle" or "most common" value, and understanding their differences is important for accurate data interpretation.
The mean, specifically the arithmetic mean, is the most common measure of central tendency. It's calculated by summing all the values in a dataset and dividing by the number of values.
If we have a dataset with $n$ observations denoted $x_1, x_2, \ldots, x_n$, the sample mean, often represented by $\bar{x}$ (pronounced "x-bar"), is calculated as:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
For example, consider the dataset {2, 3, 5, 6, 9}.
The mean is (2+3+5+6+9)/5=25/5=5.
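The same calculation can be sketched in plain Python, without any library. The function name `arithmetic_mean` is a hypothetical helper chosen for illustration, not part of any package.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by how many there are."""
    if len(values) == 0:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


dataset = [2, 3, 5, 6, 9]
manual_mean = arithmetic_mean(dataset)
print(f"The mean is: {manual_mean}")  # The mean is: 5.0
```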
The mean incorporates every value in the dataset, which is one of its strengths. However, this also makes it sensitive to outliers or extreme values. A single very large or very small value can significantly pull the mean in its direction, potentially misrepresenting the center of the majority of the data.
Calculating the Mean in Python with Pandas:
Assuming you have your data in a Pandas Series or DataFrame column, calculating the mean is straightforward using the .mean() method.
import pandas as pd
data = pd.Series([2, 3, 5, 6, 9, 100]) # Added an outlier: 100
# Calculate the mean
mean_value = data.mean()
print(f"The dataset: {data.tolist()}")
print(f"The mean is: {mean_value}")
# Output:
# The dataset: [2, 3, 5, 6, 9, 100]
# The mean is: 20.833333333333332
Notice how the single outlier (100) pulled the mean up to roughly 20.83, compared with 5 for the original dataset {2, 3, 5, 6, 9}.
The median is the value that separates the higher half from the lower half of a dataset. To find it, you first need to sort the data in ascending order.
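If the number of observations n is odd, the median is the middle value of the sorted data.
If n is even, the median is the average of the two middle values.
Let's revisit our examples:
Dataset: {2, 3, 5, 6, 9} (sorted). Here n = 5 is odd, so the median is the middle (third) value: 5.
Dataset: {2, 3, 5, 6, 9, 100} (sorted). Here n = 6 is even, so the median is the average of the two middle values: (5 + 6) / 2 = 5.5.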
The primary advantage of the median is its robustness to outliers. Extreme values have little to no impact on the median because it only depends on the middle value(s) after sorting. This makes it a better measure of central tendency for datasets that are skewed or contain significant outliers.
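To make the sort-then-pick-the-middle procedure concrete, here is a minimal sketch in plain Python. The helper name `median_of` is hypothetical. It takes the single middle value when the count is odd and averages the two middle values when the count is even.

```python
def median_of(values):
    """Sort the values, then return the middle one (odd count)
    or the average of the two middle ones (even count)."""
    if len(values) == 0:
        raise ValueError("the median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Input order does not matter because the function sorts first.
median_odd_manual = median_of([9, 2, 6, 3, 5])
median_even_manual = median_of([100, 9, 2, 6, 3, 5])
print(median_odd_manual)   # 5
print(median_even_manual)  # 5.5
```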
Calculating the Median in Python with Pandas:
Pandas provides the .median() method.
import pandas as pd
data_odd = pd.Series([2, 3, 5, 6, 9])
data_even = pd.Series([2, 3, 5, 6, 9, 100]) # With outlier
# Calculate the median
median_odd = data_odd.median()
median_even = data_even.median()
print(f"Dataset 1: {data_odd.tolist()}")
print(f"Median 1: {median_odd}")
print(f"\nDataset 2 (with outlier): {data_even.tolist()}")
print(f"Median 2: {median_even}")
# Output:
# Dataset 1: [2, 3, 5, 6, 9]
# Median 1: 5.0
#
# Dataset 2 (with outlier): [2, 3, 5, 6, 9, 100]
# Median 2: 5.5
Compare the mean (20.83) and median (5.5) for the dataset with the outlier. The median provides a much better sense of the "typical" value among the non-outlier numbers.
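One way to see this robustness is to make the outlier ten times larger and recompute both statistics. The sketch below reuses the example values from this section; the variable names are illustrative only.

```python
import pandas as pd

base_values = [2, 3, 5, 6, 9]
with_outlier_100 = pd.Series(base_values + [100])
with_outlier_1000 = pd.Series(base_values + [1000])

# The mean follows the outlier; the median stays where it was.
mean_100, median_100 = with_outlier_100.mean(), with_outlier_100.median()
mean_1000, median_1000 = with_outlier_1000.mean(), with_outlier_1000.median()

print(f"Outlier 100:  mean={mean_100:.2f}, median={median_100}")
print(f"Outlier 1000: mean={mean_1000:.2f}, median={median_1000}")
```

The median depends only on the two middle values (5 and 6), so it is 5.5 in both cases, while the mean grows with the outlier.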
The mode is the value that appears most frequently in a dataset. A dataset can have:
No mode: every value appears equally often (e.g., {1, 2, 3, 4, 5}).
One mode (unimodal): a single value appears most often (e.g., {1, 2, 2, 3, 4}). The mode is 2.
More than one mode (bimodal or multimodal): two or more values tie for the highest frequency (e.g., {1, 1, 2, 3, 3, 4}). The modes are 1 and 3.
The mode is particularly useful for categorical data (e.g., finding the most common color or category) but can also be used with numerical data, especially discrete data. It's the only measure of central tendency that makes sense for nominal categorical data. Unlike the mean and median, the mode is not necessarily unique.
Calculating the Mode in Python with Pandas:
The .mode() method in Pandas returns a Series containing all modes (since there can be more than one).
import pandas as pd
data_unimodal = pd.Series([1, 2, 2, 3, 4, 4, 4, 5])
data_categorical = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana', 'banana'])  # 'banana' is the single mode
# Calculate the mode
mode_unimodal = data_unimodal.mode()
mode_categorical = data_categorical.mode()
print(f"Dataset 1: {data_unimodal.tolist()}")
print(f"Mode(s) 1: {mode_unimodal.tolist()}")
print(f"\nDataset 2: {data_categorical.tolist()}")
print(f"Mode(s) 2: {mode_categorical.tolist()}")
# Output:
# Dataset 1: [1, 2, 2, 3, 4, 4, 4, 5]
# Mode(s) 1: [4]
#
# Dataset 2: ['apple', 'banana', 'apple', 'orange', 'banana', 'banana']
# Mode(s) 2: ['banana']
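Both datasets above have a single mode. As a further sketch, the snippet below applies .mode() to a dataset with two tied values and to one where every value is distinct. The variable names are illustrative only.

```python
import pandas as pd

data_tied = pd.Series([1, 1, 2, 3, 3, 4])
data_all_unique = pd.Series([1, 2, 3, 4, 5])

# .mode() returns every value that shares the highest frequency.
tied_modes = data_tied.mode().tolist()
all_unique_modes = data_all_unique.mode().tolist()

print(f"Mode(s) with a tie: {tied_modes}")            # [1, 3]
print(f"Mode(s) when all differ: {all_unique_modes}")  # [1, 2, 3, 4, 5]
```

When every value occurs exactly once, Pandas reports all of them as modes, because they all share the highest frequency. This is why a dataset conventionally described as having "no mode" still produces a non-empty result.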
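Which measure to use depends heavily on the nature of your data and what you want to communicate:
Mean: a reasonable default for numerical data that is roughly symmetric and free of extreme outliers, since it uses every value.
Median: preferable for numerical data that is skewed or contains significant outliers, since extreme values barely affect it.
Mode: the natural choice for categorical data, and the only one of the three that makes sense for nominal categories.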
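The relationship between the mean, median, and mode can also offer insights into the skewness of the distribution:
Symmetric distribution: the mean, median, and mode are approximately equal.
Right-skewed (positively skewed) distribution: the long tail of high values pulls the mean upward, so typically mode < median < mean.
Left-skewed (negatively skewed) distribution: the long tail of low values pulls the mean downward, so typically mean < median < mode.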
Figure: Approximate positions of the mean (dashed line), median (dotted line), and mode (peak frequency) for symmetric, right-skewed, and left-skewed distributions. In skewed distributions, the median typically lies between the mode and the mean.
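As a rough numerical illustration of the right-skewed case, the sketch below computes all three measures for a small made-up sample. The values are hypothetical and chosen only to produce a long right tail. It also calls the Pandas .skew() method, which returns a positive number for right-skewed data.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, a few are large.
right_skewed = pd.Series([1, 2, 2, 2, 3, 3, 4, 5, 9, 20])

mode_value = right_skewed.mode().iloc[0]  # most frequent value
median_value = right_skewed.median()      # middle of the sorted data
mean_value = right_skewed.mean()          # pulled upward by 9 and 20
skewness = right_skewed.skew()            # sample skewness

print(f"Mode:   {mode_value}")
print(f"Median: {median_value}")
print(f"Mean:   {mean_value:.1f}")
print(f"Skewness is positive: {skewness > 0}")
```

For this sample, the mode (2) is below the median (3.0), which is below the mean (5.1), so the median sits between the other two. Real datasets do not always follow this ordering exactly, so treat it as a rule of thumb.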
Understanding the mean, median, and mode is a crucial first step in summarizing your data. They tell you where the data tends to cluster, but they don't tell the whole story. Next, we'll explore how to measure how spread out the data is around this central point.
© 2025 ApX Machine Learning