After loading and getting a feel for the dataset's structure, a primary step in univariate analysis for numerical columns is to understand their "center". Where do the values tend to cluster? Measures of central tendency provide a single value summarizing the typical or central point of your data. We'll look at the three most common measures: the mean, median, and mode.
The most familiar measure of central tendency is the mean, often simply called the average. It's calculated by summing all the values in a variable and dividing by the number of values.
For a set of $n$ values $x_1, x_2, \dots, x_n$, the sample mean ($\bar{x}$) is calculated as:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
In Pandas, calculating the mean of a Series (a column in a DataFrame) is straightforward using the .mean() method.
import pandas as pd
import numpy as np
# Sample data representing, perhaps, ages of users
data = {'age': [25, 31, 45, 22, 58, 31, 28, 35]}
df = pd.DataFrame(data)
# Calculate the mean age
mean_age = df['age'].mean()
print(f"Mean Age: {mean_age}")
# Expected Output: Mean Age: 34.375
The mean provides a good sense of the central value for data that is roughly symmetrically distributed. However, it has a significant drawback: it is sensitive to outliers. A single extremely high or low value can pull the mean substantially in its direction, potentially misrepresenting the "typical" value.
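To make the outlier sensitivity concrete, here is a minimal sketch that appends one extreme value to the sample ages and recomputes the mean. The value 250 is a hypothetical data-entry error introduced only for illustration; it is not part of the original dataset.

```python
import pandas as pd

# Same sample ages as in the example above
ages = pd.Series([25, 31, 45, 22, 58, 31, 28, 35], name="age")

# Hypothetical data-entry error: one implausible age of 250 (assumption for illustration)
ages_with_outlier = pd.concat([ages, pd.Series([250], name="age")], ignore_index=True)

mean_clean = ages.mean()
mean_outlier = ages_with_outlier.mean()

print(f"Mean without outlier: {mean_clean}")
print(f"Mean with outlier:    {mean_outlier:.2f}")
```

One bad record out of nine moves the mean from 34.375 to roughly 58.33, which is higher than every genuine age in the sample.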
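The median is the value that separates the higher half of a data sample from the lower half. To find it, sort the data and take the middle value. When the number of values is even, there is no single middle value, so the median is the average of the two middle values.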
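The sort-and-pick procedure can be sketched in plain Python. The helper name manual_median is hypothetical and exists only to show the logic; it assumes a non-empty list of numbers.

```python
def manual_median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: there is a single middle value
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(manual_median([25, 31, 45, 22, 58]))              # odd count  -> 31
print(manual_median([25, 31, 45, 22, 58, 31, 28, 35]))  # even count -> 31.0
```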
Pandas provides the .median() method:
# Calculate the median age using the same DataFrame
median_age = df['age'].median()
print(f"Median Age: {median_age}")
# Sorted ages: [22, 25, 28, 31, 31, 35, 45, 58]
# Middle two values are 31 and 31. Their average is (31+31)/2 = 31.0
# Expected Output: Median Age: 31.0
The median's main advantage over the mean is its robustness to outliers. Extreme values have little to no impact on the median because it only depends on the value(s) in the middle position(s) after sorting. Therefore, the median is often a better measure of central tendency for skewed distributions or datasets with known or suspected outliers. If the mean and median are significantly different, it often suggests skewness or the presence of outliers.
A simple histogram illustrating the age data. The dashed red line indicates the mean (34.375), while the dotted blue line shows the median (31.0). The outlier (58) pulls the mean higher than the median.
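The mean-versus-median comparison can also be run on a more strongly skewed sample. The sketch below uses hypothetical annual incomes (in thousands) with one deliberately extreme value; the numbers and the column name income_k are made up for illustration.

```python
import pandas as pd

# Hypothetical annual incomes in thousands; the last value is deliberately extreme
incomes = pd.Series([32, 35, 38, 40, 41, 45, 48, 52, 60, 400], name="income_k")

mean_income = incomes.mean()
median_income = incomes.median()
gap = mean_income - median_income

print(f"Mean:   {mean_income:.1f}")
print(f"Median: {median_income:.1f}")
print(f"Gap (mean - median): {gap:.1f}")

# A mean well above the median hints at right skew or high outliers
if gap > 0:
    print("Mean exceeds median: possible right skew or high outliers")
```

Here the mean is 79.1 while the median is 43.0. Nine of the ten incomes fall below the mean, so the median is the more representative "typical" value for this sample.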
The mode is the value that appears most frequently in the dataset. A dataset can have one mode (unimodal), two modes (bimodal), or multiple modes (multimodal). It's also possible for a dataset to have no mode if all values occur with the same frequency.
While the mode can be calculated for numerical data, it's often more informative for categorical data or discrete numerical data with a limited number of unique values. For continuous numerical data, the mode might not be very meaningful unless the data has been binned.
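As a rough illustration of these two points, the sketch below takes the mode of a categorical column and then of a continuous column before and after binning. It relies on the pandas .mode() method described in the next paragraph and on pd.cut for binning. The data, the column name plan, and the bin edges are all hypothetical.

```python
import pandas as pd

# Hypothetical categorical column: the mode is the most common category
plans = pd.Series(["free", "pro", "free", "team", "free", "pro"], name="plan")
print(plans.mode().tolist())

# Hypothetical continuous measurements: every value is unique,
# so all eight values tie for "most frequent" and the raw mode says little
heights = pd.Series([161.2, 175.8, 168.4, 172.1, 169.9, 181.3, 166.7, 170.5])
print(len(heights.mode()))

# Bin into 10 cm intervals (assumed edges), then take the mode of the bins
bins = [150, 160, 170, 180, 190]
binned = pd.cut(heights, bins=bins)
modal_bin = binned.mode().iloc[0]
print(f"Most common height range: {modal_bin}")
```

After binning, the modal interval is the one from 160 to 170, which contains four of the eight measurements. The choice of bin edges affects the result, so treat a binned mode as a summary of the chosen bins and not as a property of the raw data.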
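Pandas calculates the mode using the .mode() method. Note that .mode() always returns a Pandas Series, because there might be multiple modes (all values that share the highest frequency). As a consequence, when every value occurs equally often, .mode() returns all of them instead of an empty result.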
# Calculate the mode age
# Note: .mode() returns a Series
mode_age = df['age'].mode()
print(f"Mode Age(s):\n{mode_age}")
# Expected Output:
# Mode Age(s):
# 0    31
# dtype: int64
# Example with multiple modes
data_multi_mode = {'value': [10, 20, 30, 20, 40, 10, 50]}
df_multi = pd.DataFrame(data_multi_mode)
mode_multi = df_multi['value'].mode()
print(f"\nMultiple Modes:\n{mode_multi}")
# Expected Output:
# Multiple Modes:
# 0    10
# 1    20
# dtype: int64
In our age example, the value 31 appears twice, more than any other age, making it the mode. If another age also appeared twice, both would be returned by .mode().
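Because the result is a Series, a little extra handling is often needed before using it downstream. The following is one possible sketch, not the only approach: it converts the modes to a list, extracts a scalar only when the mode is unique, and uses value_counts() to show the frequencies behind the result.

```python
import pandas as pd

ages = pd.Series([25, 31, 45, 22, 58, 31, 28, 35], name="age")

modes = ages.mode()

# All modes as a plain Python list
print(modes.tolist())

# Extract a single value only when exactly one mode exists
if len(modes) == 1:
    single_mode = modes.iloc[0]
    print(f"Single mode: {single_mode}")

# value_counts() sorts values by frequency, most frequent first
counts = ages.value_counts()
print(counts.head(3))
```

Checking len(modes) before calling .iloc[0] avoids silently discarding the other modes when several values tie.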
Understanding these three measures provides a foundational summary of where your data centers. Comparing the mean and median is often a quick check for data skewness, guiding further analysis and visualization choices.
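For a quick side-by-side view, the mean and median can be requested together. The sketch below rebuilds the same age DataFrame and uses .agg() with a list of function names. It also shows that .describe() reports the median as its 50% row.

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 31, 45, 22, 58, 31, 28, 35]})

# Mean and median in one call; the result is a Series indexed by function name
summary = df['age'].agg(['mean', 'median'])
print(summary)

# describe() includes the mean and the median (the 50% row) among its statistics
stats = df['age'].describe()
print(stats[['mean', '50%']])
```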
© 2025 ApX Machine Learning