While understanding the formulas for mean, median, variance, and other descriptive statistics is important, calculating them manually becomes impractical as datasets grow. Fortunately, Python provides powerful libraries specifically designed for numerical computation and data analysis, making these calculations efficient and straightforward.
We'll primarily use Pandas, which builds upon NumPy. Pandas DataFrames and Series objects (which you learned about when loading data in Chapter 1) come with built-in methods for computing most common descriptive statistics.
First, let's ensure we have Pandas and NumPy imported. We'll also create a small sample dataset represented as a Pandas Series to demonstrate the functions. Imagine this data represents the scores of students on a recent quiz.
import pandas as pd
import numpy as np
# Sample quiz scores
scores = pd.Series([85, 92, 78, 88, 92, 65, 78, 85, 90, 75, 88])
print(scores)
This will output our sample data:
0     85
1     92
2     78
3     88
4     92
5     65
6     78
7     85
8     90
9     75
10    88
dtype: int64
Now, let's calculate the descriptive statistics we learned about earlier.
Mean: The average value. Use the .mean() method.
mean_score = scores.mean()
print(f"Mean Score: {mean_score}")
# Output: Mean Score: 83.27272727272727
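To connect the method back to the formula, the mean can also be computed by hand from the sum and the count; this is just a sanity check, not something you would normally do:

```python
import pandas as pd

# Same sample quiz scores as above
scores = pd.Series([85, 92, 78, 88, 92, 65, 78, 85, 90, 75, 88])

# Mean by definition: the sum of the values divided by how many there are
manual_mean = scores.sum() / len(scores)
print(manual_mean)  # same value as scores.mean()
```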
Median: The middle value when the data is sorted. Use the .median() method.
median_score = scores.median()
print(f"Median Score: {median_score}")
# Output: Median Score: 85.0
Mode: The most frequently occurring value(s). Use the .mode() method. Note that the mode can return multiple values if they have the same highest frequency; it returns a Pandas Series.
mode_score = scores.mode()
print(f"Mode Score(s): \n{mode_score}")
# Output:
# Mode Score(s):
# 0    78
# 1    85
# 2    88
# 3    92
# dtype: int64
In this case, the scores 78, 85, 88, and 92 each appear twice, which is the highest frequency, so they are all modes.
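If you want to see the frequencies behind the mode, the standard .value_counts() method (not used above) tabulates how often each score occurs; the modes are simply the values tied for the highest count:

```python
import pandas as pd

# Same sample quiz scores as above
scores = pd.Series([85, 92, 78, 88, 92, 65, 78, 85, 90, 75, 88])

# Frequency of each distinct score, most common first
counts = scores.value_counts()
print(counts)

# The modes are the values whose count equals the maximum count
modes = counts[counts == counts.max()].index.sort_values()
print(list(modes))  # [78, 85, 88, 92]
```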
Range: The difference between the maximum and minimum values. We can calculate this using the .max() and .min() methods.
score_range = scores.max() - scores.min()
print(f"Score Range: {score_range}")
# Output: Score Range: 27
Variance: The average of the squared differences from the mean. Pandas' .var() method calculates the sample variance by default, which uses n−1 in the denominator (where n is the number of data points). This is generally what you want when working with samples.
variance_score = scores.var()
print(f"Sample Variance: {variance_score}")
# Output: Sample Variance: 70.61818181818182
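To see where this number comes from, here is the sample variance computed directly from its definition; it should match scores.var():

```python
import pandas as pd

# Same sample quiz scores as above
scores = pd.Series([85, 92, 78, 88, 92, 65, 78, 85, 90, 75, 88])

n = len(scores)
mean = scores.mean()
# Sum of squared deviations from the mean, divided by n - 1
manual_var = ((scores - mean) ** 2).sum() / (n - 1)
print(manual_var)  # matches scores.var()
```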
Standard Deviation: The square root of the variance, measuring the typical deviation from the mean. Pandas' .std() method calculates the sample standard deviation by default (the square root of the sample variance).
std_dev_score = scores.std()
print(f"Sample Standard Deviation: {std_dev_score}")
# Output: Sample Standard Deviation: 8.403462... (approximately 8.40)
Pandas provides the .quantile() method to calculate percentiles. You pass the desired quantile q as an argument (a value between 0 and 1).
Quartiles correspond to q=0.25 (Q1), q=0.50 (Q2, the median), and q=0.75 (Q3):
q1 = scores.quantile(0.25)
median_q2 = scores.quantile(0.50) # Same as scores.median()
q3 = scores.quantile(0.75)
print(f"Q1 (25th Percentile): {q1}")
print(f"Q2 (50th Percentile/Median): {median_q2}")
print(f"Q3 (75th Percentile): {q3}")
# Output:
# Q1 (25th Percentile): 78.0
# Q2 (50th Percentile/Median): 85.0
# Q3 (75th Percentile): 89.0
Interquartile Range (IQR): The difference between Q3 and Q1.
iqr = q3 - q1
print(f"Interquartile Range (IQR): {iqr}")
# Output: Interquartile Range (IQR): 11.0
You can also calculate other percentiles, like the 90th percentile:
p90 = scores.quantile(0.90)
print(f"90th Percentile: {p90}")
# Output: 90th Percentile: 92.0
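The .quantile() method also accepts a list of quantiles, returning all of them at once as a Series indexed by q:

```python
import pandas as pd

# Same sample quiz scores as above
scores = pd.Series([85, 92, 78, 88, 92, 65, 78, 85, 90, 75, 88])

# All three quartiles in a single call
quartiles = scores.quantile([0.25, 0.50, 0.75])
print(quartiles)
# 0.25    78.0
# 0.50    85.0
# 0.75    89.0
# dtype: float64
```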
Often, you'll want to see several of these descriptive statistics at once. Pandas provides a very convenient method called .describe() that computes the count, mean, standard deviation, minimum, maximum, and the main quartiles (25%, 50%, 75%) all in one go.
summary_stats = scores.describe()
print("Descriptive Statistics Summary:")
print(summary_stats)
This outputs a concise summary:
Descriptive Statistics Summary:
count    11.000000
mean     83.272727
std       8.403462
min      65.000000
25%      78.000000
50%      85.000000
75%      89.000000
max      92.000000
dtype: float64
This single command gives you a great initial overview of your data's distribution and central tendency.
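The .describe() method works on DataFrames too, summarizing each numeric column separately. As a quick sketch, suppose we also had a second, hypothetical column of homework scores for the same students:

```python
import pandas as pd

quiz = [85, 92, 78, 88, 92, 65, 78, 85, 90, 75, 88]
homework = [90, 95, 80, 85, 88, 70, 75, 92, 94, 78, 86]  # made-up values

df = pd.DataFrame({"quiz": quiz, "homework": homework})

# One summary column per numeric column of the DataFrame
summary = df.describe()
print(summary)
```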
If your data happens to be in a NumPy array instead of a Pandas Series or DataFrame, NumPy provides similar functions:
# Convert the Pandas Series to a NumPy array
scores_np = scores.to_numpy()
# NumPy calculations
mean_np = np.mean(scores_np)
median_np = np.median(scores_np)
# Note: NumPy doesn't have a direct mode function like Pandas.
# Need SciPy for that: from scipy import stats; stats.mode(scores_np)
var_np = np.var(scores_np) # Default ddof=0 (Population variance)
std_np = np.std(scores_np) # Default ddof=0 (Population std dev)
var_np_sample = np.var(scores_np, ddof=1) # Sample variance (like Pandas)
std_np_sample = np.std(scores_np, ddof=1) # Sample std dev (like Pandas)
q1_np = np.percentile(scores_np, 25) # Use np.percentile for quantiles
print(f"\nNumPy Mean: {mean_np}")
print(f"NumPy Median: {median_np}")
print(f"NumPy Population Variance: {var_np}")
print(f"NumPy Sample Variance (ddof=1): {var_np_sample}")
print(f"NumPy Q1: {q1_np}")
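If you have SciPy installed, stats.mode fills the gap mentioned in the comment above. Note that, unlike Pandas' .mode(), it reports only a single mode; when several values tie, it returns the smallest of them (and the exact return types vary slightly between SciPy versions):

```python
import numpy as np
from scipy import stats

scores_np = np.array([85, 92, 78, 88, 92, 65, 78, 85, 90, 75, 88])

# ModeResult holds the mode and how many times it occurs;
# with our four-way tie, SciPy reports the smallest tied value, 78
result = stats.mode(scores_np)
print(result.mode, result.count)
```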
Notice the important difference for variance and standard deviation: NumPy's default (ddof=0) calculates the population variance/standard deviation (dividing by n), while Pandas' default (ddof=1) calculates the sample variance/standard deviation (dividing by n−1). When working with samples to infer population characteristics (which is common in machine learning), the sample versions (using ddof=1) are generally preferred. You can control this behavior in NumPy using the ddof (Delta Degrees of Freedom) parameter.
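Because both libraries expose ddof, each can reproduce the other's default; a quick sketch of the equivalence:

```python
import numpy as np
import pandas as pd

scores = pd.Series([85, 92, 78, 88, 92, 65, 78, 85, 90, 75, 88])
scores_np = scores.to_numpy()

# Pandas with ddof=0 gives NumPy's default (population variance)...
print(scores.var(ddof=0), np.var(scores_np))
# ...and NumPy with ddof=1 gives Pandas' default (sample variance)
print(np.var(scores_np, ddof=1), scores.var())
```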
As you can see, Python libraries like Pandas and NumPy make calculating descriptive statistics extremely convenient. This allows you to quickly summarize and understand the essential features of your datasets, which is a fundamental first step in any data analysis or machine learning task.
© 2025 ApX Machine Learning