While understanding the formulas for mean, variance, correlation, and other descriptive statistics is fundamental, calculating them manually for anything larger than a toy dataset quickly becomes impractical. This is where Python's data analysis libraries, particularly Pandas, become indispensable tools for data scientists and machine learning practitioners. Pandas provides efficient and easy-to-use functions to compute a wide array of descriptive statistics on your data, typically stored in Series or DataFrame objects.
Let's assume you have your data loaded into a Pandas DataFrame. If you're following along, you can create a sample DataFrame like this:
import pandas as pd
import numpy as np
# Create sample data
data = {'ExamScore': [78, 85, 92, 65, 72, 88, 95, 81, 76, 80, np.nan, 83],
        'StudyHours': [5, 6, 8, 3, 4, 7, 9, 5.5, 4.5, 5, 2, 6],
        'SleepHours': [7, 6.5, 7.5, 8, 7, 6, 7, 7.5, 8, 6.5, 9, 7]}
df = pd.DataFrame(data)
print(df)
This creates a DataFrame df with scores, study hours, and sleep hours for different students, including one missing exam score (np.nan).
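Because ExamScore contains an np.nan, Pandas stores that column as floating point. Before computing any statistics, you can confirm where the gaps are with a quick check (a minimal sketch using the built-in .isna() method):

# Count missing values in each column; ExamScore should show 1
print(df.isna().sum())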
.describe() Method: A Quick Overview

Often, the first step in exploring a numerical dataset with Pandas is the .describe() method. It provides a concise summary of the central tendency, dispersion, and shape of the distribution for each numerical column in a DataFrame (or for a single Series).
# Get summary statistics for numerical columns
summary_stats = df.describe()
print(summary_stats)
Running this will produce output similar to:
       ExamScore  StudyHours  SleepHours
count  11.000000   12.000000   12.000000
mean   81.363636    5.416667    7.250000
std     8.697962    1.986698    0.811844
min    65.000000    2.000000    6.000000
25%    77.000000    4.375000    6.875000
50%    81.000000    5.250000    7.000000
75%    86.500000    6.250000    7.625000
max    95.000000    9.000000    9.000000
Notice a few things:

- count: Shows the number of non-missing values. ExamScore has 11, reflecting the np.nan.
- mean: The average value.
- std: The standard deviation, measuring spread.
- min, max: Minimum and maximum values.
- 25%, 50%, 75%: These are the quartiles (percentiles). The 50th percentile is the median.

The .describe() method is excellent for getting a quick feel for your data's distribution and scale.
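You can also tailor the summary. For instance, .describe() accepts a percentiles parameter if you want percentiles beyond the default quartiles; a short sketch using the df defined earlier:

# Request custom percentiles in the summary (the median is always included)
custom_summary = df.describe(percentiles=[0.1, 0.9])
print(custom_summary)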
While .describe() is convenient, you'll often need specific statistics. Pandas provides dedicated methods for these. You can apply them to a whole DataFrame (calculating the statistic for each column) or to a single Series (a specific column).
# Calculate mean for each column
means = df.mean()
print("Means:\n", means)
# Calculate median for the 'ExamScore' column
median_score = df['ExamScore'].median()
print(f"\nMedian Exam Score: {median_score}")
# Calculate mode for 'SleepHours'
# Mode can return multiple values if they have the same highest frequency
modes_sleep = df['SleepHours'].mode()
print("\nMode(s) for Sleep Hours:\n", modes_sleep)
These methods handle missing values automatically by default: .mean() and .median() skip NaN values (controlled by their skipna parameter, which defaults to True), while .mode() drops them via its dropna parameter. The missing exam score is therefore excluded from these calculations.
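If you instead want a missing value to propagate into the result, pass skipna=False explicitly; for example:

# With skipna=False, any NaN makes the result NaN
print(df['ExamScore'].mean())              # NaN skipped by default
print(df['ExamScore'].mean(skipna=False))  # nan, because of the missing score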
# Calculate variance for each column (sample variance, ddof=1 by default)
variances = df.var()
print("Variances:\n", variances)
# Calculate standard deviation for 'StudyHours'
std_study = df['StudyHours'].std()
print(f"\nStandard Deviation Study Hours: {std_study:.4f}")
# Calculate minimum and maximum values
min_values = df.min()
max_values = df.max()
print("\nMinimum Values:\n", min_values)
print("\nMaximum Values:\n", max_values)
# Calculate specific percentiles (e.g., 10th and 90th) for 'ExamScore'
p10 = df['ExamScore'].quantile(0.10)
p90 = df['ExamScore'].quantile(0.90)
print(f"\n10th Percentile Exam Score: {p10}")
print(f"90th Percentile Exam Score: {p90}")
# Calculate the Interquartile Range (IQR) for 'ExamScore'
q1 = df['ExamScore'].quantile(0.25)
q3 = df['ExamScore'].quantile(0.75)
iqr = q3 - q1
print(f"IQR for Exam Score: {iqr}")
The .quantile(q) method is versatile for finding any percentile, where q is between 0 and 1. The range can be calculated simply by subtracting the result of .min() from that of .max().
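For instance, .quantile() also accepts a list of probabilities, and the range calculation is a one-liner; a short sketch with the same df:

# Several percentiles in one call (returns a Series indexed by q)
print(df['ExamScore'].quantile([0.25, 0.5, 0.75]))

# Range = max - min, computed for every column at once
print(df.max() - df.min())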
Skewness and kurtosis tell you about the asymmetry and the tail weight of the distribution, respectively.
# Calculate skewness for each column
skewness = df.skew()
print("Skewness:\n", skewness)
# Calculate kurtosis for 'StudyHours'
kurt_study = df['StudyHours'].kurt() # Fisher's definition (normal dist = 0)
# kurt_study = df['StudyHours'].kurtosis() # Same as .kurt()
print(f"\nKurtosis for Study Hours: {kurt_study:.4f}")
Positive skewness indicates a tail extending towards higher values, while negative skewness indicates a tail towards lower values. Kurtosis measures the "tailedness"; higher kurtosis means more outliers or heavier tails compared to a normal distribution.
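You can verify the sign convention on a small, hypothetical series whose shape you know in advance:

# A single large value creates a long right tail -> positive skew
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 20])
print(right_tailed.skew())     # positive

# Mirroring the data flips the tail direction and the sign
print((-right_tailed).skew())  # negative, same magnitude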
To understand the linear relationship between pairs of numerical variables, use the .corr() method on the DataFrame.
# Calculate the pairwise correlation between columns
correlation_matrix = df.corr()
print("\nCorrelation Matrix:\n", correlation_matrix)
This outputs a correlation matrix where each cell (i, j) contains the Pearson correlation coefficient between column i and column j. Missing values are excluded pairwise, so the row with the missing exam score only affects correlations involving ExamScore. The diagonal elements are always 1 (the correlation of a variable with itself).
            ExamScore  StudyHours  SleepHours
ExamScore    1.000000    0.981855   -0.448009
StudyHours   0.981855    1.000000   -0.591823
SleepHours  -0.448009   -0.591823    1.000000
From this, we see a strong positive correlation (about 0.98) between ExamScore and StudyHours, suggesting students who study more tend to get higher scores. There's a moderate negative correlation between StudyHours and SleepHours, perhaps indicating that more study time might correlate with slightly less sleep in this sample. Remember, correlation does not imply causation!
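The .corr() method also works between two individual Series, and supports rank-based alternatives via its method parameter; a brief sketch:

# Correlation between one specific pair of columns
r = df['ExamScore'].corr(df['StudyHours'])
print(f"Pearson r: {r:.4f}")

# Rank-based (Spearman) correlation, less sensitive to outliers
print(df.corr(method='spearman'))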
While numerical summaries are essential, visualizing the data often provides deeper insights. Pandas integrates with Matplotlib, allowing for quick plots directly from DataFrames or Series. For more customized or advanced plots, libraries like Seaborn or Plotly are commonly used.
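As a minimal sketch of the built-in plotting (this assumes Matplotlib is installed, since Pandas delegates to it):

import matplotlib.pyplot as plt

# Quick histogram drawn directly from the Series (the NaN is dropped)
df['ExamScore'].plot(kind='hist', title='Exam Score Distribution')
plt.xlabel('ExamScore')
plt.show()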
Here's how you might quickly visualize the distribution of ExamScore using Plotly after calculating the statistics.
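A minimal sketch using plotly.express (this assumes the plotly package is installed; nbins and the title are illustrative choices):

import plotly.express as px

# Interactive histogram of exam scores; the missing value is not binned
fig = px.histogram(df, x='ExamScore', nbins=8,
                   title='Distribution of Exam Scores')
fig.show()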
The resulting histogram shows the frequency distribution of exam scores in the sample data, complementing the numerical statistics (like mean, median, and skewness) by showing the shape of the score distribution visually.
Using Pandas effectively allows you to move quickly from raw data to meaningful statistical summaries, forming a basis for further analysis, visualization, and model building in machine learning workflows.