Okay, let's put theory into practice. You've learned about various ways to summarize data numerically. Now, we'll apply these techniques to a sample dataset using Python, focusing on how descriptive statistics help us gain initial insights before diving into more complex modeling. We'll use the Pandas library, which is indispensable for data manipulation and analysis in Python.
First, ensure you have the necessary libraries installed (statsmodels is needed later for the OLS trendline in the scatter plot). If not, you can typically install them using pip:
pip install pandas numpy scipy plotly statsmodels
Now, let's import the libraries we'll use in our Python script or Jupyter Notebook:
import pandas as pd
import numpy as np
import scipy.stats as stats
import plotly.express as px
import plotly.graph_objects as go
For this exercise, we'll work with a simulated dataset representing some measurements. Imagine these could be sensor readings, user activity metrics, or any set of observations you might encounter. Let's create this dataset directly using Pandas and NumPy.
# Seed for reproducibility
np.random.seed(42)
# Generate data
n_samples = 150
feature_a = np.random.normal(loc=50, scale=15, size=n_samples)
data = {
    'feature_A': feature_a,
    'feature_B': np.random.gamma(shape=2, scale=10, size=n_samples) + 20,  # Right-skewed
    'feature_C': 0.7 * feature_a + np.random.normal(loc=0, scale=5, size=n_samples) + 10,  # Correlated with feature_A
    'category': np.random.choice(['Type X', 'Type Y', 'Type Z'], size=n_samples, p=[0.4, 0.35, 0.25])
}
df = pd.DataFrame(data)
# Ensure no negative values for typical features
df['feature_A'] = df['feature_A'].clip(lower=0)
df['feature_B'] = df['feature_B'].clip(lower=0)
df['feature_C'] = df['feature_C'].clip(lower=0)
print("Dataset dimensions:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nData types and non-null counts:")
df.info()
The output from df.head() gives us a quick look at the first few rows, while df.info() tells us the number of entries, the column names, the count of non-null values per column, and the data type of each column. We see 150 entries and 4 columns, with no missing values in this case. feature_A, feature_B, and feature_C are numerical (float64), and category is categorical (object).
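df.info() already reports non-null counts, but an explicit per-column check for missing values is a common first step worth showing. The following is a small optional addition, not part of the original output:
# Count missing values in each column (all zeros for this simulated dataset)
print("\nMissing values per column:")
print(df.isna().sum())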
The .describe() method in Pandas is excellent for getting a quick statistical summary of the numerical columns.
# Get summary statistics for numerical columns
summary_stats = df.describe()
print("\nSummary Statistics:")
print(summary_stats)
This output provides several key statistics we've discussed:

count: The number of non-missing observations.
mean: The average value.
std: The standard deviation, measuring spread.
min: The minimum value.
25%: The first quartile (Q1).
50%: The median (second quartile, Q2).
75%: The third quartile (Q3).
max: The maximum value.

Looking at the output, we can already make some observations:

feature_A has a mean around 50, close to its median, suggesting a relatively symmetric distribution. Its standard deviation is about 14.
feature_B has a mean (around 40) noticeably larger than its median (around 36). This suggests a right-skewed distribution. The range (max - min) is quite large compared to feature_A.
feature_C has characteristics somewhat similar to feature_A, with a mean close to its median.

While .describe() is useful, sometimes we need specific statistics or want to calculate them individually for clarity or for non-numerical columns (like the mode).
Let's calculate the mean, median, and mode for feature_B, which seemed skewed.
# Central Tendency for feature_B
mean_b = df['feature_B'].mean()
median_b = df['feature_B'].median()
mode_b = df['feature_B'].mode() # Mode returns every value tied for the highest frequency
print(f"\nFeature B - Mean: {mean_b:.2f}")
print(f"Feature B - Median: {median_b:.2f}")
print(f"Feature B - Mode(s), first few: {mode_b.tolist()[:5]}") # Continuous values are nearly all unique, so many values tie as 'modes'
# Mode for the categorical column
mode_category = df['category'].mode()
print(f"\nCategory - Mode(s): {mode_category.tolist()}")
As expected, the mean of feature_B is pulled higher than the median due to the right skew. The mode represents the most frequent value(s); for a continuous feature like feature_B, where essentially every value is unique, the mode is not very informative, which is why it is usually reserved for discrete or categorical data. For the categorical feature, 'Type X' is the most common category.
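If you want the full frequency breakdown behind that mode rather than just the most common label, value_counts() is a handy optional addition:
# Frequency of each category; pass normalize=True to get proportions instead of counts
print("\nCategory frequencies:")
print(df['category'].value_counts())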
Let's examine the spread of feature_A and feature_B.
# Dispersion for feature_A
variance_a = df['feature_A'].var()
std_dev_a = df['feature_A'].std()
range_a = df['feature_A'].max() - df['feature_A'].min()
iqr_a = df['feature_A'].quantile(0.75) - df['feature_A'].quantile(0.25)
print(f"\nFeature A - Variance: {variance_a:.2f}")
print(f"Feature A - Standard Deviation: {std_dev_a:.2f}")
print(f"Feature A - Range: {range_a:.2f}")
print(f"Feature A - Interquartile Range (IQR): {iqr_a:.2f}")
# Dispersion for feature_B
variance_b = df['feature_B'].var()
std_dev_b = df['feature_B'].std()
range_b = df['feature_B'].max() - df['feature_B'].min()
iqr_b = df['feature_B'].quantile(0.75) - df['feature_B'].quantile(0.25)
print(f"\nFeature B - Variance: {variance_b:.2f}")
print(f"Feature B - Standard Deviation: {std_dev_b:.2f}")
print(f"Feature B - Range: {range_b:.2f}")
print(f"Feature B - Interquartile Range (IQR): {iqr_b:.2f}")
Comparing the standard deviations (14.06 for A vs. 14.49 for B) doesn't immediately reveal the difference in shape, but comparing the range (64.65 vs. 76.56) and IQR (17.50 vs. 16.26) starts to hint at feature_B having more extreme values on one side (the right side, as indicated by the skew). The IQR is often more robust to outliers than the range or standard deviation.
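Because the IQR is robust, it is also a common basis for flagging potential outliers. Here is a minimal sketch using the conventional 1.5 × IQR fences; the 1.5 multiplier is a rule of thumb and this step is an addition to the original walkthrough:
# Flag potential outliers in feature_B using the 1.5 * IQR rule of thumb
q1_b = df['feature_B'].quantile(0.25)
q3_b = df['feature_B'].quantile(0.75)
fence_low = q1_b - 1.5 * (q3_b - q1_b)
fence_high = q3_b + 1.5 * (q3_b - q1_b)
outliers_b = df[(df['feature_B'] < fence_low) | (df['feature_B'] > fence_high)]
print(f"Potential outliers in feature_B: {len(outliers_b)}")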
Let's quantify the shape using skewness and kurtosis. We can use the scipy.stats module.
# Shape for feature_A
skew_a = stats.skew(df['feature_A'])
kurt_a = stats.kurtosis(df['feature_A']) # Fisher’s definition (normal == 0)
print(f"\nFeature A - Skewness: {skew_a:.2f}")
print(f"Feature A - Kurtosis: {kurt_a:.2f}")
# Shape for feature_B
skew_b = stats.skew(df['feature_B'])
kurt_b = stats.kurtosis(df['feature_B'])
print(f"\nFeature B - Skewness: {skew_b:.2f}")
print(f"Feature B - Kurtosis: {kurt_b:.2f}")
feature_A has a skewness close to 0 (-0.12), confirming its relative symmetry. Kurtosis is also near 0 (-0.22), suggesting a peak similar to a normal distribution.
feature_B has a positive skewness (1.02), confirming our earlier observation of a right skew (tail extends to the right). The positive kurtosis (1.15) indicates slightly heavier tails and a sharper peak compared to a normal distribution.

Now, let's examine the linear relationships between the numerical features.
# Calculate the correlation matrix
correlation_matrix = df[['feature_A', 'feature_B', 'feature_C']].corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)
The correlation matrix shows the Pearson correlation coefficient between each pair of variables.
There is a strong positive correlation between feature_A and feature_C, as expected from our data generation process.
feature_B shows only weak correlations with feature_A (around 0.05) and with feature_C.

Remember, correlation measures linear association. A low correlation doesn't necessarily mean no relationship exists, just not a linear one. And critically, correlation does not imply causation. Even though A and C are correlated, we cannot conclude that A causes C or vice-versa based solely on this value.
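If you suspect a relationship that is monotonic but not linear, a rank-based coefficient is one option. A small sketch using Spearman correlation, as an addition to the Pearson analysis above:
# Spearman correlation ranks the values first, so it captures monotonic association
spearman_matrix = df[['feature_A', 'feature_B', 'feature_C']].corr(method='spearman')
print("\nSpearman Correlation Matrix:")
print(spearman_matrix)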
Numerical summaries are powerful, but visualizations often provide more intuitive understanding.
Histograms help visualize the distribution of a single numerical variable.
# Histogram for feature_A
fig_hist_a = px.histogram(df, x='feature_A', nbins=20, title='Distribution of Feature A',
color_discrete_sequence=['#339af0']) # Blue
fig_hist_a.update_layout(bargap=0.1)
fig_hist_a.show()
# Histogram for feature_B
fig_hist_b = px.histogram(df, x='feature_B', nbins=20, title='Distribution of Feature B',
color_discrete_sequence=['#20c997']) # Teal
fig_hist_b.update_layout(bargap=0.1)
fig_hist_b.show()
The histogram of feature_A shows a roughly bell-shaped, symmetric distribution centered near 50.
The histogram of feature_B clearly shows the right skew, with most values clustered on the left and a tail extending towards higher values.
Box plots are excellent for comparing distributions or summarizing a single distribution's quartiles, median, and potential outliers.
# Box plot for all numerical features
fig_box = px.box(df, y=['feature_A', 'feature_B', 'feature_C'], title='Box Plots of Numerical Features',
color_discrete_sequence=['#339af0', '#20c997', '#7048e8']) # Blue, Teal, Violet
fig_box.show()
Box plots visually compare the median (the line inside the box), the IQR (the box itself), and the extent of the whiskers for each feature; points beyond the whiskers are drawn individually as potential outliers. Note the higher median and the longer upper whisker and outliers for feature_B, indicating the right skew.
Scatter plots help visualize the relationship between two numerical variables. Let's plot the highly correlated feature_A and feature_C.
# Scatter plot for feature_A vs feature_C
fig_scatter = px.scatter(df, x='feature_A', y='feature_C', title='Feature A vs Feature C',
trendline='ols', # Add Ordinary Least Squares regression line
color_discrete_sequence=['#be4bdb']) # Grape
fig_scatter.show()
The scatter plot shows a positive linear trend between feature_A and feature_C, confirming the correlation coefficient calculated earlier. The points cluster around the regression line.
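Since the trendline is fit with ordinary least squares via statsmodels, you can also pull out the fitted model if you want the slope, intercept, and R². A brief sketch using Plotly's get_trendline_results helper:
# Inspect the OLS fit behind the trendline (requires statsmodels to be installed)
trend_results = px.get_trendline_results(fig_scatter)
print(trend_results.px_fit_results.iloc[0].summary())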
In this practice section, we applied the descriptive statistics concepts from this chapter to a sample dataset. Using Pandas, SciPy, and Plotly, we calculated measures of central tendency, dispersion, and shape, computed correlations, and created visualizations like histograms, box plots, and scatter plots.
This process of summarizing a dataset is a fundamental first step in any data analysis or machine learning project. It helps you understand the central tendency, spread, and shape of each variable, spot skew and potential outliers, and identify relationships between features.
Armed with these summary insights, you are better prepared to choose appropriate data preprocessing techniques, select suitable machine learning models, and interpret the results of further analyses.