After exploring individual variables and the relationships between two numerical or two categorical variables, a common analytical task is to understand how a numerical measurement varies across different groups or categories. For instance, how does the average salary (numerical) differ across job titles (categorical)? Or how does customer spending (numerical) vary by membership tier (categorical)? This type of analysis reveals differences between groups in distribution shape, central tendency, and variability.
Visualizations are particularly effective for this comparison. While calculating summary statistics like the mean or median for each group provides a quantitative summary, plots offer a richer, more intuitive understanding of the underlying distributions. We will focus on two powerful visualization techniques for comparing a numerical variable across categories: grouped box plots and violin plots.
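The per-group summary statistics mentioned above can be computed with a pandas `groupby`. The following is a minimal sketch on hypothetical toy data (the column names `Category` and `Value` and all numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numerical column.
df = pd.DataFrame({
    "Category": ["Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta"],
    "Value": [40.0, 50.0, 60.0, 60.0, 65.0, 76.0],
})

# One row per group, one column per summary statistic.
summary = df.groupby("Category")["Value"].agg(["count", "mean", "median", "std"])
print(summary)
```

In this toy data the mean and median of Beta differ (67 versus 65), a hint of asymmetry that a table only suggests and a plot would show directly.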
You've likely encountered box plots in univariate analysis to summarize the distribution of a single numerical variable. They display the median, the first and third quartiles (the box spans the interquartile range, IQR), and potential outliers. By placing multiple box plots side by side, one for each category of a categorical variable, we can directly compare these summary statistics across groups.
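To make the elements of a box plot concrete, the sketch below computes them for a single group using NumPy only. It assumes the common 1.5 × IQR rule for the whiskers, which is the default `whis` value in Matplotlib and Seaborn; the helper name `box_plot_stats` and the sample values are hypothetical.

```python
import numpy as np

def box_plot_stats(values, whis=1.5):
    """Return the numbers a box plot draws for one group of values."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Fences: points beyond them are drawn individually as potential outliers.
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers end at the most extreme data points inside the fences.
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "outliers": values[(values < lower_fence) | (values > upper_fence)],
    }

stats = box_plot_stats([10, 12, 14, 16, 18, 20, 22, 24, 60])
print(stats)
```

Here the value 60 lies above the upper fence (22 + 1.5 × 8 = 34), so it would be plotted as a separate point while the upper whisker stops at 24.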
Using Seaborn, creating grouped box plots is straightforward. You specify the categorical variable for the x-axis (or y-axis) and the numerical variable for the other axis.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Sample data generation (replace with your actual data)
np.random.seed(42)
data = {
    # Repeat each label 50 times so the labels line up with the value blocks below
    'Category': np.repeat(['Alpha', 'Beta', 'Gamma'], 50),
    'Value': np.concatenate([
        np.random.normal(50, 15, 50),  # Alpha
        np.random.normal(65, 10, 50),  # Beta
        np.random.normal(55, 20, 50)   # Gamma
    ])
}
df = pd.DataFrame(data)

# Create the grouped box plot
plt.figure(figsize=(8, 5))  # Set figure size for better readability
# hue + legend=False avoids the palette deprecation warning (seaborn >= 0.13;
# on older versions, drop both arguments)
sns.boxplot(x='Category', y='Value', data=df, hue='Category', legend=False,
            palette=['#74c0fc', '#ffa94d', '#69db7c'])  # Blue, orange, green

# Add plot enhancements
plt.title('Distribution of Value Across Categories')
plt.xlabel('Category Type')
plt.ylabel('Measurement Value')
plt.grid(axis='y', linestyle='--', alpha=0.7)  # Add horizontal grid lines
plt.show()
A grouped box plot comparing the distribution of a numerical 'Value' across three 'Category' types: Alpha, Beta, and Gamma. Beta appears to have a higher median value and less variability (smaller IQR) compared to Alpha and Gamma. Gamma shows the largest spread and a slightly higher median than Alpha.
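Because the categorical variable can go on either axis, a horizontal layout is a small variation that can help when category labels are long. The sketch below swaps the two axes; the toy data is hypothetical and the figure size is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: four measurements per category.
df = pd.DataFrame({
    "Category": ["Alpha"] * 4 + ["Beta"] * 4 + ["Gamma"] * 4,
    "Value": [35, 48, 52, 66, 58, 63, 67, 72, 30, 50, 61, 80],
})

fig, ax = plt.subplots(figsize=(8, 4))
# Numerical variable on x, categorical on y: the boxes are drawn horizontally.
sns.boxplot(data=df, x="Value", y="Category", ax=ax)
ax.set_title("Distribution of Value Across Categories (horizontal)")
# Call plt.show() here in a script or notebook to display the figure.
```

Seaborn infers the orientation from which axis holds the categorical variable, so no extra argument is needed in this case.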
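When interpreting grouped box plots, compare across groups the position of the median lines (central tendency), the height of the boxes and the length of the whiskers (spread), and any points plotted beyond the whiskers (potential outliers).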
Violin plots are an alternative and often more informative method for comparing distributions across categories. They combine a box plot's summary statistics with a kernel density estimate (KDE) plotted on each side. This provides insight into the shape of the distribution, revealing features like multimodality (multiple peaks) that box plots obscure.
The width of the violin at any given value indicates the density of data points around that value. The inner part can optionally show a mini box plot, quartiles, or just the median point.
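To illustrate what the width of a violin encodes, the following sketch computes a Gaussian kernel density estimate by hand with NumPy. It is an illustration under stated assumptions, not Seaborn's implementation: the helper `gaussian_kde_density`, the bandwidth choice (Scott's rule of thumb), and the bimodal sample are all hypothetical.

```python
import numpy as np

def gaussian_kde_density(sample, grid):
    """Estimate the density of `sample` at each point of `grid`."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = sample.size
    # Scott's rule of thumb for the bandwidth (an assumption of this sketch).
    bandwidth = sample.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per data point, averaged over the sample.
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (n * bandwidth)

rng = np.random.default_rng(0)
# Hypothetical bimodal group: two clusters centred near 40 and 70.
sample = np.concatenate([rng.normal(40, 3, 200), rng.normal(70, 3, 200)])

grid = np.array([40.0, 55.0, 70.0])
density = gaussian_kde_density(sample, grid)
print(dict(zip(grid.tolist(), density.round(4).tolist())))
print("median:", round(float(np.median(sample)), 1))
```

The estimated density is high near 40 and 70 and close to zero near 55, even though the median of the sample falls between the two clusters. A box plot of this group would place its median line in the nearly empty middle, whereas a violin would show two bulges.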
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Same sample data as in the box plot example
np.random.seed(42)
data = {
    # Repeat each label 50 times so the labels line up with the value blocks below
    'Category': np.repeat(['Alpha', 'Beta', 'Gamma'], 50),
    'Value': np.concatenate([
        np.random.normal(50, 15, 50),  # Alpha
        np.random.normal(65, 10, 50),  # Beta
        np.random.normal(55, 20, 50)   # Gamma
    ])
}
df = pd.DataFrame(data)

# Create the grouped violin plot
plt.figure(figsize=(8, 5))
# hue + legend=False avoids the palette deprecation warning (seaborn >= 0.13;
# on older versions, drop both arguments)
sns.violinplot(x='Category', y='Value', data=df, hue='Category', legend=False,
               palette=['#74c0fc', '#ffa94d', '#69db7c'],
               inner='quartile')  # Show quartiles inside

# Add plot enhancements
plt.title('Distribution Shape of Value Across Categories')
plt.xlabel('Category Type')
plt.ylabel('Measurement Value')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
A grouped violin plot showing the distribution shape of 'Value' for categories Alpha, Beta, and Gamma. The plot confirms Beta's higher central tendency and tighter distribution. The shapes also suggest Gamma might be slightly more spread out than Alpha, even though their medians are relatively close. All distributions appear roughly unimodal.
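When interpreting violin plots, consider the overall shape of each violin (symmetric, skewed, or multimodal), the values at which it is widest (where the data is most concentrated), and the inner markers showing the median and quartiles.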
Both plot types are valuable tools for exploring the relationship between a numerical and a categorical variable. By visualizing these comparisons, you gain deeper insights into how different groups behave with respect to a quantitative measure, identifying patterns that might be missed by looking only at summary numbers. This understanding is often a stepping stone towards more formal statistical testing or feature engineering.
© 2025 ApX Machine Learning