While cross-tabulations provide precise numerical counts for the combinations of categories between two variables, visualizations often make the patterns and relationships more immediately apparent. Bar charts are the standard tool for visualizing categorical data, and when dealing with two categorical variables, we commonly use either grouped or stacked bar charts. These plots help us compare frequencies or proportions across different category combinations.
A grouped bar chart places bars representing the counts (or proportions) of one categorical variable side-by-side, grouped according to the categories of the second variable. This format is particularly useful when you want to compare the counts of one variable directly across the different levels of another variable.
Imagine you have a dataset of passengers on a ship, and you've created a cross-tabulation showing survival status (Survived: Yes/No) against passenger class (Pclass: 1st/2nd/3rd). A grouped bar chart could display bars for 'Yes' and 'No' survival side-by-side for each passenger class. This makes it easy to see, for example, if the absolute number of survivors in 1st class was higher or lower than the number of non-survivors in 1st class, and how this comparison changes for 2nd and 3rd class.
Let's see how to create this using Seaborn. Assuming you have a Pandas DataFrame df with columns CategoryA and CategoryB:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample Data (replace with your actual data)
data = {'CategoryA': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'X', 'Y', 'Y'],
'CategoryB': ['A', 'A', 'B', 'B', 'A', 'B', 'B', 'A', 'A', 'B']}
df = pd.DataFrame(data)
# Create the grouped bar chart
plt.figure(figsize=(8, 5)) # Optional: Adjust figure size
sns.countplot(data=df, x='CategoryA', hue='CategoryB', palette=['#4dabf7', '#ff922b'])
# Add labels and title for clarity
plt.xlabel("Category A")
plt.ylabel("Count")
plt.title("Counts of Category B within Category A (Grouped)")
plt.legend(title='Category B')
plt.tight_layout() # Adjust layout
plt.show()
This code uses Seaborn's countplot function. When you specify x='CategoryA' and hue='CategoryB', Seaborn counts the rows in each combination of the two variables and plots those counts as grouped bars. The hue parameter is essential here; it tells Seaborn which variable should be used for grouping the bars within each x category.
A grouped bar chart comparing counts of Category B ('A', 'B') for each level of Category A ('X', 'Y').
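To make the counting step concrete, here is a minimal sketch of the numbers that the bar heights correspond to, using the same sample data as the example above. It relies on pd.crosstab and a groupby; the variable names counts and long_counts are illustrative choices, not part of the original example.

```python
import pandas as pd

# Same sample data as in the grouped bar chart example
df = pd.DataFrame({
    'CategoryA': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'X', 'Y', 'Y'],
    'CategoryB': ['A', 'A', 'B', 'B', 'A', 'B', 'B', 'A', 'A', 'B'],
})

# Wide form: one row per CategoryA level, one column per CategoryB level
counts = pd.crosstab(df['CategoryA'], df['CategoryB'])

# Long form: one row per (CategoryA, CategoryB) combination,
# which is one row per bar in the grouped chart
long_counts = (
    df.groupby(['CategoryA', 'CategoryB'])
      .size()
      .reset_index(name='count')
)

print(counts)
print(long_counts)
```

Each row of long_counts matches one bar in the grouped chart, so the table is a quick way to check that the plot shows what you expect.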
A stacked bar chart also represents the counts of combinations between two categorical variables, but instead of placing bars side-by-side, it stacks the bars representing the second variable's categories on top of each other within each bar of the first variable. The total height of each stacked bar represents the total count for that category of the primary variable (the one on the x-axis).
Stacked charts are excellent for understanding the composition of the second variable within each category of the first variable. Using our ship passenger example, a stacked bar chart with Pclass on the x-axis would show single bars for 1st, 2nd, and 3rd class. Each bar would be segmented by color, showing the number of survivors ('Yes') and non-survivors ('No') within that class. This gives a visual sense of how the survival split differs across classes, although when the classes differ in size, survival rates are easier to compare once each bar is normalized to 100%.
You can create stacked bar charts using Pandas plotting capabilities after creating a cross-tabulation, or directly with Seaborn's histplot (which can handle counts for categorical data too) using the multiple='stack' argument.
Using Pandas and Matplotlib after a cross-tabulation:
import pandas as pd
import matplotlib.pyplot as plt
# Assume 'df' is your DataFrame with 'CategoryA' and 'CategoryB'
# 1. Create the cross-tabulation
cross_tab = pd.crosstab(df['CategoryA'], df['CategoryB'])
# 2. Plot the stacked bar chart
ax = cross_tab.plot(kind='bar', stacked=True, figsize=(8, 5), color=['#4dabf7', '#ff922b'])
# Add labels and title
plt.xlabel("Category A")
plt.ylabel("Count")
plt.title("Composition of Category B within Category A (Stacked)")
plt.xticks(rotation=0) # Keep x-axis labels horizontal
plt.legend(title='Category B')
plt.tight_layout()
plt.show()
Alternatively, using Seaborn's histplot:
import seaborn as sns
import matplotlib.pyplot as plt
# Use histplot for stacking
plt.figure(figsize=(8, 5))
sns.histplot(data=df, x='CategoryA', hue='CategoryB', multiple='stack', palette=['#4dabf7', '#ff922b'], shrink=0.8) # shrink adds gap between bars
plt.xlabel("Category A")
plt.ylabel("Count")
plt.title("Composition of Category B within Category A (Stacked)")
plt.tight_layout()
plt.show()
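Whichever route you choose, the segments of each stacked bar add up to the total count for that x-axis category. The following sketch checks this on the same sample data, assuming the margins option of pd.crosstab to append totals; the names with_totals, bar_heights, and category_totals are illustrative.

```python
import pandas as pd

# Same sample data as in the earlier examples
df = pd.DataFrame({
    'CategoryA': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'X', 'Y', 'Y'],
    'CategoryB': ['A', 'A', 'B', 'B', 'A', 'B', 'B', 'A', 'A', 'B'],
})

# margins=True appends an 'All' row and an 'All' column holding the totals
with_totals = pd.crosstab(df['CategoryA'], df['CategoryB'], margins=True)

# The 'All' column gives the full height of each stacked bar
bar_heights = with_totals.loc[['X', 'Y'], 'All']

# The same totals, computed directly from the first variable
category_totals = df['CategoryA'].value_counts().sort_index()

print(with_totals)
print(bar_heights)
```

If a stacked bar in your plot does not reach the corresponding total, that is usually a sign the data passed to the plot was filtered or aggregated differently than you intended.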
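Comparing proportions is often easier when every bar is normalized to a total of 100%, so that differences in group size no longer affect bar height. This is often called a 100% stacked (or "filled") bar chart. Reusing cross_tab from the Pandas example above: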
# Convert counts to row-wise percentages so that each bar sums to 100
cross_tab_prop = cross_tab.div(cross_tab.sum(axis=1), axis=0) * 100
# Plot the 100% stacked bar chart
ax = cross_tab_prop.plot(kind='bar', stacked=True, figsize=(8, 5), color=['#4dabf7', '#ff922b'])
plt.xlabel("Category A")
plt.ylabel("Percentage (%)")
plt.title("Proportional Composition of Category B within Category A (100% Stacked)")
plt.xticks(rotation=0)
plt.legend(title='Category B', loc='center left', bbox_to_anchor=(1, 0.5)) # Move legend outside
plt.tight_layout()
plt.show()
A 100% stacked bar chart showing the percentage distribution of Category B ('A', 'B') for each level of Category A ('X', 'Y').
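To tie this back to the ship passenger example, here is a small sketch of the row-wise percentages behind a 100% stacked chart. The twelve rows are made-up toy data (not real passenger records), and the names passengers and rates are illustrative. It assumes the normalize='index' option of pd.crosstab, which divides each row by its own total and so is a shorter route to the same kind of table.

```python
import pandas as pd

# Hypothetical toy data: 4 passengers per class, invented for illustration
passengers = pd.DataFrame({
    'Pclass':   ['1st'] * 4 + ['2nd'] * 4 + ['3rd'] * 4,
    'Survived': ['Yes', 'Yes', 'Yes', 'No',
                 'Yes', 'Yes', 'No', 'No',
                 'Yes', 'No', 'No', 'No'],
})

# normalize='index' turns each row into proportions that sum to 1;
# multiplying by 100 expresses them as percentages
rates = pd.crosstab(
    passengers['Pclass'], passengers['Survived'], normalize='index'
) * 100

print(rates)
```

Calling rates.plot(kind='bar', stacked=True) on this table would draw the 100% stacked version, with the 'Yes' segment of each bar showing the survival rate for that class.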
Both types of charts are valuable additions to your EDA toolkit, providing visual insights into the relationships hidden within your categorical data, complementing the numerical summaries obtained from cross-tabulations. Remember to always label your axes clearly and provide a legend when necessary for interpretability.
© 2025 ApX Machine Learning