When analyzing relationships between two variables, the approach depends heavily on the types of variables involved. We've seen how scatter plots and correlation help understand numerical vs. numerical relationships. Now, let's examine how to explore the association between two categorical variables. Simple scatter plots aren't suitable here, as plotting discrete categories doesn't reveal much structure. Instead, we rely on frequency counting within the combinations of categories.
The primary tool for this is the cross-tabulation, often called a contingency table. A cross-tabulation is essentially a table that summarizes the frequency (count) of observations for each combination of categories of two categorical variables. It allows us to see how the distribution of one categorical variable changes across the levels of another.
The Pandas library provides a straightforward function, pd.crosstab(), specifically designed for this purpose. Its basic syntax involves passing the two categorical Series (columns) you want to compare.
import pandas as pd
# Assume 'df' is your DataFrame
# 'category_col_1' and 'category_col_2' are the names of your categorical columns
contingency_table = pd.crosstab(df['category_col_1'], df['category_col_2'])
print(contingency_table)
Let's illustrate with a hypothetical example. Imagine a dataset survey_df with columns satisfaction_level ('Low', 'Medium', 'High') and region ('North', 'South', 'East', 'West').
# Sample Data Creation (for illustration)
data = {
    'satisfaction_level': ['Low', 'Medium', 'High', 'Medium', 'Low',
                           'High', 'Medium', 'Low', 'High', 'Medium'],
    'region': ['North', 'South', 'North', 'East', 'West',
               'South', 'West', 'North', 'East', 'South'],
}
survey_df = pd.DataFrame(data)
# Create the cross-tabulation
cross_tab_results = pd.crosstab(survey_df['satisfaction_level'], survey_df['region'])
print(cross_tab_results)
Running this would produce output similar to this table:
region              East  North  South  West
satisfaction_level
High                   1      1      1     0
Low                    0      2      0     1
Medium                 1      0      2     1
Interpreting the Raw Counts:
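Each cell in this table shows the number of observations that fall into that specific combination of categories. For instance, 2 respondents in the 'North' region reported 'Low' satisfaction, 2 respondents in the 'South' reported 'Medium' satisfaction, and no respondent in the 'West' reported 'High' satisfaction.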
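Individual cells can also be read programmatically rather than by eye. The following is a minimal sketch that rebuilds the sample data from this section and looks up two cells with .loc; the variable names low_north and high_west are illustrative only.

```python
import pandas as pd

# Same sample data as in the example above
survey_df = pd.DataFrame({
    'satisfaction_level': ['Low', 'Medium', 'High', 'Medium', 'Low',
                           'High', 'Medium', 'Low', 'High', 'Medium'],
    'region': ['North', 'South', 'North', 'East', 'West',
               'South', 'West', 'North', 'East', 'South'],
})
cross_tab_results = pd.crosstab(survey_df['satisfaction_level'], survey_df['region'])

# Look up a single cell: row label first, then column label
low_north = cross_tab_results.loc['Low', 'North']
high_west = cross_tab_results.loc['High', 'West']

print(low_north)  # 2
print(high_west)  # 0
```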
The table also implicitly includes row totals (e.g., total 'High' satisfaction observations) and column totals (e.g., total 'North' region observations), although pd.crosstab doesn't display them by default. You can add these using the margins=True argument if needed.
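As a short sketch of the margins option, again using the sample data from this section, the call below appends a row and a column of totals. By default pandas labels both of them 'All'; the name table_with_totals is just an illustrative choice.

```python
import pandas as pd

# Same sample data as in the example above
survey_df = pd.DataFrame({
    'satisfaction_level': ['Low', 'Medium', 'High', 'Medium', 'Low',
                           'High', 'Medium', 'Low', 'High', 'Medium'],
    'region': ['North', 'South', 'North', 'East', 'West',
               'South', 'West', 'North', 'East', 'South'],
})

# margins=True adds an 'All' row (column totals) and an 'All' column (row totals)
table_with_totals = pd.crosstab(
    survey_df['satisfaction_level'],
    survey_df['region'],
    margins=True,
)
print(table_with_totals)

# The bottom-right cell holds the grand total of observations
print(table_with_totals.loc['All', 'All'])  # 10
```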
While raw counts are informative, comparing distributions across categories with different total counts can be difficult. For example, if the 'North' region had many more respondents overall than the 'West' region, simply comparing the raw counts of 'High' satisfaction might be misleading.
To address this, we can normalize the table to show proportions or percentages. The normalize argument in pd.crosstab() controls this:
normalize='index': Calculates proportions across each row (each row sums to 1). This shows the distribution of the column variable for each category of the row variable.
normalize='columns': Calculates proportions down each column (each column sums to 1). This shows the distribution of the row variable for each category of the column variable.
normalize='all': Calculates proportions based on the grand total (all cell values sum to 1). This shows the proportion of the total dataset falling into each cell combination.
Example: Normalizing by Row ('index')
Let's see how the respondents at each satisfaction level are distributed across regions:
# Normalize by row (shows regional distribution for each satisfaction level)
row_normalized_tab = pd.crosstab(survey_df['satisfaction_level'], survey_df['region'], normalize='index')
# Multiply by 100 and round for percentage view (optional)
print((row_normalized_tab * 100).round(1))
Output might look like:
region              East  North  South  West
satisfaction_level
High                33.3   33.3   33.3   0.0
Low                  0.0   66.7    0.0  33.3
Medium              25.0    0.0   50.0  25.0
Interpretation: Of those with 'High' satisfaction, 33.3% are in the East, 33.3% in the North, and 33.3% in the South. Of those with 'Low' satisfaction, 66.7% are in the North and 33.3% are in the West.
Example: Normalizing by Column ('columns')
Now let's see the satisfaction distribution within each region:
# Normalize by column (shows satisfaction distribution within each region)
col_normalized_tab = pd.crosstab(survey_df['satisfaction_level'], survey_df['region'], normalize='columns')
# Multiply by 100 and round for percentage view (optional)
print((col_normalized_tab * 100).round(1))
Output might look like:
region              East  North  South  West
satisfaction_level
High                50.0   33.3   33.3   0.0
Low                  0.0   66.7    0.0  50.0
Medium              50.0    0.0   66.7  50.0
Interpretation: In the 'East' region, 50% have 'High' satisfaction and 50% have 'Medium'. In the 'North' region, 33.3% have 'High' satisfaction and 66.7% have 'Low'.
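For completeness, the third option, normalize='all', can be sketched the same way. Each cell is divided by the grand total (10 observations in the sample data), so the whole table sums to 1. The name overall_tab is illustrative.

```python
import pandas as pd

# Same sample data as in the example above
survey_df = pd.DataFrame({
    'satisfaction_level': ['Low', 'Medium', 'High', 'Medium', 'Low',
                           'High', 'Medium', 'Low', 'High', 'Medium'],
    'region': ['North', 'South', 'North', 'East', 'West',
               'South', 'West', 'North', 'East', 'South'],
})

# Normalize by the grand total (each cell is its share of all observations)
overall_tab = pd.crosstab(
    survey_df['satisfaction_level'],
    survey_df['region'],
    normalize='all',
)

# Multiply by 100 and round for percentage view (optional)
print((overall_tab * 100).round(1))
```

Here the 'Low' and 'North' cell comes out at 20.0, since 2 of the 10 respondents fall into that combination.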
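By examining these tables (both raw counts and normalized versions), you can start to identify potential associations, for example by asking whether the distribution of one variable shifts noticeably from one level of the other variable to the next.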
In our example, the normalized tables suggest some potential patterns: 'Low' satisfaction seems more concentrated in the 'North' and 'West', while 'High' satisfaction appears absent in the 'West'. 'Medium' satisfaction is most prevalent in the 'South'.
Keep in mind that cross-tabulation reveals observed patterns in your specific sample data. Establishing statistical significance often requires formal hypothesis tests (like the Chi-squared test of independence), which are typically covered in more detail during statistical modeling rather than initial EDA. For exploration, identifying these potential relationships is the main goal.
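To give a rough intuition for what such a test compares, the sketch below computes the counts that would be expected if satisfaction and region were independent (row total times column total, divided by the grand total) and the resulting chi-squared statistic. This is an illustration under the sample data from this section, not a full test: it produces no p-value, and with expected counts this small the chi-squared approximation would not be reliable anyway. In practice, scipy.stats.chi2_contingency is a common way to run the full test.

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
survey_df = pd.DataFrame({
    'satisfaction_level': ['Low', 'Medium', 'High', 'Medium', 'Low',
                           'High', 'Medium', 'Low', 'High', 'Medium'],
    'region': ['North', 'South', 'North', 'East', 'West',
               'South', 'West', 'North', 'East', 'South'],
})
observed = pd.crosstab(survey_df['satisfaction_level'], survey_df['region'])

# Expected count per cell under independence:
# (row total * column total) / grand total
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.values.sum()
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / n,
    index=observed.index,
    columns=observed.columns,
)

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
chi2_stat = ((observed - expected) ** 2 / expected).sum().sum()

print(expected.round(2))
print(round(chi2_stat, 2))
```

The larger the statistic, the further the observed table is from what independence would predict.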
Cross-tabulations provide a compact numerical summary. To make these patterns more visually apparent, techniques like stacked or grouped bar charts are commonly used, which we will discuss next.
© 2025 ApX Machine Learning