While numerical variables lend themselves to analysis through measures like mean, median, and standard deviation, categorical variables require a different approach. Since calculating the 'average' category doesn't make sense, we focus instead on the distribution of observations across the different categories present in the variable. The primary way to summarize this distribution is by calculating frequency counts.
Frequency counts tell us how many times each unique category appears in our dataset. This simple count is fundamental to understanding categorical data: it reveals which categories dominate, which are rare, and whether unexpected values (such as typos or inconsistent labels) are present.
Pandas provides a convenient method specifically for this purpose: value_counts(). When applied to a Pandas Series (such as a single column of a DataFrame), it returns a new Series whose index contains the unique values from the original Series and whose values are their corresponding frequencies (counts). By default, the results are sorted in descending order of frequency, making it easy to see the most common categories first.
Let's assume we have a DataFrame df with a column named customer_segment containing categorical data.
import pandas as pd
import numpy as np
# Sample DataFrame
data = {'customer_segment': ['Standard', 'Premium', 'Standard', 'VIP', 'Premium', 'Standard', 'Standard', np.nan, 'Premium']}
df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)
# Calculate frequency counts for 'customer_segment'
segment_counts = df['customer_segment'].value_counts()
print("\nFrequency Counts for customer_segment:")
print(segment_counts)
Running this code would produce output similar to:
Sample DataFrame:
customer_segment
0 Standard
1 Premium
2 Standard
3 VIP
4 Premium
5 Standard
6 Standard
7 NaN
8 Premium
Frequency Counts for customer_segment:
Standard 4
Premium 3
VIP 1
Name: customer_segment, dtype: int64
Notice how value_counts() automatically handles the different categories ('Standard', 'Premium', 'VIP'), counts their occurrences, and sorts the result by frequency. By default, it also excludes missing values (NaN).
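If a different ordering is more convenient, value_counts() also accepts sort and ascending arguments. A quick sketch using the same df as above:

# Least common categories first, useful for spotting rare segments
print(df['customer_segment'].value_counts(ascending=True))

# Skip frequency sorting; the order then follows the data rather than the counts
print(df['customer_segment'].value_counts(sort=False))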
While absolute counts are useful, sometimes it's more informative to understand the proportion or percentage of each category relative to the total number of non-missing observations. This is known as the relative frequency. It helps compare distributions across datasets of different sizes.
To get relative frequencies with value_counts(), use the normalize=True argument.
# Calculate relative frequencies (proportions)
segment_proportions = df['customer_segment'].value_counts(normalize=True)
print("\nRelative Frequencies (Proportions) for customer_segment:")
print(segment_proportions)
# To display as percentages
segment_percentages = df['customer_segment'].value_counts(normalize=True) * 100
print("\nRelative Frequencies (Percentages) for customer_segment:")
print(segment_percentages.round(2).astype(str) + '%') # Nicer formatting
The output would look like this:
Relative Frequencies (Proportions) for customer_segment:
Standard 0.500
Premium 0.375
VIP 0.125
Name: customer_segment, dtype: float64
Relative Frequencies (Percentages) for customer_segment:
Standard 50.0%
Premium 37.5%
VIP 12.5%
Name: customer_segment, dtype: object
This tells us that 50% of the non-missing entries are 'Standard', 37.5% are 'Premium', and 12.5% are 'VIP'.
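Because value_counts() returns an ordinary Series, these proportions are easy to use programmatically. As a small illustration (the 0.8 threshold here is an arbitrary choice for the example, not a pandas default), you might flag a heavily imbalanced variable:

# The result is a regular Series, so individual proportions are easy to pull out
top_share = segment_proportions.iloc[0]   # largest proportion (sorted first by default)
top_label = segment_proportions.index[0]  # its category label

# A simple, hypothetical imbalance check: flag if one category dominates
if top_share > 0.8:
    print(f"'{top_label}' dominates with {top_share:.0%} of observations")
else:
    print(f"Most common segment: '{top_label}' ({top_share:.0%})")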
As mentioned, value_counts() ignores missing (NaN) values by default. In EDA, it's often important to know how many missing values exist for a categorical variable. You can include NaN values in the count by setting the dropna argument to False.
# Calculate frequency counts including NaN values
segment_counts_incl_na = df['customer_segment'].value_counts(dropna=False)
print("\nFrequency Counts (including NaN) for customer_segment:")
print(segment_counts_incl_na)
# Calculate proportions including NaN values
segment_proportions_incl_na = df['customer_segment'].value_counts(normalize=True, dropna=False)
print("\nRelative Frequencies (including NaN) for customer_segment:")
print(segment_proportions_incl_na)
The output now includes the count and proportion of missing values:
Frequency Counts (including NaN) for customer_segment:
Standard 4
Premium 3
VIP 1
NaN 1
Name: customer_segment, dtype: int64
Relative Frequencies (including NaN) for customer_segment:
Standard 0.444444
Premium 0.333333
VIP 0.111111
NaN 0.111111
Name: customer_segment, dtype: float64
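As a sanity check, the NaN count reported by value_counts(dropna=False) should agree with what isna() reports directly. A quick way to confirm:

# Cross-check: count missing values directly
n_missing = df['customer_segment'].isna().sum()
print(f"\nMissing values in customer_segment: {n_missing}")  # expected: 1

# Fraction of missing values relative to all rows
print(f"Proportion missing: {n_missing / len(df):.3f}")      # expected: 0.111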
Analyzing frequency counts and proportions provides a solid quantitative understanding of each categorical variable's distribution. These numerical summaries form the basis for the next logical step: visualizing these distributions using tools like bar charts, which we will cover shortly.