Visualizing Categorical Variables: Bar Charts

After computing frequency counts for categorical variables, the next logical step is to visualize these distributions. While tables of numbers are precise, graphical representations often provide a more immediate understanding of the relative frequencies and patterns within the data. The most common and effective visualization for displaying the frequency or proportion of categories in a single categorical variable is the bar chart.

A bar chart uses rectangular bars whose lengths are proportional to the values they represent. For univariate categorical analysis, the bars typically show the count (frequency) or proportion of observations falling into each category. This makes it straightforward to compare categories at a glance.

Creating Bar Charts with Python

Python libraries like Matplotlib and Seaborn provide convenient functions for generating bar charts directly from Pandas Series or DataFrames. Seaborn, built on top of Matplotlib, offers functions specifically designed for statistical visualization, often requiring less code for common plots.

A common way to create a bar chart for category counts is using Seaborn's countplot function. It automatically calculates the frequency of each category in the specified column and plots it.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Assume 'df' is your DataFrame and 'product_category' is the column of interest
# Example DataFrame creation:
data = {'product_category': ['Electronics', 'Clothing', 'Groceries', 'Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing', 'Groceries', 'Electronics']}
df = pd.DataFrame(data)

plt.figure(figsize=(8, 5)) # Optional: Adjust figure size
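# Map x to hue so each bar gets its own palette color. Seaborn 0.13+ deprecates
# passing palette without hue; legend=False also requires seaborn 0.13 or later.
sns.countplot(data=df, x='product_category', hue='product_category', palette=['#4dabf7', '#69db7c', '#ff922b', '#be4bdb'], legend=False)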
plt.title('Frequency of Product Categories')
plt.xlabel('Product Category')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right') # Rotate labels if they overlap; right-align them under their bars
plt.tight_layout() # Adjust layout
plt.show()
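
For a plain string column, countplot draws the bars in the order the categories are encountered in the data, which is not necessarily the order of their frequencies. The sketch below shows one way to sort the bars from most to least frequent by passing countplot's order parameter the index returned by value_counts(). It reuses the illustrative data from above; the single bar color is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Same illustrative data as in the example above
data = {'product_category': ['Electronics', 'Clothing', 'Groceries', 'Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing', 'Groceries', 'Electronics']}
df = pd.DataFrame(data)

# value_counts() sorts in descending order, so its index lists the
# categories from most to least frequent
order = df['product_category'].value_counts().index

plt.figure(figsize=(8, 5))
ax = sns.countplot(data=df, x='product_category', order=order, color='#4dabf7')
ax.set_title('Frequency of Product Categories (sorted)')
ax.set_xlabel('Product Category')
ax.set_ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
```

Sorting by frequency suits nominal categories. For ordered categories (such as rating scales), keeping the natural order is usually easier to read.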

Alternatively, you can calculate the value counts using Pandas first and then use Matplotlib's or Pandas' plotting functions:

import matplotlib.pyplot as plt
import pandas as pd

# Assume 'df' is your DataFrame and 'product_category' is the column
# Example DataFrame creation (same as above):
data = {'product_category': ['Electronics', 'Clothing', 'Groceries', 'Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing', 'Groceries', 'Electronics']}
df = pd.DataFrame(data)

category_counts = df['product_category'].value_counts()

plt.figure(figsize=(8, 5))
category_counts.plot(kind='bar', color=['#4dabf7', '#69db7c', '#ff922b', '#be4bdb'])
plt.title('Frequency of Product Categories')
plt.xlabel('Product Category')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right') # Rotate and align labels
plt.tight_layout()
plt.show()

Both methods produce a similar chart. Seaborn's countplot is slightly more direct for simple frequency plots, while the Pandas approach gives you the counts as a Series before plotting, which is useful when you want to inspect, sort, or transform them (for example, into proportions) first.
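
As a minimal sketch of plotting proportions rather than counts, the example below passes normalize=True to value_counts(), so each bar shows a category's share of all observations. It uses the same illustrative data as above; the variable name category_props is just a label chosen for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same illustrative data as in the examples above
data = {'product_category': ['Electronics', 'Clothing', 'Groceries', 'Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing', 'Groceries', 'Electronics']}
df = pd.DataFrame(data)

# normalize=True returns each category's share of all rows instead of its count
category_props = df['product_category'].value_counts(normalize=True)

ax = category_props.plot(kind='bar', color='#4dabf7', figsize=(8, 5))
ax.set_title('Proportion of Product Categories')
ax.set_xlabel('Product Category')
ax.set_ylabel('Proportion')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
```

The chart has the same shape as the count version; only the y-axis scale changes, which makes datasets of different sizes easier to compare.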

Interpreting Bar Charts

When examining a bar chart for a categorical variable, consider these points:

Compare Heights: Which categories appear most frequently? Which are least frequent? The relative heights of the bars directly indicate the relative counts.
Distribution Shape: Although "shape" is more formally associated with numerical distributions, look for any patterns. Is there a dominant category? Is the distribution relatively uniform across categories, or heavily skewed towards one or two?
Number of Categories: How many distinct categories are there? A very large number of categories might make a standard bar chart cluttered. In such cases, you might consider grouping less frequent categories into an "Other" category or using a horizontal bar chart (kind='barh' in Pandas plot or y= instead of x= in Seaborn) for better label readability.
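
The last point can be sketched as follows: categories below a chosen count are merged into "Other", and the result is drawn as a horizontal bar chart. The cutoff of 3 (min_count) is a hypothetical value picked so the small example dataset shows the effect; an appropriate threshold depends on your data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same illustrative data as in the earlier examples
data = {'product_category': ['Electronics', 'Clothing', 'Groceries', 'Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing', 'Groceries', 'Electronics']}
df = pd.DataFrame(data)

counts = df['product_category'].value_counts()

# Hypothetical cutoff: categories with fewer than 3 observations count as rare
min_count = 3
rare = counts[counts < min_count].index

# Keep frequent categories as they are and replace rare ones with 'Other'
grouped = df['product_category'].where(~df['product_category'].isin(rare), 'Other')
grouped_counts = grouped.value_counts()

# Sort ascending so the largest bar ends up at the top of the horizontal chart
ax = grouped_counts.sort_values().plot(kind='barh', color='#69db7c', figsize=(8, 5))
ax.set_title('Frequency of Product Categories (rare categories grouped)')
ax.set_xlabel('Count')
ax.set_ylabel('Product Category')
plt.tight_layout()
plt.show()
```

Horizontal bars leave room for long category labels without rotation, and grouping keeps the total number of observations unchanged.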

[Figure: Plotly bar chart of fictional customer satisfaction ratings. Distribution of customer satisfaction responses, showing 'Satisfied' as the most common rating.]

Bar charts are a fundamental tool in your EDA toolkit for understanding the composition of categorical data. They transform frequency tables into an easily digestible visual format, highlighting the prevalence and distribution of different groups within your dataset.