Hands-on Practical: Visualizing Categorical Features

Alright, let's put the concepts from this chapter into practice. We'll use Seaborn's functions to explore categorical features within a dataset. These hands-on exercises will help solidify your understanding of how to choose and create appropriate visualizations for categorical data.

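For these examples, we'll use the 'tips' dataset, one of Seaborn's example datasets. Note that sns.load_dataset downloads it from Seaborn's online data repository on first use, so an internet connection is needed. The dataset contains information about restaurant bills and tips, including categorical variables such as the day of the week (day), the time of day (time), the sex of the person paying (sex), and smoking status (smoker).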

Setup

First, let's import the necessary libraries and load the dataset. We need Pandas for potential data handling (though Seaborn often handles DataFrames directly), Matplotlib for the underlying plotting engine (and potential customizations), and Seaborn itself.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the example dataset
tips = sns.load_dataset("tips")

# Display the first few rows to understand the data
print(tips.head())

You should see output similar to this, showing columns like total_bill, tip, sex, smoker, day, time, and size.

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
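Seaborn lays out a categorical axis in the column's category order when the column uses pandas' categorical dtype, so it is worth checking that order before plotting. The following is a minimal sketch of inspecting and setting it, assuming a small hand-made frame named toy (its values are made up for illustration and are not rows from tips):

```python
import pandas as pd

# Toy data: values are made up for illustration, not taken from tips
toy = pd.DataFrame({"day": ["Sat", "Thur", "Sun", "Fri", "Sat"]})

# Plain strings carry no order; convert to an ordered categorical
day_order = ["Thur", "Fri", "Sat", "Sun"]
toy["day"] = pd.Categorical(toy["day"], categories=day_order, ordered=True)

print(toy["day"].dtype)                   # the column is now categorical
print(list(toy["day"].cat.categories))    # categories in the order we set
print(toy["day"].sort_values().tolist())  # sorting follows category order
```

On the real data, tips.dtypes and tips['day'].cat.categories should show the same kind of information.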

Now, let's create some plots.

Exercise 1: Visualizing Category Frequencies with `countplot`

Often, the first step in analyzing categorical data is understanding how many observations fall into each category. The countplot function is perfect for this.

Task: Create a plot showing the number of tips recorded for each day of the week.

# Create the countplot
plt.figure(figsize=(8, 5)) # Optional: Adjust figure size for better readability
sns.countplot(
    data=tips, x='day',
    hue='day', legend=False,  # assign x to hue so palette applies without a deprecation warning (seaborn >= 0.13)
    palette=['#74c0fc', '#4dabf7', '#339af0', '#228be6'],  # shades of blue
)
plt.title('Number of Tips Recorded per Day')
plt.xlabel('Day of the Week')
plt.ylabel('Count')
plt.show()

Interpretation: This plot shows the frequency count for each category in the 'day' column. Note that the dataset records only four days (Thur, Fri, Sat, Sun). Saturday and Sunday have the most entries and Friday the fewest, which is consistent with busier weekends at the restaurant.
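The bar heights in a count plot are plain row counts, so they can be checked numerically with pandas. A minimal sketch, assuming a hand-made frame named toy (made-up values, not rows from tips):

```python
import pandas as pd

# Toy data: made-up values for illustration, not rows from tips
toy = pd.DataFrame({"day": ["Sat", "Sun", "Sat", "Thur", "Fri", "Sat", "Sun"]})

# Number of rows per category: the bar heights countplot would draw
counts = toy["day"].value_counts()
print(counts)

# Share of rows per category, useful when groups differ a lot in size
shares = toy["day"].value_counts(normalize=True).round(2)
print(shares)
```

The same calls applied to tips['day'] give the numbers behind the plot above.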

Exercise 2: Comparing Aggregate Statistics with `barplot`

Bar plots are useful for comparing an average numerical value across different categories. Seaborn's barplot automatically calculates the mean (by default) and shows confidence intervals.

Task: Visualize the average total_bill for each day of the week.

# Create the barplot
plt.figure(figsize=(8, 5))
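sns.barplot(
    data=tips, x='day', y='total_bill',
    hue='day', legend=False,  # assign x to hue so palette applies without a deprecation warning (seaborn >= 0.13)
    palette=['#96f2d7', '#63e6be', '#38d9a9', '#20c997'],  # shades of teal
    errorbar='sd',  # error bars show the standard deviation instead of the default 95% confidence interval
)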
plt.title('Average Total Bill per Day')
plt.xlabel('Day of the Week')
plt.ylabel('Average Total Bill ($)')
plt.show()

Average total bill for each day, with error bars showing one standard deviation on either side of the mean.

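Interpretation: This plot displays the mean total_bill for each day. Because the call passes errorbar='sd', the vertical lines (error bars) show one standard deviation on either side of the mean, which describes the spread of the bills rather than the uncertainty of the mean. Omit that argument to get Seaborn's default, a 95% confidence interval, which is the better choice for judging whether differences in average bill between days are likely to be real. Overlapping intervals suggest caution, but they are not a formal statistical test.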
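To see why a standard deviation and a confidence interval answer different questions, it helps to compute both by hand. The sketch below assumes a hand-made frame named toy (made-up bills, not rows from tips) and uses a normal-approximation interval, which is a simplification and not the bootstrap that Seaborn runs by default:

```python
import pandas as pd

# Toy data: made-up bills for illustration, not rows from tips
toy = pd.DataFrame({
    "day": ["Sat"] * 5 + ["Sun"] * 5,
    "total_bill": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 20.0, 22.0, 24.0, 24.0],
})

# mean = bar height, std = spread of the data, sem = standard error of the mean
stats = toy.groupby("day")["total_bill"].agg(["mean", "std", "sem"])

# Approximate 95% interval half-width for the mean (assumption: normal approximation)
stats["ci95_half_width"] = 1.96 * stats["sem"]
print(stats.round(2))
```

The interval for the mean shrinks as a group gains observations, while the standard deviation does not, so the two kinds of error bar can look very different on the same data.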

Exercise 3: Examining Distributions Across Categories with `boxplot`

To understand the spread and central tendency of a numerical variable for different categories, box plots are excellent.

Task: Compare the distribution of tip amounts between smokers and non-smokers.

# Create the boxplot
plt.figure(figsize=(7, 5))
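sns.boxplot(
    data=tips, x='smoker', y='tip',
    hue='smoker', legend=False,  # assign x to hue so palette applies without a deprecation warning (seaborn >= 0.13)
    palette=['#ffc9c9', '#74c0fc'],  # light red and blue
)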
plt.title('Distribution of Tip Amounts by Smoking Status')
plt.xlabel('Smoker')
plt.ylabel('Tip Amount ($)')
plt.show()

Interpretation: Each box shows the median (middle line) and the interquartile range (IQR, the box itself). By default the whiskers extend to the most extreme points within 1.5 × IQR of the box, and points beyond them are drawn individually as potential outliers. By comparing the boxes for smokers ('Yes') and non-smokers ('No'), you can assess differences in typical tip amounts, the spread of tips, and the presence of unusually high or low tips within each group.
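The quantities a box plot draws can be reproduced with a few lines of pandas. This is a minimal sketch of the 1.5 × IQR rule on a hand-made Series named tip (made-up amounts, not values from tips):

```python
import pandas as pd

# Toy data: made-up tip amounts for illustration, not values from tips
tip = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 10.0])

# Box edges and middle line
q1, median, q3 = tip.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box; whiskers stop at the last point inside them
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = tip[(tip < lower_fence) | (tip > upper_fence)]

print(f"median={median}, IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print(outliers.tolist())
```

Running the same steps on tips['tip'] within each smoker group shows which tips the plot marks as outliers.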

Exercise 4: Showing Individual Data Points with `swarmplot`

Sometimes, seeing every data point is informative, especially when comparing distributions across categories with a moderate number of observations. swarmplot arranges points so they don't overlap, giving a sense of density.

Task: Visualize individual tip amounts based on the time of the meal (Lunch or Dinner).

# Create the swarmplot
plt.figure(figsize=(7, 5))
sns.swarmplot(
    data=tips, x='time', y='tip',
    hue='time', legend=False,  # assign x to hue so palette applies without a deprecation warning (seaborn >= 0.13)
    palette=['#ffe066', '#fd7e14'],  # yellow and orange
)
plt.title('Individual Tip Amounts by Time of Day')
plt.xlabel('Time')
plt.ylabel('Tip Amount ($)')
plt.show()

Interpretation: This plot displays each individual tip as a distinct point, positioned according to its value and category ('Lunch' or 'Dinner'). The horizontal arrangement within each category helps visualize the density of tips at different values. It complements the boxplot by showing the raw data points that contribute to the summary statistics.
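One way to combine the summary view with the raw points is to draw a swarm plot on top of a box plot on the same axes. A minimal sketch, assuming a hand-made frame named toy (made-up values, not rows from tips):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data: made-up tip amounts for illustration, not rows from tips
toy = pd.DataFrame({
    "time": ["Lunch"] * 6 + ["Dinner"] * 6,
    "tip": [1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 2.0, 3.0, 3.0, 3.5, 4.0, 6.0],
})

fig, ax = plt.subplots(figsize=(7, 5))
# Summary statistics first, with fliers hidden so no point is drawn twice
sns.boxplot(data=toy, x="time", y="tip", color="white", showfliers=False, ax=ax)
# Raw observations on top
sns.swarmplot(data=toy, x="time", y="tip", color="black", size=4, ax=ax)
ax.set_title("Tips by time of day: summary and raw points")
```

In an interactive session, finish with plt.show(). Passing the same ax to both calls is what places the two layers in one figure.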

Exercise 5: Comparing Estimates and Trends with `pointplot`

Point plots are effective for comparing point estimates (like the mean) and their confidence intervals across different categories, particularly when looking for trends or interactions with a second categorical variable (using hue).

Task: Show the average tip amount according to the size of the party (number of people), separated by the sex of the payer.

# Create the pointplot
plt.figure(figsize=(9, 6))
sns.pointplot(data=tips, x='size', y='tip', hue='sex', palette={'Male': '#1c7ed6', 'Female': '#f06595'}, markers=['o', 's'], linestyles=['-', '--']) # Using blue and pink palette colors
plt.title('Average Tip Amount by Party Size and Payer Gender')
plt.xlabel('Party Size')
plt.ylabel('Average Tip Amount ($)')
plt.legend(title='Payer Gender')
plt.show()

Interpretation: This plot connects the average tip amount (points) for each party size with lines, separately for male and female payers. The vertical lines represent confidence intervals. This visualization makes it easy to compare how average tips change with party size and whether this trend differs between genders. For example, you might observe if tips increase more steeply with party size for one gender compared to the other.
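The points in a point plot are group means, so a grouped table gives the exact values and shows how many rows sit behind each one. A minimal sketch, assuming a hand-made frame named toy (made-up values, not rows from tips):

```python
import pandas as pd

# Toy data: made-up values for illustration, not rows from tips
toy = pd.DataFrame({
    "size": [2, 2, 2, 2, 4, 4, 4, 4],
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"],
    "tip": [2.0, 2.5, 3.0, 2.5, 4.0, 3.0, 5.0, 4.0],
})

# Mean tip per (size, sex) cell: the point heights pointplot would draw
means = toy.groupby(["size", "sex"], observed=True)["tip"].mean().unstack("sex")
print(means)

# Rows per cell: sparsely populated cells lead to wide confidence intervals
cell_counts = toy.groupby(["size", "sex"], observed=True).size().unstack("sex")
print(cell_counts)
```

Checking the cell counts on the real data is worthwhile before reading much into a trend, since party sizes with only a handful of rows may produce means that move around a lot.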

Summary

These exercises demonstrated how to use several important Seaborn functions (countplot, barplot, boxplot, swarmplot, pointplot) to effectively visualize categorical data. You learned to display frequencies, compare average values, examine distributions, show individual data points, and analyze trends across categories. Experiment further by swapping variables, trying different plot types (like violinplot or stripplot), and exploring the various customization options available in Seaborn to gain deeper insights from your own categorical data.
