While bar plots give us a sense of the central tendency (like the mean or median) for a numerical variable across different categories, they don't tell us much about how the data is spread out within each category. Are the values tightly clustered, or widely dispersed? Are there many outliers? To answer these questions and compare the overall distributions, we can use box plots.
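A box plot (or box-and-whisker plot) provides a concise visual summary of a dataset's distribution. It displays five key statistics: the minimum, the first quartile (Q1), the median, the third quartile (Q3), and the maximum. With Seaborn's default settings, the whiskers stop at the most extreme points within 1.5 times the interquartile range (IQR) of the box, and any points beyond them are drawn individually as outliers.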
Seaborn's boxplot function is specifically designed to create these visualizations, making it easy to compare distributions across different categories.
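To make the summary statistics concrete, here is a minimal sketch that computes them by hand with NumPy. The helper name box_plot_stats and the sample values are made up for illustration, and the sketch assumes whiskers at 1.5 × IQR (the default of the whis parameter) with linearly interpolated percentiles; it is not Seaborn's internal code.

```python
import numpy as np


def box_plot_stats(values, whis=1.5):
    """Return the numbers a box plot draws (hypothetical helper, not a Seaborn API)."""
    data = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1

    # Fences: points outside these limits are treated as outliers
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr

    inside = data[(data >= lower_fence) & (data <= upper_fence)]
    outliers = data[(data < lower_fence) | (data > upper_fence)]

    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "iqr": float(iqr),
        "whisker_low": float(inside.min()),   # most extreme point inside the lower fence
        "whisker_high": float(inside.max()),  # most extreme point inside the upper fence
        "outliers": outliers.tolist(),
    }


# Made-up sample with one unusually large value
sample = [10, 12, 14, 15, 16, 18, 20, 22, 48]
stats = box_plot_stats(sample)
print(stats)
```

In this sample the value 48 lies above the upper fence (20 + 1.5 × 6 = 29), so it would be drawn as an individual point rather than being covered by the whisker.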
seaborn.boxplot
The basic syntax involves specifying the categorical variable for one axis (usually x), the numerical variable for the other axis (usually y), and the DataFrame containing the data using the data parameter.
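Let's use the familiar 'tips' example dataset, which Seaborn fetches with load_dataset (it is downloaded from Seaborn's online example-data repository on first use, so an internet connection is needed). We can compare the distribution of total bill amounts for each day of the week.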
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load the example dataset
tips = sns.load_dataset("tips")
# Create the box plot
plt.figure(figsize=(8, 5)) # Adjust figure size for better readability
sns.boxplot(x="day", y="total_bill", data=tips, palette=["#74c0fc", "#ffc078", "#8ce99a", "#ffc9c9"])
# Add title and labels (optional but recommended)
plt.title("Distribution of Total Bill Amounts by Day")
plt.xlabel("Day of the Week")
plt.ylabel("Total Bill ($)")
# Show the plot
plt.show()
Distribution of total bill amounts across different days of the week using Seaborn's boxplot.
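From the plot above, we can read off, for each day, the median bill (the line inside the box), the spread of the middle 50% of bills (the height of the box), and any unusually large bills, which appear as individual points beyond the whiskers.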
Compared to a bar plot showing only the average bill per day, the box plot gives us a much richer understanding of how bill amounts vary within each day.
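Visual impressions from a box plot can be checked against the underlying numbers. The sketch below computes the quartiles and IQR per category with pandas. To keep it runnable offline it uses a small made-up DataFrame (the values are invented, not taken from 'tips'); the same groupby expression should apply to the tips data loaded earlier.

```python
import pandas as pd

# Hypothetical stand-in for the tips data (values invented for illustration)
bills = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Sat", "Sat", "Sat", "Sat"],
    "total_bill": [10.0, 14.0, 18.0, 12.0, 20.0, 24.0, 40.0],
})

# Quartiles of total_bill within each day, one row per day
summary = (
    bills.groupby("day")["total_bill"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ["q1", "median", "q3"]

# The IQR is the height of each box
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```

A larger iqr value corresponds to a taller box, and a higher median to a higher center line, so the table and the plot can be read side by side.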
Like other Seaborn functions, boxplot offers various customization options.
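Orientation: You can create horizontal box plots by swapping x and y so that the numerical variable is on the x-axis. Seaborn normally infers the orientation from the variable types, and setting orient='h' makes the choice explicit.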
# Horizontal box plot
sns.boxplot(x="total_bill", y="day", data=tips, orient='h', palette=["#74c0fc", "#ffc078", "#8ce99a", "#ffc9c9"])
plt.title("Distribution of Total Bill Amounts by Day")
plt.xlabel("Total Bill ($)")
plt.ylabel("Day of the Week")
plt.show()
Order: Control the order in which categories appear using the order parameter, passing a list of category names.
# Specify order of days
day_order = ["Thur", "Fri", "Sat", "Sun"]
sns.boxplot(x="day", y="total_bill", data=tips, order=day_order, palette=["#74c0fc", "#ffc078", "#8ce99a", "#ffc9c9"])
# ... (add titles/labels and show plot)
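The order list does not have to be typed by hand; it can also be derived from the data. As a rough sketch, the snippet below sorts categories by their median value so that boxes would appear from lowest to highest median. The DataFrame is made up for illustration, and median_order is a hypothetical variable name.

```python
import pandas as pd

# Hypothetical data (values invented for illustration)
bills = pd.DataFrame({
    "day": ["Thur"] * 3 + ["Fri"] * 3 + ["Sat"] * 4,
    "total_bill": [10.0, 14.0, 18.0, 8.0, 9.0, 30.0, 12.0, 20.0, 24.0, 40.0],
})

# Median bill per day, sorted ascending; the index gives the category order
median_order = (
    bills.groupby("day")["total_bill"]
    .median()
    .sort_values()
    .index
    .tolist()
)
print(median_order)
```

The resulting list can then be passed as order=median_order in the boxplot call.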
Hue: You can add another categorical dimension using the hue parameter, which creates separate, side-by-side boxes for each level of the hue variable within each main category on the x-axis. For example, you could compare bills by day, split by whether the customer was a smoker.
# Add 'smoker' as a hue dimension
sns.boxplot(x="day", y="total_bill", hue="smoker", data=tips, palette="pastel")
plt.title("Distribution of Total Bill by Day and Smoker Status")
plt.xlabel("Day of the Week")
plt.ylabel("Total Bill ($)")
plt.show()
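Splitting by hue means each box summarizes fewer observations, so it can help to check how many rows fall into each combination before reading too much into a box. A minimal sketch using pd.crosstab on a small made-up DataFrame (the values are invented, not taken from 'tips'):

```python
import pandas as pd

# Hypothetical data (values invented for illustration)
visits = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Sat", "Sat", "Sat", "Sat"],
    "smoker": ["No", "No", "Yes", "Yes", "No", "Yes", "Yes"],
    "total_bill": [10.0, 14.0, 18.0, 12.0, 20.0, 24.0, 40.0],
})

# Number of observations behind each (day, smoker) box
group_sizes = pd.crosstab(visits["day"], visits["smoker"])
print(group_sizes)
```

Combinations with only a handful of rows produce boxes whose quartiles are based on very little data.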
Box plots are particularly effective when you want to compare the distributions of a numerical variable across several groups defined by one or more categorical variables. They provide a quick way to assess differences in central tendency, spread, and the presence of outliers between the groups.