While histograms and kernel density estimates (KDEs) provide a detailed look at the shape of a distribution, sometimes you need a more concise summary, especially when comparing multiple distributions. This is where box plots, also known as box-and-whisker plots, come in handy. They provide a standardized way to visualize data based on a five-number summary.
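A box plot visually represents several fundamental descriptive statistics: the median, the first and third quartiles (Q1 and Q3, which bound the box), the whiskers, and any outliers beyond them.
Box plots offer a compact way to grasp the central tendency (median) and spread (IQR) of a distribution, and to identify potentially unusual values (outliers).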
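To make the numbers behind the box concrete, here is a minimal sketch that computes the five-number summary and the IQR with NumPy. The bills array is a hypothetical sample made up for illustration, not data from the tips dataset, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical sample of bill amounts (illustrative values only)
bills = np.array([8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 48.0])

# Quartiles using NumPy's default (linear) interpolation
q1, median, q3 = np.percentile(bills, [25, 50, 75])

# The interquartile range is the height of the box
iqr = q3 - q1

print(f"min={bills.min()}, Q1={q1}, median={median}, Q3={q3}, max={bills.max()}")
print(f"IQR={iqr}")
```

For this sample the box would span 12 to 20 with the median line at 16.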
seaborn.boxplot
Seaborn makes creating box plots straightforward with the seaborn.boxplot function. It works particularly well with Pandas DataFrames.
Let's assume we have loaded the familiar 'tips' dataset:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load the example dataset
tips = sns.load_dataset("tips")
# Display the first few rows
print(tips.head())
Plotting a Single Distribution
To visualize the distribution of a single numerical variable, like total_bill, you can pass the DataFrame column directly:
# Create a box plot for the 'total_bill' column
plt.figure(figsize=(6, 4)) # Optional: Adjust figure size
sns.boxplot(y=tips["total_bill"]) # Use y for a vertical box plot
plt.title("Distribution of Total Bill Amounts")
plt.ylabel("Total Bill ($)")
plt.show()
This code generates a single vertical box plot showing the median, quartiles, whiskers, and outliers for all total bill amounts in the dataset.
Comparing Distributions Across Categories
A significant strength of box plots is comparing distributions across different groups. You typically achieve this by specifying a categorical variable for one axis (x or y) and the numerical variable for the other.
Let's compare total_bill distributions for each day of the week:
# Create box plots comparing 'total_bill' across different 'day' values
plt.figure(figsize=(8, 5)) # Optional: Adjust figure size
sns.boxplot(x="day", y="total_bill", data=tips, palette="Blues")  # x is the category, y the numerical variable; "Blues" is a named palette ("blue" is not a valid palette name)
plt.title("Total Bill Distribution by Day")
plt.xlabel("Day of the Week")
plt.ylabel("Total Bill ($)")
plt.show()
Here, x="day" tells Seaborn to create a separate box plot for each unique value in the 'day' column, using the corresponding total_bill values specified by y="total_bill". The data=tips argument provides the DataFrame. We also used the palette argument to apply a predefined color scheme.
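The per-category grouping can also be checked numerically. The sketch below uses a tiny made-up DataFrame (a hypothetical stand-in for tips, so it runs without downloading anything) and computes the quartiles for each day with pandas. These are the values each box is built from, assuming the default linear interpolation.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data (values are made up)
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Sat", "Sat", "Sat"],
    "total_bill": [10.0, 12.0, 14.0, 20.0, 30.0, 40.0],
})

# One row per day, one column per quartile
summary = (
    df.groupby("day")["total_bill"]
      .quantile([0.25, 0.5, 0.75])
      .unstack()
)
summary.columns = ["Q1", "median", "Q3"]

print(summary)
```

Comparing such a table against the plot is a quick way to confirm that you are reading the boxes correctly.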
You can further segment the data using the hue parameter, creating nested comparisons within each primary category (e.g., comparing smokers and non-smokers within each day).
# Create nested box plots comparing 'total_bill' by 'day' and 'smoker' status
plt.figure(figsize=(10, 6))
sns.boxplot(x="day", y="total_bill", hue="smoker", data=tips, palette="pastel")
plt.title("Total Bill Distribution by Day and Smoker Status")
plt.xlabel("Day of the Week")
plt.ylabel("Total Bill ($)")
plt.legend(title="Smoker") # Add a legend title
plt.show()
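A nested plot draws one box per combination of the x and hue values. As a rough numerical counterpart, the following sketch groups a small made-up DataFrame (hypothetical values, not the real tips data) by both columns and lays the medians out with days as rows and smoker status as columns.

```python
import pandas as pd

# Hypothetical data: two days, smokers and non-smokers on each
df = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun"],
    "smoker": ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
    "total_bill": [20.0, 30.0, 15.0, 25.0, 18.0, 22.0, 10.0, 40.0],
})

# Median per (day, smoker) pair: one cell for each box in the nested plot
medians = (
    df.groupby(["day", "smoker"])["total_bill"]
      .median()
      .unstack("smoker")
)

print(medians)
```

Each cell corresponds to the center line of one box in the nested plot.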
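When looking at a box plot or comparing multiple box plots, focus on the position of the median line (central tendency), the height of the box (the IQR, a measure of spread), and any points beyond the whiskers (potential outliers).
An example box plot showing the median (center line), the box (Q1 to Q3), whiskers (extending to the most extreme data points within 1.5*IQR of the box), and an outlier point.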
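The whisker and outlier rule can be sketched in a few lines of NumPy. This assumes the common 1.5 × IQR convention (the default whis=1.5 in seaborn.boxplot) and uses a hypothetical sample; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical sample with one unusually large value
bills = np.array([8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 48.0])

q1, q3 = np.percentile(bills, [25, 75])
iqr = q3 - q1

# Fences: limits beyond which a point counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that lie inside the fences
inside = bills[(bills >= lower_fence) & (bills <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"whiskers: [{whisker_low}, {whisker_high}]")
print(f"outliers: {outliers}")
```

Note that the upper whisker stops at 22, the largest observation inside the fence, rather than at the fence value of 32 itself.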
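Box plots are particularly effective when you want to compare distributions across several groups, get a concise summary of center and spread, or spot potential outliers.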
They provide less detail about the specific shape of the distribution compared to histograms or KDE plots (e.g., you can't easily see if a distribution is bimodal from a standard box plot). Violin plots, discussed next, attempt to combine the summary aspects of box plots with the shape information of KDE plots.
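A small sketch shows how a box plot's summary can hide bimodality. The sample below is hypothetical: two tight clusters with nothing between them. The quartiles look unremarkable, yet the median falls in a region that contains no data at all, which a histogram makes obvious.

```python
import numpy as np

# Hypothetical bimodal sample: two clusters with an empty gap in between
bimodal = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 9.0, 9.0, 10.0, 10.0, 11.0])

# What a box plot would summarize
q1, median, q3 = np.percentile(bimodal, [25, 50, 75])

# What a histogram would show: counts in the bins [0, 4), [4, 8), [8, 12]
counts, edges = np.histogram(bimodal, bins=[0, 4, 8, 12])

print(f"Q1={q1}, median={median}, Q3={q3}")
print(f"histogram counts: {counts.tolist()}")
```

The box would stretch across the gap with its median line at 6, while the middle histogram bin is empty.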
In summary, seaborn.boxplot offers a powerful and concise method for visualizing summary statistics and comparing distributions, making it an essential tool in your data exploration toolkit.
© 2025 ApX Machine Learning