Okay, let's put the concepts from this chapter into practice. We've explored several Seaborn functions designed to help us understand how data is distributed. Now, we'll apply histplot, kdeplot, boxplot, violinplot, and jointplot to a real dataset to gain insights.
First, ensure you have the necessary libraries imported. We'll use Seaborn for plotting, Matplotlib for potential adjustments (like controlling figure size), and Pandas to handle our data. We will use the built-in 'tips' dataset available directly within Seaborn. This dataset contains information about tips given in a restaurant.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load the example dataset
tips = sns.load_dataset("tips")
# Display the first few rows to understand the data
print(tips.head())
The output will show columns like total_bill, tip, sex, smoker, day, time, and size. These columns provide a mix of numerical and categorical data, perfect for exploring distributions.
Let's start by examining the distribution of the total_bill amount. A histogram is a great way to see the frequency of different bill ranges.
Histograms (histplot)
We can create a histogram of the total_bill column using sns.histplot().
plt.figure(figsize=(8, 5)) # Control figure size for better readability
sns.histplot(data=tips, x="total_bill")
plt.title("Distribution of Total Bill Amounts")
plt.xlabel("Total Bill ($)")
plt.ylabel("Frequency")
plt.show()
Distribution of total bill amounts. Most bills are between $10 and $20.
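To make the binning behind a histogram concrete, here is a minimal pure-Python sketch of equal-width bin counting. The bin_counts helper and the bill values are hypothetical, made up for illustration. This is not how Seaborn implements histplot, which also picks the number of bins automatically unless you pass bins.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of opening a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical bill amounts for illustration (not taken from the tips dataset).
bills = [8.5, 10.2, 12.0, 13.4, 15.8, 16.1, 18.9, 21.3, 24.7, 48.5]
counts = bin_counts(bills, n_bins=4)
print(counts)
```

Each count corresponds to the height of one bar. Changing n_bins here plays the same role as changing bins in sns.histplot(): fewer bins hide detail, more bins add noise.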
Sometimes, a smooth curve representing the estimated probability density is helpful. We can overlay it on the histogram by passing kde=True to sns.histplot(), or draw it on its own with sns.kdeplot().
plt.figure(figsize=(8, 5))
sns.histplot(data=tips, x="total_bill", kde=True, color="#1098ad") # Using kde=True
plt.title("Distribution of Total Bill Amounts with KDE")
plt.xlabel("Total Bill ($)")
plt.ylabel("Frequency") # The KDE overlay is rescaled to the count axis, so the axis still shows counts
plt.show()
# Or using kdeplot separately
plt.figure(figsize=(8, 5))
sns.kdeplot(data=tips, x="total_bill", color="#f76707", fill=True) # fill adds color under the curve
plt.title("KDE of Total Bill Amounts")
plt.xlabel("Total Bill ($)")
plt.ylabel("Density")
plt.show()
Smoothed Kernel Density Estimate (KDE) of total bill amounts, highlighting the peak around $15–$20 and the right skew.
The KDE provides a smoother representation of the distribution's shape compared to the histogram's discrete bins.
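As a rough illustration of where that smooth curve comes from, the sketch below builds a Gaussian kernel density estimate by hand: one small bell curve is centered on each observation and the curves are averaged. The gaussian_kde helper, the fixed bandwidth of 3.0, and the bill values are assumptions chosen for illustration, not Seaborn's internals.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function estimating the density at x with Gaussian kernels."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum the contribution of one Gaussian bump per observation.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

# Hypothetical bill amounts for illustration (not taken from the tips dataset).
bills = [8.5, 10.2, 12.0, 13.4, 15.8, 16.1, 18.9, 21.3, 24.7, 48.5]
density = gaussian_kde(bills, bandwidth=3.0)

# Density is higher where observations cluster than in the sparse right tail.
print(density(15.0) > density(40.0))
```

sns.kdeplot() chooses a bandwidth automatically, and its bw_adjust parameter scales that choice: larger values give a smoother curve, smaller values follow the data more closely.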
Often, we want to compare the distribution of a numerical variable across different categories. Let's compare the total_bill distribution based on the day of the week.
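Box plots (boxplot)
A box plot summarizes the distribution using quartiles. The box spans the interquartile range (IQR), from the first quartile to the third, and the line inside marks the median. By default, each whisker extends to the most extreme data point that lies within 1.5 times the IQR of the box edge. Points beyond the whiskers are drawn individually as potential outliers.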
plt.figure(figsize=(8, 6))
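# With seaborn 0.13 or later, passing palette without hue raises a FutureWarning; adding hue="day", legend=False avoids it
sns.boxplot(data=tips, x="day", y="total_bill", palette="viridis") # using a different palette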
plt.title("Total Bill Distribution by Day")
plt.xlabel("Day of the Week")
plt.ylabel("Total Bill ($)")
plt.show()
Comparison of total bill distribution by day. Weekend days (Sat, Sun) tend to have higher median bills and a wider range compared to weekdays.
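The quantities a box plot draws can be sketched with the standard library. The box_stats helper and the bill values below are hypothetical, and the "inclusive" quantile method is an assumption: quartile conventions differ slightly between tools, so the numbers may not match a plotted box exactly.

```python
import statistics

def box_stats(values):
    """Compute quartiles, whisker ends, and outliers using the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points that are still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical bill amounts for illustration (not taken from the tips dataset).
bills = [8.5, 10.2, 12.0, 13.4, 15.8, 16.1, 18.9, 21.3, 24.7, 48.5]
stats = box_stats(bills)
print(stats)
```

Note that the upper whisker ends at an actual observation rather than at the fence itself, which is why whiskers on a real plot are often shorter than 1.5 times the IQR.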
Violin plots (violinplot)
A violin plot combines a box plot with a KDE plot, showing the distribution's shape alongside the summary statistics.
plt.figure(figsize=(8, 6))
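# With seaborn 0.13 or later, passing palette without hue raises a FutureWarning; adding hue="day", legend=False avoids it
sns.violinplot(data=tips, x="day", y="total_bill", palette="plasma", inner="quartile") # inner='quartile' shows quartiles inside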
plt.title("Total Bill Distribution by Day (Violin Plot)")
plt.xlabel("Day of the Week")
plt.ylabel("Total Bill ($)")
plt.show()
Violin plot comparing total bill distribution by day. The shape of the violins shows the density, confirming the wider spread and higher concentration of larger bills on weekends.
Violin plots give a richer understanding of the distribution's shape than standard box plots, showing multimodality if present.
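To see why the density outline matters, consider a sample with two separate clusters. Quartiles alone would describe it with a single box, while a density estimate shows two peaks. The sketch below counts the peaks of a hand-rolled Gaussian KDE. The helpers, the bandwidth of 2.0, and the sample values are all assumptions made up for illustration.

```python
import math

def kde_on_grid(values, bandwidth, grid):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]

def count_peaks(density):
    """Count local maxima in a list of density values."""
    return sum(
        1
        for i in range(1, len(density) - 1)
        if density[i - 1] < density[i] >= density[i + 1]
    )

# Hypothetical sample with two clusters (not taken from the tips dataset).
sample = [9, 10, 10, 11, 12, 28, 29, 30, 30, 31]
grid = [x * 0.5 for x in range(0, 81)]  # 0.0, 0.5, ..., 40.0
density = kde_on_grid(sample, bandwidth=2.0, grid=grid)
n_peaks = count_peaks(density)
print(n_peaks)
```

In a violin plot, each peak would appear as a separate bulge in the outline, which is the visual cue for multimodality.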
To understand the relationship between two numerical variables and their individual distributions simultaneously, we use sns.jointplot(). Let's look at total_bill versus tip.
Joint plots (jointplot)
This function creates a scatter plot showing the relationship between the two variables in the center, with histograms (or KDEs) for each variable along the margins.
sns.jointplot(data=tips, x="total_bill", y="tip", kind="scatter", color="#37b24d") # kind='scatter' is default
plt.suptitle("Joint Distribution of Total Bill and Tip", y=1.02) # Adjust title position
plt.show()
# We can also use kind='kde' for density contours
sns.jointplot(data=tips, x="total_bill", y="tip", kind="kde", color="#7048e8")
plt.suptitle("Joint KDE of Total Bill and Tip", y=1.02)
plt.show()
Joint plot showing the relationship between total bill and tip amount. The scatter plot indicates a positive correlation, while the histograms on the margins show the individual distributions of each variable.
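The positive relationship visible in the scatter plot can be quantified with the Pearson correlation coefficient. jointplot does not report this number itself, so the sketch below computes it by hand. The pearson_r helper and the bill and tip values are hypothetical, chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical bill and tip amounts (not taken from the tips dataset).
bills = [10.0, 15.0, 20.0, 25.0, 40.0]
tips_paid = [1.5, 2.5, 3.0, 3.5, 6.0]
r = pearson_r(bills, tips_paid)
print(round(r, 3))
```

A value close to 1 means the points lie near an upward-sloping line. With the real data, the same figure could be obtained from the tips DataFrame using the pandas corr() method on the two columns.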
This practical exercise demonstrates how Seaborn's distribution plots provide powerful tools for exploring and comparing data distributions, both for single variables and relationships between variables. Experiment further by applying these plots to other numerical and categorical columns in the tips dataset or your own data.
© 2025 ApX Machine Learning