Understanding how your data is distributed and how different variables relate to each other is fundamental to exploratory data analysis (EDA) and preparing data for machine learning models. Seaborn, building upon Matplotlib's foundation, provides specialized functions that make visualizing these aspects straightforward and insightful. This section focuses on using Seaborn to effectively examine single-variable distributions and multi-variable relationships.
Before investigating relationships, it's often useful to understand the characteristics of individual variables. How are the values spread out? Are they symmetric or skewed? Are there outliers? Seaborn offers several plot types for this purpose.
Histograms are a familiar way to visualize the frequency distribution of a continuous variable by binning the data and displaying counts per bin. Seaborn's histplot function simplifies this and can optionally overlay a Kernel Density Estimate (KDE). A KDE replaces the discrete bins with a smooth curve, providing an estimate of the underlying probability density function.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Sample data simulation
np.random.seed(42)
data = pd.DataFrame({
'feature_A': np.random.normal(loc=5, scale=2, size=200),
'feature_B': np.random.exponential(scale=5, size=200),
'category': np.random.choice(['Group 1', 'Group 2', 'Group 3'], size=200, p=[0.4, 0.3, 0.3])
})
# Histogram with KDE
plt.figure(figsize=(8, 5))
sns.histplot(data=data, x='feature_A', kde=True, color='#339af0')
plt.title('Distribution of Feature A (Histogram with KDE)')
plt.xlabel('Feature A Value')
plt.ylabel('Count')  # histplot's default statistic is the count per bin; the KDE curve is scaled to match
plt.show()
# Separate KDE plot
plt.figure(figsize=(8, 5))
sns.kdeplot(data=data, x='feature_B', fill=True, color='#20c997')
plt.title('Density Plot of Feature B')
plt.xlabel('Feature B Value')
plt.ylabel('Density')
plt.show()
The histogram shows clear bins, while the KDE provides a smoother representation of the distribution's shape. Use histplot when you want to see frequency counts within specific ranges, and kdeplot when you are more interested in the overall shape and smoothness of the distribution.
Histogram of Feature A counts overlaid with its Kernel Density Estimate.
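To make the difference between the two views concrete, the sketch below computes both by hand with NumPy: bin counts for the histogram, and a Gaussian kernel density estimate on a grid. The bandwidth uses Scott's rule of thumb, which is an assumption made for this illustration rather than a statement about Seaborn's exact defaults, and the variable names (values, bandwidth, grid) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=5, scale=2, size=200)

# Histogram: count how many observations fall into each equal-width bin
counts, edges = np.histogram(values, bins=10)

# Gaussian KDE: place a small bell curve on every observation and average them.
# Bandwidth from Scott's rule of thumb (assumed here for illustration).
n = values.size
bandwidth = values.std(ddof=1) * n ** (-1 / 5)
grid = np.linspace(values.min() - 3 * bandwidth, values.max() + 3 * bandwidth, 400)
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A density integrates to (approximately) 1, unlike raw counts, which sum to n
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))

print("Bin counts:", counts)
print("Sum of counts:", counts.sum())
print("Area under KDE:", round(float(area), 3))
```

The counts depend on where the bin edges fall, while the density curve depends on the bandwidth; both are tuning choices worth varying when a distribution's shape matters.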
Box plots (or box-and-whisker plots) summarize a distribution using quartiles. The box represents the interquartile range (IQR) between the 25th (Q1) and 75th (Q3) percentiles, with a line indicating the median (Q2). Whiskers typically extend to the most extreme data points that lie within 1.5 times the IQR of the box edges, and points beyond the whiskers are often considered potential outliers.
Violin plots combine a box plot with a KDE. The width of the "violin" shape represents the density of the data at different values. This allows you to see the summary statistics and the shape of the distribution, including potential multiple peaks (multimodality) that a box plot might hide.
# Box plot
plt.figure(figsize=(8, 5))
sns.boxplot(data=data, y='feature_A', color='#ffc078')
plt.title('Box Plot of Feature A')
plt.ylabel('Feature A Value')
plt.show()
# Violin plot
plt.figure(figsize=(8, 5))
sns.violinplot(data=data, y='feature_B', color='#b197fc')
plt.title('Violin Plot of Feature B')
plt.ylabel('Feature B Value')
plt.show()
Box plot of Feature A and violin plot of Feature B, showing the distribution summary and density shape.
Use box plots for a concise summary of central tendency and spread, especially when comparing many groups. Use violin plots when you need to understand the shape of the distribution more fully, such as identifying multiple peaks or skewness, alongside the summary statistics.
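The numbers behind a box plot can be computed directly, which is handy when you want to inspect the flagged points rather than just see them. This is a minimal sketch assuming the common 1.5 × IQR whisker rule described above; the sample data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=5, size=200)

# Quartiles and interquartile range (the box)
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the fences
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences would be drawn as individual points
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"Whiskers: {whisker_low:.2f} to {whisker_high:.2f}")
print(f"Potential outliers: {outliers.size}")
```

For a right-skewed sample like this one, the flagged points usually sit on the upper side only, which is a hint that they reflect the distribution's long tail rather than data errors.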
Machine learning often involves understanding how variables interact. Does one variable increase as another increases? Do different categories exhibit different patterns? Seaborn excels at visualizing these relationships.
Scatter plots are the standard tool for examining the relationship between two continuous variables. Each point represents an observation, plotted according to its values on the two axes. Seaborn's scatterplot makes this easy and allows for incorporating a third variable using color (hue), size (size), or style (style).
plt.figure(figsize=(8, 6))
sns.scatterplot(data=data, x='feature_A', y='feature_B', hue='category', palette=['#fa5252', '#4c6ef5', '#82c91e'])
plt.title('Relationship between Feature A and Feature B by Category')
plt.xlabel('Feature A')
plt.ylabel('Feature B')
plt.legend(title='Category')
plt.show()
Relationship between Feature A and Feature B, colored by category.
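The hue, size, and style encodings can be combined in a single call. The sketch below is one possible arrangement; the weight column is a hypothetical extra numeric variable invented for this example, and the data is regenerated so the block runs on its own.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 150
df = pd.DataFrame({
    "feature_A": rng.normal(loc=5, scale=2, size=n),
    "feature_B": rng.exponential(scale=5, size=n),
    "weight": rng.uniform(1, 10, size=n),  # hypothetical numeric column for marker size
    "category": rng.choice(["Group 1", "Group 2", "Group 3"], size=n),
})

fig, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(
    data=df, x="feature_A", y="feature_B",
    hue="category",    # color encodes the group
    style="category",  # marker shape repeats the group, which helps in grayscale
    size="weight",     # marker size encodes a numeric column
    sizes=(20, 200),   # smallest and largest marker sizes
    alpha=0.7,
    ax=ax,
)
ax.set_title("Feature A vs Feature B with hue, style, and size")
plt.show()
```

Mapping the same column to both hue and style is redundant on purpose: it keeps groups distinguishable when color is unavailable. Encoding three different variables at once is possible but tends to be harder to read.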
The jointplot function combines a scatter plot with histograms or KDE plots for each variable in the margins. This provides a view of both the relationship and the individual distributions simultaneously.
sns.jointplot(data=data, x='feature_A', y='feature_B', kind='scatter', color='#7048e8') # kind can be 'kde', 'hist', 'reg'
plt.suptitle('Joint Distribution of Feature A and Feature B', y=1.02) # Adjust title position
plt.show()
When dealing with more than two numerical variables, creating individual scatter plots for every pair can be tedious. Seaborn's pairplot automates this process. It generates a grid where the diagonal shows the distribution of each variable (using histplot or kdeplot), and the off-diagonal cells show scatter plots for each pair of variables. This is excellent for getting a quick overview of pairwise relationships in a dataset.
# Create another numerical feature for demonstration
data['feature_C'] = data['feature_A'] * 0.5 + np.random.normal(0, 1, 200)
# Select only numerical columns for pairplot
numerical_data = data[['feature_A', 'feature_B', 'feature_C']]
sns.pairplot(numerical_data)
plt.suptitle('Pairwise Relationships Between Numerical Features', y=1.02)
plt.show()
# Pairplot colored by category
# sns.pairplot(data, hue='category', palette=['#fa5252', '#4c6ef5', '#82c91e'])
# plt.suptitle('Pairwise Relationships Colored by Category', y=1.02)
# plt.show()
Be mindful that pairplot can become computationally intensive and visually cluttered with a large number of features.
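One way to keep a pair plot manageable is to reduce the input before plotting. The sketch below shows how quickly the grid grows and then narrows a wide table to a few columns and a random sample of rows. The table, the selected column names, and the sample size are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical wide dataset: 5000 rows, 12 numerical features
wide = pd.DataFrame(
    rng.normal(size=(5000, 12)),
    columns=[f"feature_{i}" for i in range(12)],
)

# A full pair plot draws one panel per ordered pair of features
n_features = wide.shape[1]
full_grid_panels = n_features ** 2  # 144 panels for 12 features

# Keep a handful of columns and a random sample of rows
selected = ["feature_0", "feature_1", "feature_2", "feature_3"]
subset = wide[selected].sample(n=500, random_state=0)
reduced_panels = len(selected) ** 2  # 16 panels for 4 features

print(f"Full grid: {full_grid_panels} panels, {len(wide)} points each")
print(f"Reduced grid: {reduced_panels} panels, {len(subset)} points each")

# The reduced frame can then be plotted, for example:
# sns.pairplot(subset, corner=True)  # corner=True draws only the lower triangle
```

Which columns to keep is a judgment call; a common starting point is the target variable plus the features you suspect matter most.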
Often, you need to compare a numerical variable across different categories or visualize the relationship between two categorical variables.
Numerical vs. Categorical: Box plots and violin plots are highly effective here. By specifying a categorical variable for the x (or y) axis and a numerical variable for the other, you can compare distributions side-by-side.
plt.figure(figsize=(10, 6))
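# Note: seaborn 0.13+ warns when palette is passed without hue; there, also pass hue='category', legend=False
sns.boxplot(data=data, x='category', y='feature_A', palette=['#f783ac', '#91a7ff', '#8ce99a'])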
plt.title('Distribution of Feature A across Categories (Box Plot)')
plt.xlabel('Category')
plt.ylabel('Feature A')
plt.show()
plt.figure(figsize=(10, 6))
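# Note: seaborn 0.13+ warns when palette is passed without hue; there, also pass hue='category', legend=False
sns.violinplot(data=data, x='category', y='feature_B', palette=['#f783ac', '#91a7ff', '#8ce99a'])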
plt.title('Distribution of Feature B across Categories (Violin Plot)')
plt.xlabel('Category')
plt.ylabel('Feature B')
plt.show()
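The per-category quartiles that a grouped box plot draws can also be tabulated with pandas, which is useful when you need exact values to compare. This is a small sketch with regenerated sample data; the column labels Q1, median, Q3, and IQR are names chosen for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "category": rng.choice(["Group 1", "Group 2", "Group 3"], size=200, p=[0.4, 0.3, 0.3]),
    "feature_A": rng.normal(loc=5, scale=2, size=200),
})

# Quartiles of feature_A within each category, one row per category
summary = (
    df.groupby("category")["feature_A"]
      .quantile([0.25, 0.5, 0.75])
      .unstack()
)
summary.columns = ["Q1", "median", "Q3"]
summary["IQR"] = summary["Q3"] - summary["Q1"]

print(summary.round(2))
```

Reading this table next to the plot makes it easier to tell whether an apparent difference between groups is large relative to their spread.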
Seaborn also offers stripplot and swarmplot, which plot individual data points for each category, helping visualize density and overlap, especially for smaller datasets. swarmplot adjusts point positions to avoid overlap, while stripplot allows points to overlap (often combined with transparency or jitter).
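A side-by-side sketch of the two functions on the same small dataset is shown here. The jitter, alpha, and size values are arbitrary choices for illustration, and the data is regenerated so the block runs independently.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "category": rng.choice(["Group 1", "Group 2", "Group 3"], size=90),
    "feature_A": rng.normal(loc=5, scale=2, size=90),
})

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

# Strip plot: random horizontal jitter plus transparency to reveal overlapping points
sns.stripplot(data=df, x="category", y="feature_A", jitter=0.2, alpha=0.5, ax=axes[0])
axes[0].set_title("stripplot (jitter + transparency)")

# Swarm plot: points are shifted sideways so that none overlap
sns.swarmplot(data=df, x="category", y="feature_A", size=3, ax=axes[1])
axes[1].set_title("swarmplot (non-overlapping points)")

fig.tight_layout()
plt.show()
```

With only a few dozen points per group, the swarm layout stays readable; for much larger groups the strip plot, or a violin plot, is usually the more practical option.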
Categorical vs. Categorical: To understand the relationship between two categorical variables, you typically look at frequency counts. A countplot in Seaborn can show the counts within a single categorical variable, or counts grouped by another categorical variable using the hue parameter. For visualizing the joint frequency or a metric derived from it (like correlations calculated externally), a heatmap is often used.
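The joint frequencies themselves can be computed with pandas before any plotting. The sketch below uses pd.crosstab; the segment column is a hypothetical second categorical variable added only for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "category": rng.choice(["Group 1", "Group 2", "Group 3"], size=200),
    "segment": rng.choice(["low", "high"], size=200),  # hypothetical second categorical column
})

# Joint frequency table: rows are categories, columns are segments
counts = pd.crosstab(df["category"], df["segment"])

# Row-normalized version: share of each segment within a category
row_share = pd.crosstab(df["category"], df["segment"], normalize="index")

print(counts)
print(row_share.round(2))

# Either table can be visualized afterwards, for example:
# sns.heatmap(counts, annot=True, fmt="d")
# sns.countplot(data=df, x="category", hue="segment")
```

Raw counts answer "how many", while the normalized table answers "what proportion", which is often the fairer comparison when group sizes differ.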
Heatmaps are excellent for visualizing matrix-like data, where colors represent values. A common application in data analysis is visualizing the correlation matrix of numerical features. Pandas DataFrames have a .corr() method to compute pairwise correlations, and Seaborn's heatmap can display this matrix effectively.
# Calculate correlation matrix for numerical features
correlation_matrix = numerical_data.corr()
print("Correlation Matrix:")
print(correlation_matrix)
plt.figure(figsize=(7, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='viridis', fmt=".2f", linewidths=.5)
# Common cmap options: 'viridis', 'plasma', 'coolwarm', 'Blues', 'Reds'
# annot=True displays the values on the cells
# fmt=".2f" formats the annotation to two decimal places
plt.title('Correlation Matrix of Numerical Features')
plt.show()
Heatmap showing the pairwise correlation coefficients between numerical features. In the 'viridis' colormap, darker colors correspond to lower (or more negative) coefficients and lighter colors to higher ones, so color alone does not indicate the strength of a correlation.
Heatmaps quickly reveal which variables have strong positive (close to 1) or negative (close to -1) linear relationships, and which are relatively uncorrelated (close to 0).
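Since correlations range from -1 to 1 with 0 as a natural midpoint, one alternative worth trying is a diverging colormap anchored at zero, with the redundant upper triangle hidden. This is a sketch of that variation rather than a required setup; feature_D is a hypothetical column added here so the matrix contains a negative correlation.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
a = rng.normal(loc=5, scale=2, size=200)
df = pd.DataFrame({
    "feature_A": a,
    "feature_B": rng.exponential(scale=5, size=200),
    "feature_C": 0.5 * a + rng.normal(0, 1, 200),
    "feature_D": -0.8 * a + rng.normal(0, 1, 200),  # hypothetical negatively related column
})

corr = df.corr()

# Hide the cells above the diagonal; the matrix is symmetric, so they repeat the lower half
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(7, 5))
sns.heatmap(
    corr, mask=mask, annot=True, fmt=".2f",
    cmap="coolwarm", vmin=-1, vmax=1, center=0,  # fixed scale so 0 maps to the neutral color
    linewidths=0.5, ax=ax,
)
ax.set_title("Correlation Matrix (diverging colormap, lower triangle)")
plt.show()
```

Fixing vmin and vmax keeps colors comparable across different datasets, and the diverging palette makes positive and negative relationships distinguishable at a glance.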
By using these Seaborn functions, you can gain significant insights into your data's structure, identifying patterns, anomalies, and variable interactions that are essential for effective feature engineering and model building. Choose the plot type that best suits the variable types (numerical, categorical) and the specific question you are trying to answer about their distributions or relationships.
© 2025 ApX Machine Learning