Now that we've covered the theory behind exploring relationships between pairs of variables, let's put these techniques into practice. This section provides hands-on examples using Python libraries like Pandas, Matplotlib, and Seaborn to perform bivariate analysis on a sample dataset. We assume you have a Pandas DataFrame loaded, perhaps from the steps outlined in Chapter 2.
For our examples, let's imagine we have a DataFrame named df containing customer information with columns such as Age, Annual_Income_k$, Spending_Score (a score from 1-100 assigned based on spending behavior), Gender, and Education.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assume 'df' is your loaded DataFrame
# Example: df = pd.read_csv('customer_data.csv')
# Let's remind ourselves of the data structure
print("Data Sample:")
print(df.head())
print("\nData Info:")
df.info()
# Set plot style for consistency
sns.set_theme(style="whitegrid")
The output of df.head() and df.info() would confirm the column names and their data types, setting the stage for our analysis.
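If you do not have such a file at hand, a stand-in DataFrame with the same column names can be generated. The sketch below is only an assumption about what the data might look like: the value ranges, the sample size, and the Education labels other than 'Master' are hypothetical, and because every column is drawn independently at random, the plots and correlations it produces will show no real structure.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: all values are randomly generated,
# not taken from any real customer dataset.
rng = np.random.default_rng(seed=42)
n_customers = 200

df = pd.DataFrame({
    "Age": rng.integers(18, 71, size=n_customers),               # 18 to 70
    "Annual_Income_k$": rng.integers(15, 141, size=n_customers),  # 15 to 140
    "Spending_Score": rng.integers(1, 101, size=n_customers),     # 1 to 100
    "Gender": rng.choice(["Male", "Female"], size=n_customers),
    # Assumed category labels, for illustration only
    "Education": rng.choice(
        ["High School", "Bachelor", "Master", "PhD"], size=n_customers
    ),
})

print(df.head())
df.info()
```

With a fixed seed the same rows are produced on every run, which keeps the later examples reproducible.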
A common task is to understand how two quantitative measures relate to each other. Does income increase with age? Is there a connection between income and spending score?
Scatter plots are excellent for visualizing the relationship between two numerical variables. Let's examine the relationship between Annual_Income_k$ and Spending_Score.
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Annual_Income_k$', y='Spending_Score', hue='Gender', palette=['#1c7ed6', '#f06595'])
plt.title('Annual Income vs. Spending Score')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()
Scatter plot showing the relationship between customer annual income and their spending score. Points are colored by gender.
This plot lets us inspect the relationship visually. We might observe clusters, such as customers with low income and low spending scores, or high income and high spending scores. Mapping Gender to the hue parameter lets us see whether the relationship differs between groups.
While scatter plots show the relationship visually, correlation coefficients quantify the linear association between numerical variables. Pearson's correlation coefficient (r) ranges from -1 (perfect negative linear correlation) to +1 (perfect positive linear correlation), with 0 indicating no linear correlation.
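To make the definition concrete, here is a minimal sketch that computes Pearson's r directly from its formula (the covariance of the two variables divided by the product of their standard deviations) and checks it against the Pandas result. The five data points are hypothetical toy values chosen to keep the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, for illustration only
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Center each variable around its mean
x_centered = x - x.mean()
y_centered = y - y.mean()

# r = sum of cross-products / sqrt(sum of squares of x * sum of squares of y)
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# Pandas computes the same quantity (Pearson is the default method)
r_pandas = x.corr(y)

print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
# → manual: 0.7746, pandas: 0.7746
```

Here the cross-products sum to 6 and the sums of squares are 10 and 6, so r = 6 / sqrt(60), or roughly 0.77: a fairly strong positive linear association.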
Let's calculate the correlation between the numerical columns in our DataFrame.
# Select every numeric dtype, not only int64 and float64
numerical_cols = df.select_dtypes(include='number').columns
correlation_matrix = df[numerical_cols].corr()
print("Correlation Matrix:")
print(correlation_matrix)
This outputs a table of pairwise correlation coefficients. For instance, we might find a weak positive correlation between Age and Annual_Income_k$, or a moderate negative correlation between Age and Spending_Score.
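A coefficient near zero in such a table rules out only a linear relationship, not a relationship of any kind. The following sketch illustrates this with hypothetical data in which y is completely determined by x, yet Pearson's r is essentially zero because the curve is symmetric around the mean of x.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y depends entirely on x, but through a symmetric curve
x = pd.Series(np.linspace(-5, 5, 101))
y = x ** 2

# Pearson's r measures linear association only
pearson_r = x.corr(y)

print(f"Pearson r between x and x squared: {pearson_r:.3f}")
```

This is one reason to look at the scatter plot alongside the correlation matrix rather than relying on the coefficients alone.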
A heatmap provides a graphical representation of the correlation matrix, making it easier to spot strong or weak correlations, especially when many variables are involved.
plt.figure(figsize=(8, 6))
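# Fix the color scale to [-1, 1] so that 0 maps to the neutral midpoint of the colormap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5, vmin=-1, vmax=1)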
plt.title('Correlation Matrix of Numerical Features')
plt.show()
Heatmap visualizing the Pearson correlation coefficients between Age, Annual Income, and Spending Score. Annotations show the correlation values.
The colors indicate the strength and direction of the correlation (e.g., red for positive, blue for negative), and the annotations provide the exact values.
Often, we need to compare a numerical measure across different groups defined by a categorical variable. For example, does the typical Annual_Income_k$ differ noticeably between Gender groups or across Education levels?
Box plots and violin plots are effective for visualizing the distribution of a numerical variable for each category of a categorical variable.
Let's compare Annual_Income_k$ across different Education levels using a box plot.
plt.figure(figsize=(12, 7))
sns.boxplot(data=df, x='Education', y='Annual_Income_k$', palette='Blues')
plt.title('Annual Income Distribution by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Annual Income (k$)')
plt.xticks(rotation=45)
plt.show()
Box plots comparing the distribution of Annual Income for different Education Levels.
This plot shows the median (central line), interquartile range (IQR, the box), and potential outliers (points beyond the whiskers) for income within each education category. We can visually assess if the central tendency and spread of income differ across these groups. A violin plot (sns.violinplot) could be used similarly, adding information about the density shape of the distribution.
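The quantities a box plot draws can also be computed directly, which helps when reading the figure. The sketch below uses ten hypothetical income values for a single education level and applies the common 1.5 × IQR whisker rule, which is the default in Matplotlib and Seaborn.

```python
import pandas as pd

# Hypothetical incomes (in k$) for one education level
income = pd.Series([30, 35, 38, 40, 42, 45, 48, 50, 55, 120])

q1 = income.quantile(0.25)   # lower edge of the box
median = income.median()     # central line
q3 = income.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                # height of the box

# Points further than 1.5 * IQR from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = income[(income < lower_fence) | (income > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print("Outliers:", outliers.tolist())
# → Q1=38.5, median=43.5, Q3=49.5, IQR=11.0
# → Fences: [22.0, 66.0]
# → Outliers: [120]
```

To get the same summary for every category at once, df.groupby('Education')['Annual_Income_k$'].describe() reports the quartiles per group.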
To understand if there's an association between two categorical variables, we can use frequency counts and visualizations. For example, is there a relationship between Gender and Education level in our customer base?
A cross-tabulation (or contingency table) shows the frequency distribution of one categorical variable against another. Pandas provides the crosstab function for this.
cross_tab = pd.crosstab(df['Gender'], df['Education'])
print("Cross-Tabulation: Gender vs Education")
print(cross_tab)
This outputs a table where rows represent one category (e.g., Gender) and columns represent the other (e.g., Education), with the cells containing the counts of occurrences for each combination. You can also display proportions by using the normalize argument in pd.crosstab.
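As a brief illustration of the normalize argument, the sketch below builds a row-normalized table from a hypothetical eight-row sample (the labels are illustrative only). With normalize='index' each row is divided by its row total, so the cells give the share of each education level within a gender.

```python
import pandas as pd

# Hypothetical mini-sample; the category labels are illustrative only
sample = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male",
               "Male", "Male", "Male", "Female"],
    "Education": ["Bachelor", "Master", "Master", "Bachelor",
                  "Bachelor", "Master", "Bachelor", "Bachelor"],
})

# Raw counts for reference
counts = pd.crosstab(sample["Gender"], sample["Education"])

# Row-wise proportions: every row sums to 1
row_props = pd.crosstab(sample["Gender"], sample["Education"], normalize="index")

print(counts)
print(row_props)
```

In this sample half of the female customers and a quarter of the male customers hold a 'Master' degree. Passing normalize='columns' divides by column totals instead, and normalize='all' divides by the grand total.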
Visualizing the cross-tabulation often makes the relationship clearer. A grouped or stacked bar chart is suitable for this. We can use Seaborn's countplot with the hue parameter.
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Education', hue='Gender', palette=['#1c7ed6', '#f06595'])
plt.title('Education Level Distribution by Gender')
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Gender')
plt.show()
Grouped bar chart showing the count of customers for each Education Level, separated by Gender.
This chart allows for a direct comparison of counts across categories. For example, we can see whether the number of customers with a 'Master' degree differs between 'Male' and 'Female' customers in our dataset. To compare proportions rather than raw counts, a stacked bar chart built from a normalized cross-tabulation (using kind='bar', stacked=True with Pandas plotting) shows the gender composition within each education level. Note that setting dodge=False in sns.countplot draws the bars for the hue levels on top of one another rather than stacking them, so it is not a substitute for a true stacked chart.
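One possible way to draw the stacked version with Pandas plotting is sketched below. The eight-row sample and its labels are hypothetical, and the colors are simply assigned to the alphabetically sorted Gender columns; with your own data you would pass df['Education'] and df['Gender'] to pd.crosstab instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical mini-sample; the category labels are illustrative only
sample = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male",
               "Male", "Male", "Male", "Female"],
    "Education": ["Bachelor", "Master", "Master", "Bachelor",
                  "Bachelor", "Master", "Bachelor", "Bachelor"],
})

# Share of each gender within every education level (each row sums to 1)
props = pd.crosstab(sample["Education"], sample["Gender"], normalize="index")

# One bar per education level, with the gender shares stacked on top of each other
ax = props.plot(kind="bar", stacked=True, figsize=(10, 6),
                color=["#f06595", "#1c7ed6"])  # columns are Female, Male
ax.set_title("Gender Composition by Education Level")
ax.set_xlabel("Education Level")
ax.set_ylabel("Proportion of Customers")
ax.legend(title="Gender")
plt.xticks(rotation=45)
plt.show()
```

Because every bar reaches a height of 1, differences in composition are easy to compare even when the education levels contain very different numbers of customers.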
By applying these techniques (scatter plots, correlation analysis, comparative plots, cross-tabulations, and categorical bar charts), you can systematically investigate the relationships between pairs of variables in your dataset. These explorations reveal how variables move together and how groups differ, and they are an essential part of the EDA process.
© 2025 ApX Machine Learning