Hands-on Practical: Bivariate Exploration

Now that we've covered the theory behind exploring relationships between pairs of variables, let's put these techniques into practice. This section provides hands-on examples using Python libraries like Pandas, Matplotlib, and Seaborn to perform bivariate analysis on a sample dataset. We assume you have a Pandas DataFrame loaded, perhaps from the steps outlined in Chapter 2.

For our examples, let's imagine we have a DataFrame named df containing customer information with columns such as Age, Annual_Income_k$, Spending_Score (a score from 1-100 assigned based on spending behavior), Gender, and Education.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assume 'df' is your loaded DataFrame
# Example: df = pd.read_csv('customer_data.csv')

# Let's remind ourselves of the data structure
print("Data Sample:")
print(df.head())
print("\nData Info:")
df.info()

# Set plot style for consistency
sns.set_theme(style="whitegrid")

The output of df.head() and df.info() would confirm the column names and their data types, setting the stage for our analysis.
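If you do not have such a file at hand, a synthetic stand-in is enough to follow along. The sketch below is only an assumption about what the data might look like: the values are randomly generated, the column names follow the description above, and the education labels other than 'Master' are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: every value is randomly generated,
# not taken from real customer records.
rng = np.random.default_rng(seed=42)
n_customers = 200

df = pd.DataFrame({
    "Age": rng.integers(18, 71, size=n_customers),             # 18..70 inclusive
    "Annual_Income_k$": rng.normal(60, 20, size=n_customers).clip(15, 140).round(1),
    "Spending_Score": rng.integers(1, 101, size=n_customers),  # 1..100 inclusive
    "Gender": rng.choice(["Male", "Female"], size=n_customers),
    # Labels other than 'Master' are hypothetical placeholders
    "Education": rng.choice(
        ["High School", "Bachelor", "Master", "PhD"], size=n_customers
    ),
})

print(df.head())
```

Because the columns are drawn independently, this stand-in will show little or no relationship between variables; it is useful for checking that the code runs, not for drawing conclusions.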

Analyzing Numerical vs. Numerical Relationships

A common task is to understand how two quantitative measures relate to each other. Does income increase with age? Is there a connection between income and spending score?

Scatter Plots

Scatter plots are excellent for visualizing the relationship between two numerical variables. Let's examine the relationship between Annual_Income_k$ and Spending_Score.

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Annual_Income_k$', y='Spending_Score', hue='Gender', palette=['#1c7ed6', '#f06595'])
plt.title('Annual Income vs. Spending Score')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()

Scatter plot showing the relationship between customer annual income and their spending score. Points are colored by gender.

This plot lets us visually inspect the joint distribution of the two variables. We might observe clusters, such as customers with low income and low spending scores, or high income and high spending scores. Mapping Gender to hue lets us check whether the relationship differs between groups.

Correlation Analysis

While scatter plots show the relationship visually, correlation coefficients quantify the linear association between numerical variables. Pearson's correlation coefficient ( $r$ ) ranges from -1 (perfect negative linear correlation) to +1 (perfect positive linear correlation), with 0 indicating no linear correlation.
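To make the definition concrete, here is a minimal sketch that computes Pearson's $r$ by hand as the sum of the products of deviations from the mean, divided by the square root of the product of the summed squared deviations, and compares it with NumPy's built-in result. The arrays are hypothetical toy values chosen only for illustration.

```python
import numpy as np

# Hypothetical toy values, chosen only to illustrate the calculation
income = np.array([15.0, 30.0, 45.0, 60.0, 75.0])
score = np.array([20.0, 35.0, 40.0, 65.0, 70.0])

# Deviations of each observation from its variable's mean
dx = income - income.mean()
dy = score - score.mean()

# Pearson's r: co-deviation divided by the product of the deviation magnitudes
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# NumPy's built-in result for comparison
r_numpy = np.corrcoef(income, score)[0, 1]

print(f"manual r = {r_manual:.4f}, np.corrcoef r = {r_numpy:.4f}")
```

The two values should agree up to floating-point error, which is a quick way to convince yourself of what the coefficient measures.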

Let's calculate the correlation between the numerical columns in our DataFrame.

# Select every numeric column, regardless of integer or float width
numerical_cols = df.select_dtypes(include='number').columns
correlation_matrix = df[numerical_cols].corr()

print("Correlation Matrix:")
print(correlation_matrix)

This will output a table showing the pairwise correlation coefficients. For instance, we might find a weak positive correlation between Age and Annual_Income_k$, or a moderate negative correlation between Age and Spending_Score.

Correlation Heatmap

A heatmap provides a graphical representation of the correlation matrix, making it easier to spot strong or weak correlations, especially when many variables are involved.

plt.figure(figsize=(8, 6))
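# Fix the color scale to [-1, 1] so that 0 maps to the neutral midpoint of the colormap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5, vmin=-1, vmax=1)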
plt.title('Correlation Matrix of Numerical Features')
plt.show()

Heatmap visualizing the Pearson correlation coefficients between Age, Annual Income, and Spending Score. Annotations show the correlation values.

The colors indicate the strength and direction of the correlation (e.g., red for positive, blue for negative), and the annotations provide the exact values.

Analyzing Numerical vs. Categorical Relationships

Often, we need to compare a numerical measure across different groups defined by a categorical variable. For example, does the typical Annual_Income_k$ differ noticeably between Gender groups or across Education levels?

Comparative Plots (Box Plots, Violin Plots)

Box plots and violin plots are effective for visualizing the distribution of a numerical variable for each category of a categorical variable.

Let's compare Annual_Income_k$ across different Education levels using a box plot.

plt.figure(figsize=(12, 7))
sns.boxplot(data=df, x='Education', y='Annual_Income_k$', palette='Blues')
plt.title('Annual Income Distribution by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Annual Income (k$)')
plt.xticks(rotation=45)
plt.show()

Box plots comparing the distribution of Annual Income for different Education Levels.

This plot shows the median (central line), interquartile range (IQR, the box), and potential outliers (points beyond the whiskers, which by default extend to the most extreme data points within 1.5 × IQR of the box) for income within each education category. We can visually assess whether the central tendency and spread of income differ across these groups. A violin plot (sns.violinplot) could be used similarly, adding information about the density shape of the distribution.
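The quantities a box plot draws can also be computed directly, which helps when you need exact numbers rather than a visual impression. The following is a minimal sketch on hypothetical toy data (the education labels and income values are placeholders); it derives the quartiles, the IQR, and the 1.5 × IQR fences beyond which points would be drawn as outliers.

```python
import pandas as pd

# Hypothetical toy data; labels and values are placeholders
df = pd.DataFrame({
    "Education": ["Bachelor", "Bachelor", "Bachelor", "Bachelor",
                  "Master", "Master", "Master", "Master"],
    "Annual_Income_k$": [40, 50, 60, 70, 60, 80, 100, 120],
})

grouped = df.groupby("Education")["Annual_Income_k$"]

# Quartiles per group: the box spans q1..q3, the central line is the median
summary = pd.DataFrame({
    "q1": grouped.quantile(0.25),
    "median": grouped.median(),
    "q3": grouped.quantile(0.75),
})
summary["iqr"] = summary["q3"] - summary["q1"]

# Points outside these fences are the ones a default box plot marks as outliers
summary["lower_fence"] = summary["q1"] - 1.5 * summary["iqr"]
summary["upper_fence"] = summary["q3"] + 1.5 * summary["iqr"]

print(summary)
```

Reading this table alongside the plot makes it easier to state differences between groups precisely, for example by comparing medians or IQRs.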

Analyzing Categorical vs. Categorical Relationships

To understand if there's an association between two categorical variables, we can use frequency counts and visualizations. For example, is there a relationship between Gender and Education level in our customer base?

Cross-Tabulation

A cross-tabulation (or contingency table) shows the frequency distribution of one categorical variable against another. Pandas provides the crosstab function for this.

cross_tab = pd.crosstab(df['Gender'], df['Education'])

print("Cross-Tabulation: Gender vs Education")
print(cross_tab)

This outputs a table where rows represent one category (e.g., Gender) and columns represent the other (e.g., Education), with the cells containing the counts of occurrences for each combination. You can also display proportions by using the normalize argument in pd.crosstab.
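As a brief illustration of the normalize argument, the sketch below builds a small hypothetical DataFrame (the values are placeholders) and shows row-wise proportions, column-wise proportions, and counts with totals added through the margins argument.

```python
import pandas as pd

# Hypothetical toy data; values are placeholders
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male",
               "Female", "Female", "Female", "Female"],
    "Education": ["Bachelor", "Bachelor", "Bachelor", "Master",
                  "Bachelor", "Master", "Master", "Master"],
})

# Row-wise proportions: each Gender row sums to 1
row_props = pd.crosstab(df["Gender"], df["Education"], normalize="index")

# Column-wise proportions: each Education column sums to 1
col_props = pd.crosstab(df["Gender"], df["Education"], normalize="columns")

# Raw counts with an extra 'All' row and column holding the totals
counts = pd.crosstab(df["Gender"], df["Education"], margins=True)

print(row_props, col_props, counts, sep="\n\n")
```

Row-wise normalization answers "what share of each gender holds each degree?", while column-wise normalization answers "what is the gender split within each degree?". Choose the one that matches the question you are asking.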

Stacked or Grouped Bar Charts

Visualizing the cross-tabulation often makes the relationship clearer. A grouped or stacked bar chart is suitable for this. Seaborn's countplot with the hue parameter produces the grouped version.

plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Education', hue='Gender', palette=['#1c7ed6', '#f06595'])
plt.title('Education Level Distribution by Gender')
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Gender')
plt.show()

Grouped bar chart showing the count of customers for each Education Level, separated by Gender.

This chart allows for a direct comparison of counts across categories. For example, we can see whether the number of customers with a 'Master' degree differs between 'Male' and 'Female' customers in our dataset. A stacked bar chart could alternatively show the composition of each education level. One way to draw it is to plot a cross-tabulation with Pandas plotting (kind='bar', stacked=True), normalizing the table first if you want relative proportions rather than counts. Note that setting dodge=False in sns.countplot draws the bars for each hue level on top of one another at the same position; it does not stack them.
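A stacked chart of proportions can be sketched with Pandas plotting on a normalized cross-tabulation. The data below is a hypothetical toy example, and the title and axis labels are illustrative choices rather than anything prescribed by the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; values are placeholders
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male",
               "Female", "Female", "Female", "Female"],
    "Education": ["Bachelor", "Bachelor", "Bachelor", "Master",
                  "Bachelor", "Master", "Master", "Master"],
})

# Share of each gender within every education level (each row sums to 1)
props = pd.crosstab(df["Education"], df["Gender"], normalize="index")

# One bar per education level, with gender segments stacked on top of each other
ax = props.plot(kind="bar", stacked=True, figsize=(10, 6))
ax.set_title("Gender Composition by Education Level")
ax.set_xlabel("Education Level")
ax.set_ylabel("Proportion of customers")
ax.legend(title="Gender")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```

Because every bar reaches a height of 1, differences in composition are easy to compare even when the education levels contain very different numbers of customers. Drop normalize="index" to stack raw counts instead.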

By applying these techniques (scatter plots, correlation analysis, comparative plots, cross-tabulations, and categorical bar charts), you can systematically investigate the relationships between pairs of variables in your dataset. These explorations reveal the underlying structure of your data and are an essential part of the EDA process.
