After exploring individual variables through univariate analysis, our focus shifts to examining the interplay between pairs of variables. This bivariate analysis is essential for uncovering connections and patterns that involve more than one feature in your dataset.
One of the most common and informative scenarios in bivariate analysis involves examining the relationship between two numerical variables. For instance, how does engine displacement relate to fuel efficiency? Or how does advertising spend correlate with sales? The primary visual tool for investigating such relationships is the scatter plot.
A scatter plot displays individual data points on a two-dimensional graph. Each point's position is determined by the values of two selected numerical variables: one variable dictates the position on the horizontal axis (x-axis), and the other dictates the position on the vertical axis (y-axis). This visualization allows us to directly observe the structure, direction, and strength of the association between the two variables.
Python libraries like Matplotlib and Seaborn provide convenient functions for generating scatter plots. Seaborn's scatterplot function is particularly useful as it integrates smoothly with Pandas DataFrames.
Let's start with a simple example. We'll generate some synthetic data where two variables have a roughly linear relationship and then plot them using Seaborn.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Generate sample data
np.random.seed(42) # for reproducibility
x_data = np.random.rand(100) * 10
y_data = 2.5 * x_data + np.random.randn(100) * 5 # y = 2.5x + noise
# Create a DataFrame
df = pd.DataFrame({'Variable_X': x_data, 'Variable_Y': y_data})
# Create the scatter plot
plt.figure(figsize=(8, 5)) # Set the figure size
sns.scatterplot(data=df, x='Variable_X', y='Variable_Y')
# Add labels and title for clarity
plt.xlabel("Variable X")
plt.ylabel("Variable Y")
plt.title("Scatter Plot of Variable Y vs. Variable X")
# Display the plot
plt.show()
This code first creates two NumPy arrays, x_data and y_data, where y_data is linearly dependent on x_data with some added random noise. These are then put into a Pandas DataFrame. Finally, sns.scatterplot is called with the DataFrame and the column names for the x and y axes. We also add labels and a title using Matplotlib functions for better interpretation.
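For comparison, the same plot can be drawn with Matplotlib alone. The following is a minimal sketch that regenerates the synthetic data from the example above and uses Matplotlib's Axes.scatter method; the figure size and the "(Matplotlib)" suffix in the title are illustrative choices, not part of the original example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Recreate the same synthetic data as in the Seaborn example
np.random.seed(42)
x_data = np.random.rand(100) * 10
y_data = 2.5 * x_data + np.random.randn(100) * 5  # y = 2.5x + noise

# Draw the scatter plot directly on a Matplotlib Axes object
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x_data, y_data)

# Label the axes and add a title
ax.set_xlabel("Variable X")
ax.set_ylabel("Variable Y")
ax.set_title("Scatter Plot of Variable Y vs. Variable X (Matplotlib)")

plt.show()
```

Working with an explicit Axes object like this is convenient when several plots need to share one figure.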
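When examining a scatter plot, look for these key characteristics: the direction of the association (positive or negative), its form (for example, linear or curved), and its strength (how tightly the points cluster around the overall trend).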
Here are visual examples of different patterns:
A clear upward trend indicates a positive linear association.
A clear downward trend indicates a negative linear association.
Points are scattered randomly, suggesting little to no linear association between X and Y.
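To reproduce patterns like these yourself, here is a small sketch that generates three synthetic datasets and plots them side by side. The slopes, noise levels, seed, and panel titles are assumptions chosen for illustration; they are not the data behind the figures described above.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
x = rng.uniform(0, 10, 100)

# Hypothetical relationships, one per pattern
patterns = {
    "Positive": 2.0 * x + rng.normal(0, 2, 100),    # y rises with x
    "Negative": -2.0 * x + rng.normal(0, 2, 100),   # y falls as x rises
    "No association": rng.normal(0, 5, 100),        # y ignores x
}

# One panel per pattern
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, y) in zip(axes, patterns.items()):
    sns.scatterplot(x=x, y=y, ax=ax)
    ax.set_title(name)
    ax.set_xlabel("X")
    ax.set_ylabel("Y")

plt.tight_layout()
plt.show()
```

Changing the noise scale in each pattern is a quick way to see how the apparent strength of an association changes while its direction stays the same.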
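A few practical points help when creating scatter plots:
With many data points, markers can overlap and hide how dense the data really is (overplotting). Reducing the marker size can help (the s parameter in scatterplot or scatter).
Making the markers semi-transparent also reveals dense regions; use the alpha parameter (e.g., alpha=0.5).
Always label your axes (plt.xlabel, plt.ylabel) and provide an informative title (plt.title). This is fundamental for communicating your findings.
You can also encode a third variable in the same plot. Seaborn's scatterplot allows this using the hue (color), size, or style parameters to differentiate points based on the third variable's values.
Scatter plots provide an invaluable first look at the potential relationship between two numerical variables. They visually summarize the association's direction, form, and strength, guiding further quantitative analysis, such as calculating correlation coefficients, which we will discuss next.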
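The s, alpha, and hue parameters mentioned above can be tried out with a short sketch like the following. The dataset is synthetic, and the column names (x, y, group), the sample size, and the marker settings are hypothetical values chosen only for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
n = 2000  # enough points to cause visible overplotting

# Synthetic data with a hypothetical categorical third variable
df_big = pd.DataFrame({
    "x": rng.normal(0, 1, n),
    "group": rng.choice(["A", "B"], size=n),
})
df_big["y"] = 0.8 * df_big["x"] + rng.normal(0, 0.5, n)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Smaller, semi-transparent markers make dense regions easier to see
sns.scatterplot(data=df_big, x="x", y="y", s=15, alpha=0.5, ax=ax1)
ax1.set_title("Smaller markers with alpha=0.5")

# Color the points by the third variable using hue
sns.scatterplot(data=df_big, x="x", y="y", hue="group", s=15, alpha=0.5, ax=ax2)
ax2.set_title("Points colored by group")

plt.tight_layout()
plt.show()
```

Passing a column name to size or style instead of hue works the same way, mapping the third variable to marker size or marker shape.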
© 2025 ApX Machine Learning