While scatter plots give us a visual sense of the relationship between two numerical variables, we often need a quantitative measure to describe the strength and direction of this association. Correlation analysis provides precisely this: a statistical metric summarizing how closely two variables move together.
A correlation coefficient is a numerical value ranging from −1 to +1. It tells us two things about the relationship between two variables: the direction of the association, given by the sign of the coefficient, and the strength of the association, given by how close its magnitude is to 1.
The most commonly used correlation coefficient is Pearson's product-moment correlation coefficient, often denoted as r. It specifically measures the strength and direction of a linear relationship between two continuous variables.
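Formally, for n paired observations of the two variables, Pearson's r is their sample covariance divided by the product of their sample standard deviations:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where x̄ and ȳ denote the sample means of the two variables. The denominator rescales the covariance so that the result always falls between −1 and +1.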
The value of Pearson's r can be interpreted as follows: values near +1 indicate a strong positive linear relationship (as one variable increases, the other tends to increase), values near −1 indicate a strong negative linear relationship (as one increases, the other tends to decrease), and values near 0 indicate little or no linear relationship.
Pearson's r assumes that the variables are approximately normally distributed and that the relationship between them is linear. It can also be sensitive to outliers.
Pandas makes calculating Pearson's correlation straightforward using the .corr() method on a DataFrame. By default, it computes the pairwise correlation of columns.
import pandas as pd
# Sample DataFrame
data = {'Temperature': [20, 22, 25, 18, 23, 28],
        'Ice_Cream_Sales': [150, 170, 200, 130, 180, 230],
        'Umbrella_Sales': [50, 45, 30, 60, 40, 20]}
df = pd.DataFrame(data)
# Calculate the pairwise Pearson correlation matrix
correlation_matrix = df.corr(method='pearson')
print(correlation_matrix)
This will output a matrix where each cell (i,j) contains the Pearson correlation coefficient between column i and column j. The diagonal elements will always be 1, as a variable is perfectly correlated with itself.
                 Temperature  Ice_Cream_Sales  Umbrella_Sales
Temperature          1.00000          1.00000        -0.99632
Ice_Cream_Sales      1.00000          1.00000        -0.99632
Umbrella_Sales      -0.99632         -0.99632         1.00000
From this output, we see a perfect positive correlation (r=1.0) between Temperature and Ice_Cream_Sales, because in this small dataset sales are an exact linear function of temperature, and a strong negative correlation (r≈−0.996) between Temperature and Umbrella_Sales.
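Because correlation_matrix is itself a DataFrame, you can also read off a single entry by row and column label with .loc (the variable name below is just for illustration):

# Look up one entry of the correlation matrix by label
temp_umbrella_corr = correlation_matrix.loc['Temperature', 'Umbrella_Sales']
print(f"Correlation between Temperature and Umbrella Sales: {temp_umbrella_corr:.4f}")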
To calculate the correlation between just two specific columns (Series), you can use the .corr() method on one Series and pass the other as an argument:
# Correlation between Temperature and Ice_Cream_Sales
temp_sales_corr = df['Temperature'].corr(df['Ice_Cream_Sales'])
print(f"Correlation between Temperature and Ice Cream Sales: {temp_sales_corr:.4f}")
# Output: Correlation between Temperature and Ice Cream Sales: 1.0000
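To see what .corr() computes under the hood, here is a minimal NumPy sketch of the Pearson formula applied to the same two columns (np.corrcoef performs the equivalent calculation in one call); it should agree with the pandas result up to floating-point precision:

import numpy as np

temperature = np.array([20, 22, 25, 18, 23, 28], dtype=float)
ice_cream_sales = np.array([150, 170, 200, 130, 180, 230], dtype=float)

# Pearson's r: sum of co-deviations divided by the product of the
# root sums of squared deviations
temp_dev = temperature - temperature.mean()
sales_dev = ice_cream_sales - ice_cream_sales.mean()
r = (temp_dev * sales_dev).sum() / np.sqrt((temp_dev ** 2).sum() * (sales_dev ** 2).sum())
print(f"{r:.4f}")  # matches df['Temperature'].corr(df['Ice_Cream_Sales'])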
What if the relationship between variables isn't linear, but still generally increasing or decreasing (monotonic)? Or what if the data includes significant outliers or doesn't follow a normal distribution? In such cases, Pearson's r might not be the best measure.
Spearman's rank correlation coefficient, denoted as ρ (rho), is a non-parametric alternative. It assesses how well the relationship between two variables can be described using a monotonic function. Instead of using the actual values, Spearman's correlation calculates Pearson's correlation on the ranks of the data.
It's less sensitive to outliers than Pearson's r and doesn't assume linearity or normality.
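Because Spearman's coefficient is just Pearson's coefficient computed on ranks, you can reproduce it yourself by ranking each column with .rank() and then applying Pearson's correlation; this small sketch uses the DataFrame from the example above:

# Rank each column, then compute Pearson's correlation on the ranks
ranked = df.rank()
print(ranked.corr(method='pearson'))

In practice you rarely need to do this manually, because pandas exposes the same calculation directly, as shown next.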
You can calculate Spearman's correlation by specifying method='spearman' in the .corr() method:
# Calculate the pairwise Spearman correlation matrix
spearman_corr_matrix = df.corr(method='spearman')
print(spearman_corr_matrix)
                 Temperature  Ice_Cream_Sales  Umbrella_Sales
Temperature              1.0              1.0            -1.0
Ice_Cream_Sales          1.0              1.0            -1.0
Umbrella_Sales          -1.0             -1.0             1.0
Notice that the Spearman values can differ from Pearson's, especially when non-linear monotonic relationships or outliers are present. In this example, Umbrella_Sales decreases strictly as Temperature (and therefore Ice_Cream_Sales) rises, so Spearman reports a perfect monotonic correlation of −1 for those pairs, slightly stronger in magnitude than the corresponding Pearson values, which stay just below 1 because the relationship is monotonic but not exactly linear.
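To see the difference in robustness, consider a small hypothetical dataset, unrelated to the sales example, in which y follows x perfectly except for one extreme outlier (the numbers are made up purely for illustration):

# Hypothetical data: a perfect increasing relationship broken by one outlier
outlier_df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'y': [1, 2, 3, 4, 5, 6, 7, 8, 9, -50],  # the last point is a severe outlier
})

# The single outlier drags Pearson's r below zero, while Spearman's rho,
# which depends only on the ranks, remains positive
print(outlier_df['x'].corr(outlier_df['y'], method='pearson'))
print(outlier_df['x'].corr(outlier_df['y'], method='spearman'))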
Correlation analysis provides a compact numerical summary of the relationship between pairs of numerical variables, complementing the visual insights gained from scatter plots and guiding further investigation or feature selection.
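As a simple illustration of the feature-selection use, if Ice_Cream_Sales were the quantity you wanted to predict, one rough first pass is to rank the remaining columns by the absolute value of their correlation with it (the variable names here are purely illustrative):

# Rank candidate features by the strength of their correlation with a target
target = 'Ice_Cream_Sales'
corr_with_target = df.corr()[target].drop(target).abs().sort_values(ascending=False)
print(corr_with_target)

Screening like this only captures linear (or, with method='spearman', monotonic) associations with the target, so treat it as a starting point rather than a substitute for proper feature evaluation.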