While measures like mean and variance tell us about the characteristics of a single variable, we often need to understand how two or more variables relate to each other within a dataset. Does an increase in one variable tend to correspond with an increase (or decrease) in another? Correlation analysis provides a quantitative way to measure the strength and direction of a linear relationship between two quantitative variables.
The most common measure of correlation is the Pearson correlation coefficient, typically denoted by r. It quantifies the linear association between two variables, let's call them X and Y. The value of r always falls between -1 and +1, inclusive.
The formula for the sample Pearson correlation coefficient $r$ between variables X and Y with $n$ data points $(x_i, y_i)$ is:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
Where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y, respectively. This formula calculates the extent to which X and Y vary together (covariance), normalized by their individual variabilities (standard deviations).
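To make the formula concrete, here is a minimal sketch that evaluates it term by term using only the Python standard library. The function name pearson_r and the sample data are illustrative assumptions, not part of any library.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation, computed directly from the formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations (how X and Y vary together)
    co_dev = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return co_dev / math.sqrt(ss_x * ss_y)

# Hypothetical data: y is exactly 2x + 1, so every point lies on an upward line
r = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
print(round(r, 6))  # 1.0
```

The guard against a zero denominator reflects a property of the formula itself: if either variable never changes, there is no variability to normalize by, so r is undefined.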
The best way to visually inspect the relationship between two quantitative variables is using a scatter plot. Each point on the plot represents a pair of values $(x_i, y_i)$. The overall pattern of the points suggests the type and strength of the correlation.
Scatter plots showing examples of strong positive correlation (top, r ≈ +1), strong negative correlation (middle, r ≈ -1), and weak or no linear correlation (bottom, r ≈ 0).
In the top plot, as X increases, Y consistently increases, clustering tightly around an upward-sloping line. In the middle plot, as X increases, Y consistently decreases. In the bottom plot, there's no clear linear trend; the points are scattered without a discernible line.
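The three patterns described above can be reproduced numerically. The sketch below generates synthetic data (the slopes, noise level, and seed are arbitrary assumptions) and computes r for each case with NumPy's corrcoef. It also includes a curved relationship as a reminder that r near 0 indicates no linear trend, not necessarily no relationship.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
x = np.linspace(0, 10, 200)
noise = rng.normal(0, 0.5, size=x.size)

y_pos = 2 * x + noise                    # tight upward trend
y_neg = -2 * x + noise                   # tight downward trend
y_none = rng.normal(0, 1, size=x.size)   # generated independently of x
y_curve = (x - 5) ** 2                   # strong but non-linear (U-shaped)

def r(a, b):
    # np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is r(a, b)
    return np.corrcoef(a, b)[0, 1]

for name, y in [("positive", y_pos), ("negative", y_neg),
                ("none", y_none), ("curved", y_curve)]:
    print(f"{name:>8}: r = {r(x, y):+.3f}")
```

Plotting each y against x (for example with matplotlib's scatter) should give pictures that match the descriptions of the top, middle, and bottom panels.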
While understanding the formula is helpful, you'll typically use software libraries to compute correlation coefficients. In Python, the Pandas library provides a convenient .corr() method for DataFrames, which calculates the pairwise correlation between all columns.
import pandas as pd
# Example DataFrame
data = {'Variable_A': [1, 2, 3, 4, 5, 6],
        'Variable_B': [2, 4, 5, 8, 10, 11],
        'Variable_C': [10, 8, 7, 4, 2, 1]}
df = pd.DataFrame(data)
# Calculate the correlation matrix
correlation_matrix = df.corr()
print(correlation_matrix)
# Output:
#             Variable_A  Variable_B  Variable_C
# Variable_A    1.000000    0.991240   -0.991240
# Variable_B    0.991240    1.000000   -1.000000
# Variable_C   -0.991240   -1.000000    1.000000
This matrix shows the correlation coefficient for each pair of variables. For instance, the correlation between Variable_A and Variable_B is approximately 0.991, indicating a very strong positive linear relationship. The correlation between Variable_A and Variable_C is approximately -0.991, a very strong negative linear relationship. Variable_B and Variable_C have a correlation of -1 because each pair of their values sums to 12, so one is an exact decreasing linear function of the other. The matrix is symmetric, and the diagonal elements are always 1, as a variable is perfectly correlated with itself.
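When only one pair is of interest, the full matrix is not required. As a small sketch using the same example data, a single coefficient can be read from the matrix by label or computed directly from two columns with Series.corr. The variable names r_ab_matrix, r_ab_direct, and r_ba_matrix are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Variable_A': [1, 2, 3, 4, 5, 6],
    'Variable_B': [2, 4, 5, 8, 10, 11],
    'Variable_C': [10, 8, 7, 4, 2, 1],
})

corr = df.corr()  # Pearson is the default method

# Look up one pair in the matrix by row and column label
r_ab_matrix = corr.loc['Variable_A', 'Variable_B']

# Or compute the same pair directly from the two columns
r_ab_direct = df['Variable_A'].corr(df['Variable_B'])

# The matrix is symmetric, so swapping the labels gives the same value
r_ba_matrix = corr.loc['Variable_B', 'Variable_A']

print(r_ab_matrix, r_ab_direct, r_ba_matrix)
```

All three printed values should agree up to floating-point rounding.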
In machine learning, correlation analysis is a fundamental step in Exploratory Data Analysis (EDA). It helps you understand how features relate to one another and to the target variable. High correlation between input features may indicate multicollinearity, which can be problematic for some models, such as linear regression, where it makes coefficient estimates unstable and hard to interpret. High correlation between a feature and the target variable suggests the feature may be predictive, although a low r rules out only a linear relationship, not a nonlinear one.
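As a rough sketch of how this might look during EDA, the example below ranks features by their correlation with a target and flags feature pairs that are highly correlated with each other. The dataset is synthetic, the column names (size_m2, size_ft2, age_years, price) are hypothetical, and the 0.9 threshold is an arbitrary assumption rather than a standard cutoff.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Hypothetical features: size_m2 and size_ft2 are the same quantity in two units
size_m2 = rng.uniform(30, 200, n)
df = pd.DataFrame({
    'size_m2': size_m2,
    'size_ft2': size_m2 * 10.764,
    'age_years': rng.uniform(0, 50, n),
})
# Hypothetical target built from the features plus noise
df['price'] = 3000 * df['size_m2'] - 500 * df['age_years'] + rng.normal(0, 20000, n)

corr = df.corr()

# 1) Feature-target correlations, ordered by absolute strength
target_corr = corr['price'].drop('price')
target_corr = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(target_corr)

# 2) Feature pairs whose |r| exceeds the threshold (possible multicollinearity)
features = corr.drop(index='price', columns='price')
threshold = 0.9  # assumed cutoff for illustration
high_pairs = [
    (a, b, features.loc[a, b])
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > threshold
]
print(high_pairs)
```

Here the two size columns are flagged because one is a rescaled copy of the other. In a case like this, keeping only one of the pair is a common response, though the right choice depends on the model and the data.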
© 2025 ApX Machine Learning