While simple linear regression helps us understand the relationship between a single predictor variable and a response, many real-world phenomena are influenced by multiple factors simultaneously. Predicting house prices, for example, likely involves considering not just square footage but also the number of bedrooms, location, age of the house, and so on. Multiple linear regression extends the concepts we've learned to handle these situations with multiple predictor variables.
The core idea is to expand the simple linear regression equation to include terms for each predictor variable. If we have p predictor variables, x1,x2,...,xp, the multiple linear regression model is defined as:
y = β0 + β1x1 + β2x2 + ... + βpxp + ϵ

Let's break down the components: y is the response variable we want to predict; β0 is the intercept, the expected value of y when every predictor equals zero; β1,...,βp are the coefficients, each describing the effect of its corresponding predictor on y; x1,x2,...,xp are the predictor variables; and ϵ is the error term, capturing the variation in y that the predictors do not explain.
This model represents a linear relationship between the predictors and the response. While simple linear regression describes a line, multiple linear regression with two predictors describes a plane, and with more than two predictors, it describes a hyperplane in higher-dimensional space.
A multiple linear regression model relates multiple predictor variables (x1,x2,...,xp) to a single response variable (y) through estimated coefficients (β1,...,βp) and an intercept (β0).
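To make the equation concrete, here is a small sketch that plugs hypothetical coefficient values into the house-price example from above; all numbers are illustrative, not estimates from real data.

```python
import numpy as np

# Hypothetical coefficients for a house-price model (illustrative values only):
# beta_0 (intercept), beta_1 (per square foot), beta_2 (per bedroom), beta_3 (per year of age)
beta = np.array([50_000.0, 120.0, 8_000.0, -500.0])

# One observation: [square_feet, bedrooms, age_in_years]
x = np.array([1_500.0, 3.0, 20.0])

# y_hat = beta_0 + beta_1*x1 + beta_2*x2 + beta_3*x3
y_hat = beta[0] + beta[1:] @ x
print(f"Predicted price: {y_hat:,.0f}")  # 50,000 + 180,000 + 24,000 - 10,000 = 244,000
```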
The interpretation of coefficients in multiple regression requires careful consideration. Each coefficient, βj, represents the expected change in the response variable y for a one-unit increase in the corresponding predictor variable xj, assuming all other predictor variables in the model are held constant.
This "holding other variables constant" aspect is important. The value of β1 in a multiple regression model y=β0+β1x1+β2x2+ϵ might be different from the coefficient for x1 in a simple linear regression y=β0+β1x1+ϵ. This is because the multiple regression coefficient accounts for the influence of x2 on y, isolating the unique contribution of x1.
Just as in simple linear regression, the coefficients (β0,β1,...,βp) are typically estimated using the method of least squares. The goal remains the same: find the coefficients that minimize the sum of the squared differences between the observed values (yi) and the values predicted by the model (ŷi). The calculations now involve matrix algebra, since several coefficients must be estimated at once, but the underlying principle is identical. Fortunately, libraries like Scikit-learn and Statsmodels handle these computations for us.
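As a minimal sketch of what this looks like in practice, the example below fits an ordinary least squares model with Statsmodels on synthetic data; the predictor values and coefficients are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200

# Synthetic data with three predictors; the coefficient values are arbitrary
X = rng.normal(size=(n, 3))
true_beta = np.array([1.5, -2.0, 0.7])
y = 5.0 + X @ true_beta + rng.normal(scale=1.0, size=n)

# statsmodels expects the intercept column to be added explicitly
X_design = sm.add_constant(X)
results = sm.OLS(y, X_design).fit()   # ordinary least squares fit

print(results.params)     # estimates of beta_0, beta_1, beta_2, beta_3
print(results.summary())  # coefficient table, standard errors, R-squared, ...
```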
Model evaluation also uses familiar metrics: R-squared (R2), which measures the proportion of variance in the response explained by the model, and residual-based error measures such as the mean squared error (MSE) or root mean squared error (RMSE).
However, R2 has a limitation in the multiple regression context: it always increases or stays the same when you add more predictors to the model, even if those predictors are irrelevant. This can be misleading.
To address this, we often use Adjusted R-squared. This metric modifies R2 by penalizing the inclusion of extra predictors that do not significantly improve the model's fit. Adjusted R2 provides a more honest assessment when comparing models with different numbers of predictors. It increases only if the added variable improves the model more than would be expected by chance.
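Adjusted R2 is computed as 1 − (1 − R2)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. The sketch below (using a hypothetical adjusted_r2 helper and synthetic data) shows how adding an irrelevant predictor can nudge R2 up while adjusted R2 is penalized.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(2)
n = 100

# Two informative predictors, plus one column of pure noise added afterwards
X = rng.normal(size=(n, 2))
y = 3 + 2 * X[:, 0] - 1 * X[:, 1] + rng.normal(scale=1.0, size=n)
X_extra = np.column_stack([X, rng.normal(size=n)])  # irrelevant third predictor

for name, features in [("2 predictors", X), ("3 predictors", X_extra)]:
    model = LinearRegression().fit(features, y)
    r2 = r2_score(y, model.predict(features))
    p = features.shape[1]
    # Plain R-squared can only stay the same or increase when a predictor is added;
    # adjusted R-squared penalizes the extra term and will typically drop here.
    print(f"{name}: R2 = {r2:.4f}, adjusted R2 = {adjusted_r2(r2, n, p):.4f}")
```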
The assumptions underlying simple linear regression generally extend to multiple linear regression: the relationship between the predictors and the response is linear, the errors are independent of one another, they have constant variance (homoscedasticity), and they are approximately normally distributed.
In addition, multiple regression introduces a new potential issue: multicollinearity, which arises when predictor variables are highly correlated with one another. Strong multicollinearity inflates the variance of the coefficient estimates, making them unstable and difficult to interpret, even if the model's overall predictive accuracy is unaffected. A common diagnostic is the variance inflation factor (VIF), sketched below.
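As a rough sketch of how this is checked in practice, the example below computes variance inflation factors with Statsmodels on synthetic data in which one predictor is nearly a duplicate of another.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 300

# Synthetic predictors: x2 is almost a copy of x1, x3 is independent
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# A VIF near 1 indicates little collinearity; values above roughly 5-10
# are commonly treated as a warning sign.
for i, name in enumerate(X.columns):
    if name == "const":
        continue  # the intercept column itself is not of interest
    print(name, round(variance_inflation_factor(X.values, i), 2))
```

Expect very large VIF values for x1 and x2 (they carry nearly the same information) and a value close to 1 for x3.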
Building effective multiple regression models often involves selecting the most relevant predictors (feature selection), checking for interactions between predictors (where the effect of one predictor depends on the level of another), and validating the model assumptions.
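As one illustration, an interaction term can be included directly with the Statsmodels formula interface; the sketch below uses synthetic data in which the effect of x1 genuinely depends on x2, with made-up coefficient values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 200

# Synthetic data where the effect of x1 depends on the level of x2 (an interaction)
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = (1 + 2 * df["x1"] + 0.5 * df["x2"]
           + 1.5 * df["x1"] * df["x2"]
           + rng.normal(scale=0.3, size=n))

# "x1 * x2" expands to x1 + x2 + x1:x2, i.e. both main effects plus the interaction term
results = smf.ols("y ~ x1 * x2", data=df).fit()
print(results.params)  # should recover roughly 1, 2, 0.5, and 1.5
```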
This overview sets the stage for applying these concepts. In practice, you'll use software tools to fit these models, interpret their output, and diagnose potential problems, allowing you to build more comprehensive predictive models from your data.