Okay, we've learned how to fit a line to data using least squares and evaluate how well it represents the relationship using metrics like R². But how much faith can we put in the coefficients we estimated (β̂₀, β̂₁) and the predictions our model makes? The simple linear regression model, y = β₀ + β₁x + ϵ, relies on several assumptions, primarily about the error term ϵ. These assumptions are important because they underpin the statistical tests and confidence intervals associated with the model. If these assumptions don't hold, our inferences might be misleading.
Let's examine the standard assumptions for simple linear regression:
Linearity
This is the most fundamental assumption. It states that the relationship between the predictor variable x and the mean of the outcome variable y is linear. Mathematically, this means E[Y | X = x] = β₀ + β₁x.
Why it matters: If the true relationship is non-linear, fitting a straight line will result in a poor model that systematically over- or under-predicts in different ranges of x.
How to check:
Scatter plot: Plot y against x. Look for a generally linear pattern.
Residual plot: Plot the residuals (eᵢ = yᵢ − ŷᵢ) against the predictor variable x (or the fitted values ŷᵢ). If the linearity assumption holds, the residuals should be randomly scattered around the horizontal line at 0. A discernible pattern (like a curve) suggests a violation (see the sketch below).
What if it's violated?: Consider transforming variables (e.g., taking log(y) or transforming x) or using more complex models like polynomial regression or other non-linear techniques.
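To make this concrete, here is a minimal sketch using simulated placeholder data (not anything from earlier in this series): it fits a line with statsmodels and draws the residuals-vs-fitted plot described above. A curved band of residuals in this plot would flag a linearity problem.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated example data: a slightly curved (quadratic) relationship
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 2 + 1.5 * x + 0.3 * x**2 + rng.normal(0, 2, 100)

# Fit a simple linear regression (intercept + x) with OLS
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Residuals vs. fitted values: a curved pattern hints at non-linearity
plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted values")
plt.show()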
Independence of Errors
This assumption states that the errors (ϵᵢ) are independent of each other. In other words, the error associated with one observation does not provide information about the error associated with another observation.
Why it matters: Standard error calculations and hypothesis tests for the coefficients assume independence. Correlated errors often occur in time series data (where an observation at one point in time might be related to the next) or clustered data.
How to check:
Residual plot (vs. time or order): If data has a time sequence or order, plot residuals against the sequence. Look for patterns like runs of positive or negative residuals, or cyclical behavior. Random scatter supports independence.
Context: Consider how the data was collected. Does it make sense for observations to be related?
Statistical tests: Tests like the Durbin-Watson test can formally check for autocorrelation, which is common in time series (see the sketch below).
What if it's violated?: Standard errors will likely be underestimated, leading to overly narrow confidence intervals and potentially incorrect conclusions from hypothesis tests (e.g., finding a variable significant when it isn't). Time series models or methods accounting for data structure might be needed.
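As a rough sketch with simulated, autocorrelated placeholder data, the Durbin-Watson statistic can be computed directly from the OLS residuals using statsmodels. Values near 2 are consistent with independence, while values well below 2 point to positive autocorrelation.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Simulated time-ordered data with autocorrelated AR(1)-style noise
rng = np.random.default_rng(0)
n = 200
x = np.arange(n, dtype=float)
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.8 * noise[t - 1] + rng.normal(0, 1)
y = 1.0 + 0.5 * x + noise

# Fit OLS and compute the Durbin-Watson statistic on the residuals
model = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(model.resid)
print(f"Durbin-Watson statistic: {dw:.2f}")  # ~2 suggests independence; well below 2 suggests positive autocorrelation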
Normality of Errors
The errors (ϵᵢ) are assumed to be normally distributed with a mean of zero.
Why it matters: This assumption is primarily required for the validity of hypothesis tests (like t-tests on coefficients) and the construction of confidence intervals, especially with smaller sample sizes. The least squares estimation itself doesn't strictly require normality, but the inference procedures do.
How to check:
Histogram of residuals: Check if the distribution of residuals looks roughly bell-shaped and centered around zero.
Q-Q Plot (Quantile-Quantile Plot): This plot compares the quantiles of the residuals to the quantiles of a theoretical normal distribution. If residuals are normally distributed, the points should fall approximately along a straight diagonal line. Deviations suggest non-normality.
Example Q-Q plot suggesting approximate normality (points near the dashed line).
Formal tests: Tests like Shapiro-Wilk or Kolmogorov-Smirnov can be used, but visual inspection is often more informative, especially regarding the type of deviation (a brief sketch combining a Q-Q plot and a Shapiro-Wilk test follows below).
What if it's violated?: P-values and confidence intervals may be inaccurate. However, for large sample sizes, the Central Limit Theorem often ensures that the sampling distribution of the estimated coefficients is approximately normal, even if the errors themselves are not. Still, severe non-normality can be problematic. Transformations of y might help.
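Here is a brief sketch, again on simulated placeholder data, showing a Q-Q plot of the residuals alongside a Shapiro-Wilk test, using statsmodels and SciPy.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

# Simulated example data with roughly normal errors
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 150)
y = 3 + 2 * x + rng.normal(0, 1.5, 150)

model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

# Q-Q plot: points close to the reference line suggest approximate normality
sm.qqplot(residuals, line="45", fit=True)
plt.title("Q-Q plot of residuals")
plt.show()

# Shapiro-Wilk test: a small p-value suggests a departure from normality
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk: W={stat:.3f}, p={p_value:.3f}")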
Homoscedasticity (Constant Variance of Errors)
This assumption, also known as homogeneity of variance, states that the variance of the errors (ϵᵢ) is constant across all levels of the predictor variable x. That is, Var(ϵᵢ) = σ² for all i. The opposite is heteroscedasticity, where the error variance changes with x.
Why it matters: Least squares gives equal weight to all observations. If the error variance differs (e.g., the errors are much more spread out for larger values of x), the standard errors for the coefficients become unreliable, affecting hypothesis tests and confidence intervals. The coefficient estimates themselves remain unbiased, but they are no longer the most efficient (minimum variance) among linear unbiased estimators.
How to check:
Residual plot (vs. fitted values or predictor): Plot residuals (eᵢ) against the fitted values (ŷᵢ) or the predictor (xᵢ). Look for a consistent spread of points around the zero line. A funnel shape (variance increasing or decreasing with fitted values/predictor) indicates heteroscedasticity.
Formal tests: Tests like the Breusch-Pagan or White tests can formally check for heteroscedasticity (see the sketch below).
What if it's violated?: Use transformations (e.g., log-transform y if the variance increases with the mean), use weighted least squares (WLS), where observations with smaller variance get more weight, or use robust standard errors (Huber-White standard errors), which adjust for heteroscedasticity.
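The sketch below simulates data whose error spread grows with x, runs the Breusch-Pagan test on the OLS residuals, and then refits with heteroscedasticity-robust (HC1) standard errors as one of the remedies just mentioned; the data and variable names are placeholders.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated data where the error spread grows with x (heteroscedasticity)
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = 1 + 2 * x + rng.normal(0, 0.5 * x)  # noise scale increases with x

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Breusch-Pagan test: a small p-value indicates heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")

# One remedy: refit with heteroscedasticity-robust (HC1) standard errors
robust_model = sm.OLS(y, X).fit(cov_type="HC1")
print(robust_model.summary())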
Checking these assumptions, often through graphical analysis of residuals, is a significant part of the regression modeling process. While no real-world data perfectly satisfies all assumptions, understanding potential violations helps you interpret your model results cautiously and choose appropriate remedies or alternative modeling techniques when necessary. Libraries like Statsmodels in Python provide tools for generating these diagnostic plots.
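For example, here is a minimal sketch (on simulated placeholder data once more) using statsmodels' plot_regress_exog, which bundles several of the residual diagnostics discussed above into a single figure.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated example data for a simple linear regression
rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
df["y"] = 1 + 2 * df["x"] + rng.normal(0, 1, 100)

# Fit with the formula API, then generate diagnostic plots for x:
# fitted vs. x, residuals vs. x, partial regression, and CCPR plots
model = smf.ols("y ~ x", data=df).fit()
fig = plt.figure(figsize=(10, 8))
sm.graphics.plot_regress_exog(model, "x", fig=fig)
plt.show()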