Regression analysis helps us understand and quantify the relationship between variables. The simplest starting point is modeling how a single variable, the dependent or response variable, changes with another single variable, the independent or predictor variable. This is the domain of simple linear regression (SLR).
Imagine you have data on two variables, say, years of experience ($x$) and salary ($y$). You might suspect that as experience increases, salary tends to increase as well. Simple linear regression provides a formal way to model this suspected linear relationship.
At its core, simple linear regression assumes that the relationship between the independent variable $x$ and the dependent variable $y$ can be approximated by a straight line. However, real-world data rarely falls perfectly on a line. There's almost always some scatter or variability. To account for this, the theoretical model for simple linear regression is written as:
$$y = \beta_0 + \beta_1 x + \epsilon$$
Let's break down this equation:

- $y$ is the dependent (response) variable we are trying to explain or predict.
- $x$ is the independent (predictor) variable.
- $\beta_0$ is the population intercept: the expected value of $y$ when $x = 0$.
- $\beta_1$ is the population slope: the expected change in $y$ for a one-unit increase in $x$.
- $\epsilon$ is the random error term, capturing the variability in $y$ that the straight line does not explain.

Think of the $\beta_0 + \beta_1 x$ part as the deterministic, linear component of the relationship, and $\epsilon$ as the random, unexplained component.
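To make the two components concrete, here is a minimal sketch that simulates data from this model in Python with NumPy. The parameter values ($\beta_0 = 30000$, $\beta_1 = 5000$) and the normal error distribution are assumptions chosen to loosely match the experience/salary example, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population parameters (assumed for illustration only)
beta0 = 30000.0  # intercept: baseline salary at zero years of experience
beta1 = 5000.0   # slope: expected salary increase per year of experience

# Simulated years of experience for 50 individuals
x = rng.uniform(0, 20, size=50)

# Deterministic, linear component: beta0 + beta1 * x
deterministic = beta0 + beta1 * x

# Random error term epsilon: scatter around the line (normality assumed here)
epsilon = rng.normal(loc=0.0, scale=8000.0, size=50)

# Observed responses combine both components: y = beta0 + beta1*x + epsilon
y = deterministic + epsilon
```

Each simulated point deviates from the line $\beta_0 + \beta_1 x$ only through its error $\epsilon$, which is exactly the kind of scatter we expect to see in real data.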
The equation $y = \beta_0 + \beta_1 x + \epsilon$ describes the theoretical relationship in the entire population. In practice, we rarely have access to population data. Instead, we work with a sample drawn from the population. Our goal is to use the sample data to estimate the unknown population parameters $\beta_0$ and $\beta_1$.
We denote the estimates calculated from the sample data as $b_0$ (or sometimes $\hat{\beta}_0$) and $b_1$ (or $\hat{\beta}_1$). The estimated regression line based on the sample is then:
$$\hat{y} = b_0 + b_1 x$$
Here, $\hat{y}$ (read "y-hat") represents the predicted value of $y$ for a given value of $x$, based on our sample estimates. The difference between an observed value $y_i$ in our sample and its corresponding predicted value $\hat{y}_i$ is the sample residual, $e_i = y_i - \hat{y}_i$. These sample residuals $e_i$ are our observable stand-ins for the unobservable theoretical errors $\epsilon_i$.
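As a quick illustration, the snippet below computes predictions and residuals on a small hypothetical dataset. The estimates $b_0$ and $b_1$ here are placeholders, since choosing the "best" values is the topic of the next section:

```python
import numpy as np

# Small hypothetical sample: years of experience and observed salaries
x = np.array([1.0, 3.0, 5.0, 8.0, 12.0])
y = np.array([36000.0, 47000.0, 52000.0, 71000.0, 90000.0])

# Assumed estimates b0 and b1 (placeholders; how to compute the "best"
# values is addressed in the next section)
b0 = 30000.0
b1 = 5000.0

# Predicted values: y_hat_i = b0 + b1 * x_i
y_hat = b0 + b1 * x

# Sample residuals: e_i = y_i - y_hat_i
residuals = y - y_hat

for xi, yi, yhi, ei in zip(x, y, y_hat, residuals):
    print(f"x={xi:5.1f}  y={yi:8.0f}  y_hat={yhi:8.0f}  e={ei:7.0f}")
```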
The challenge, which we'll address in the next section, is how to find the "best" values for $b_0$ and $b_1$ based on our sample data points $(x_i, y_i)$.
A scatter plot is the ideal way to visualize the data and the potential linear relationship before fitting a model. Simple linear regression essentially tries to find the line that best cuts through the cloud of points on this scatter plot.
Figure: A scatter plot showing individual data points (blue dots) and a candidate line (red dashed) representing the estimated model $\hat{y} = b_0 + b_1 x$. The goal is to find the line that minimizes the overall distance between the line and the points.
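A plot like this one could be produced with Matplotlib along the following lines. The data and the candidate estimates $b_0$, $b_1$ are hypothetical, chosen only to reproduce the look of the figure:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)

# Hypothetical experience/salary data scattered around a line
x = rng.uniform(0, 20, size=40)
y = 30000 + 5000 * x + rng.normal(0.0, 8000.0, size=40)

# A candidate line, with estimates assumed for illustration
b0, b1 = 30000.0, 5000.0
x_line = np.linspace(x.min(), x.max(), 100)

plt.scatter(x, y, color="blue", label="Observed data")
plt.plot(x_line, b0 + b1 * x_line, "r--", label=r"$\hat{y} = b_0 + b_1 x$")
plt.xlabel("Years of experience")
plt.ylabel("Salary")
plt.legend()
plt.show()
```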
Understanding this basic model structure is fundamental. It not only allows us to model simple relationships but also serves as the foundation for more complex regression techniques, including multiple linear regression (using multiple predictors) and polynomial regression (modeling curves), which are frequently used in machine learning.