In simple linear regression, our goal is to find the best straight line $y = \beta_0 + \beta_1 x$ that describes the relationship between our predictor variable $x$ and the target variable $y$. But what does "best" mean mathematically? We need a precise way to quantify how well a line fits the observed data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
Imagine drawing a potential line through a scatter plot of our data. For each data point $(x_i, y_i)$, the line predicts a value $\hat{y}_i = \beta_0 + \beta_1 x_i$. The difference between the actual observed value $y_i$ and the predicted value $\hat{y}_i$ is called the residual or error for that point:
$$e_i = y_i - \hat{y}_i = y_i - (\beta_0 + \beta_1 x_i)$$
The residual $e_i$ represents the vertical distance between the actual data point and the line. A good line should have small residuals overall.
Simply summing the residuals ($\sum_{i=1}^{n} e_i$) isn't a good measure of fit. Why? Because positive residuals (points above the line) and negative residuals (points below the line) can cancel each other out. A line could have a sum of residuals equal to zero but still be a terrible fit for the data.
To overcome this cancellation issue and ensure that all errors contribute positively to our measure of misfit, we square each residual before summing them up. This gives us the Sum of Squared Residuals (SSR), also known as the Residual Sum of Squares (RSS) or Sum of Squared Errors (SSE):
$$\mathrm{SSR} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2$$
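To make the contrast concrete, the short sketch below (using a small, made-up dataset chosen only for illustration) computes the residuals of a deliberately poor candidate line: the plain sum of residuals cancels to roughly zero, while the sum of squared residuals clearly exposes the misfit.

```python
import numpy as np

# Small, made-up dataset used purely for illustration.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.5, 4.5, 4.0, 6.0])

def residuals(beta0, beta1, x, y):
    """Residuals e_i = y_i - (beta0 + beta1 * x_i) for a candidate line."""
    return y - (beta0 + beta1 * x)

# A deliberately poor candidate: a horizontal line at the mean of y (3.8 here).
e = residuals(beta0=3.8, beta1=0.0, x=x, y=y)

print("Sum of residuals:", e.sum())               # effectively 0: errors cancel
print("Sum of squared residuals:", (e ** 2).sum())  # about 10.3: the misfit is exposed
```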
The Method of Least Squares is the principle that defines the "best" fitting line as the one that minimizes this Sum of Squared Residuals (SSR). We want to find the specific values for the intercept ($\beta_0$) and the slope ($\beta_1$) that make the sum of the squared vertical distances from the data points to the line as small as possible.
How do we find these optimal values, typically denoted as $\hat{\beta}_0$ and $\hat{\beta}_1$? This involves using calculus. We treat the SSR as a function of $\beta_0$ and $\beta_1$. We take the partial derivative of the SSR with respect to $\beta_0$ and the partial derivative with respect to $\beta_1$, set both derivatives equal to zero, and solve the resulting system of two linear equations (often called the "normal equations"). While we won't work through the full derivation here, the solution provides formulas for the estimated coefficients.
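For reference, setting the two partial derivatives to zero produces the normal equations:

$$\sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i, \qquad \sum_{i=1}^{n} x_i y_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2$$

Solving this pair of equations for $\beta_0$ and $\beta_1$ gives the estimates below.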
The estimated slope ($\hat{\beta}_1$) is calculated as:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Here, $\bar{x}$ is the mean of the predictor variable values, and $\bar{y}$ is the mean of the target variable values. Notice that the numerator is proportional to the sample covariance between $x$ and $y$, and the denominator is proportional to the sample variance of $x$. Intuitively, the slope reflects how much $y$ tends to change with $x$ (covariance) relative to how much $x$ varies on its own (variance).
Once we have the estimated slope $\hat{\beta}_1$, the estimated intercept ($\hat{\beta}_0$) is calculated simply as:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
This formula ensures that the regression line passes through the point $(\bar{x}, \bar{y})$, the center of mass of the data.
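As a sketch of how these formulas translate to code, the snippet below (using NumPy and a small illustrative dataset whose values are made up for demonstration) computes $\hat{\beta}_1$ and $\hat{\beta}_0$ directly, and checks the covariance-over-variance view of the slope:

```python
import numpy as np

# Small illustrative dataset (made-up values).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.5, 3.1, 4.5, 4.9, 6.2, 6.8, 8.1, 8.5])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates for simple linear regression.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
print(f"slope: {beta1_hat:.3f}, intercept: {beta0_hat:.3f}")  # approx. 0.895 and 1.546 here

# Equivalent view: slope = sample covariance(x, y) / sample variance(x);
# the (n - 1) normalizing factors cancel in the ratio.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)
assert np.isclose(beta1_hat, cov_xy / var_x)
```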
Let's visualize this. Imagine the scatter plot of data points. The least squares method finds the unique line that minimizes the sum of the areas of the squares whose sides are the residuals.
{"data": [{"x": [1, 2, 3, 4, 5, 6, 7, 8], "y": [2.5, 3.1, 4.5, 4.9, 6.2, 6.8, 8.1, 8.5], "mode": "markers", "type": "scatter", "name": "Data Points", "marker": {"color": "#339af0", "size": 8}}, {"x": [1, 8], "y": [2.5 + 0 * 1 + 0.85 * 1, 2.5 + 0 * 8 + 0.85 * 8], "mode": "lines", "type": "scatter", "name": "Least Squares Line", "line": {"color": "#f03e3e", "width": 2}}, {"x": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8], "y": [2.5, 3.35, 3.1, 4.2, 4.5, 5.05, 4.9, 5.9, 6.2, 6.75, 6.8, 7.6, 8.1, 8.45, 8.5, 9.3], "mode": "lines", "type": "scatter", "name": "Residuals", "line": {"color": "#adb5bd", "width": 1, "dash": "dot"}, "showlegend": False}], "layout": {"title": "Least Squares Regression Line and Residuals", "xaxis": {"title": "Predictor Variable (x)"}, "yaxis": {"title": "Target Variable (y)", "range": [0, 10]}, "shapes": [{"type": "rect", "x0": 1 - 0.425, "y0": 2.5, "x1": 1 + 0.425, "y1": 3.35, "line": {"color": "#adb5bd"}, "fillcolor": "#e9ecef", "opacity": 0.4}, {"type": "rect", "x0": 2 - 0.55, "y0": 3.1, "x1": 2 + 0.55, "y1": 4.2, "line": {"color": "#adb5bd"}, "fillcolor": "#e9ecef", "opacity": 0.4}, {"type": "rect", "x0": 3 - 0.275, "y0": 4.5, "x1": 3 + 0.275, "y1": 5.05, "line": {"color": "#adb5bd"}, "fillcolor": "#e9ecef", "opacity": 0.4}, {"type": "rect", "x0": 4 - 0.5, "y0": 4.9, "x1": 4 + 0.5, "y1": 5.9, "line": {"color": "#adb5bd"}, "fillcolor": "#e9ecef", "opacity": 0.4}, {"type": "rect", "x0": 5 - 0.275, "y0": 6.2, "x1": 5 + 0.275, "y1": 6.75, "line": {"color": "#adb5bd"}, "fillcolor": "#e9ecef", "opacity": 0.4}, {"type": "rect", "x0": 6 - 0.4, "y0": 6.8, "x1": 6 + 0.4, "y1": 7.6, "line": {"color": "#adb5bd"}, "fillcolor": "#e9ecef", "opacity": 0.4}, {"type": "rect", "x0": 7 - 0.175, "y0": 8.1, "x1": 7 + 0.175, "y1": 8.45, "line": {"color": "#adb5bd"}, "fillcolor": "#e9ecef", "opacity": 0.4}, {"type": "rect", "x0": 8 - 0.4, "y0": 8.5, "x1": 8 + 0.4, "y1": 9.3, "line": {"color": "#adb5bd"}, "fillcolor": "#e9ecef", "opacity": 0.4}], "legend": {"yanchor": "top", "y": 0.99, "xanchor": "left", "x": 0.01}}}
Data points (blue), the calculated least squares regression line (red), and the residuals (vertical gray lines). The method minimizes the sum of the areas of the gray squares formed by these residuals.
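If you want to draw a similar picture yourself, a minimal matplotlib sketch along these lines (again with made-up data; only the residual segments are drawn, without the shaded squares) could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data (made-up values).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.5, 3.1, 4.5, 4.9, 6.2, 6.8, 8.1, 8.5])

# Closed-form least squares fit (same formulas as above).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

fig, ax = plt.subplots()
ax.scatter(x, y, label="Data Points")
ax.plot(x, y_hat, color="red", label="Least Squares Line")
# Each residual drawn as a vertical segment between the point and the line.
ax.vlines(x, np.minimum(y, y_hat), np.maximum(y, y_hat),
          colors="gray", linestyles="dotted", label="Residuals")
ax.set_xlabel("Predictor Variable (x)")
ax.set_ylabel("Target Variable (y)")
ax.set_title("Least Squares Regression Line and Residuals")
ax.legend()
plt.show()
```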
Calculating these sums and formulas manually can be tedious for large datasets. Fortunately, statistical software and libraries in programming languages like Python handle these computations efficiently. Libraries such as Scikit-learn and Statsmodels provide functions that implement the method of least squares (and more advanced techniques) to estimate regression coefficients directly from your data, allowing you to focus on interpreting the results and evaluating the model. We will see how to use these libraries later in the chapter.
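As a brief preview (a minimal sketch only; the details come later in the chapter), fitting a simple linear regression with Scikit-learn's LinearRegression, which performs ordinary least squares, looks roughly like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data (made-up values); scikit-learn expects a 2-D feature array.
X = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float).reshape(-1, 1)
y = np.array([2.5, 3.1, 4.5, 4.9, 6.2, 6.8, 8.1, 8.5])

model = LinearRegression()  # ordinary least squares under the hood
model.fit(X, y)

print("intercept (beta_0 hat):", model.intercept_)
print("slope (beta_1 hat):", model.coef_[0])
```

Statsmodels provides a comparable OLS interface that also reports a detailed statistical summary of the fit.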