One of the most common tasks in machine learning is prediction. Given a set of input features, we want to predict an output value. The simplest version of this is linear regression, where we try to find a straight line that best fits our data. At first glance, this might seem like a statistics problem, but its formulation and solution are pure linear algebra.
You might remember the equation for a line from school: y=mx+b. In this equation:

- y is the output value we want to predict,
- x is the input value we base the prediction on,
- m is the slope of the line, and
- b is the y-intercept, the value of y when x=0.
In machine learning, we often use different notation. We might write the equation as y=w1x1+w0, where w1 is the weight (slope) for our feature x1, and w0 is the bias term (intercept). Our goal is to find the optimal values for the weights (w1 and w0) that make the line fit our data as closely as possible.
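Before any fitting happens, the model is just this formula. Here is a minimal Python sketch of it; the function name and the weight values are illustrative placeholders, not fitted parameters.

```python
def predict(x1, w1, w0):
    """Predict one output value from one input feature using a line."""
    return w1 * x1 + w0

# Example: a candidate line with slope 0.2 and intercept 10 (placeholder values).
print(predict(x1=1500, w1=0.2, w0=10))  # 310.0
```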
Let's imagine we have a small dataset for predicting house prices based on their size in square feet.
| Size (sq. ft.) | Price ($1000s) |
|---|---|
| 1500 | 300 |
| 2000 | 410 |
| 1200 | 270 |
| 1800 | 350 |
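To work with this table numerically, we can store each column in a NumPy array. The variable names `sizes` and `prices` are just illustrative.

```python
import numpy as np

# The dataset from the table above.
sizes = np.array([1500.0, 2000.0, 1200.0, 1800.0])   # sq. ft.
prices = np.array([300.0, 410.0, 270.0, 350.0])      # $1000s
```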
If we want our line to pass through every point perfectly, we would need to satisfy a system of equations:
$$
\begin{aligned}
300 &= w_1(1500) + w_0 \\
410 &= w_1(2000) + w_0 \\
270 &= w_1(1200) + w_0 \\
350 &= w_1(1800) + w_0
\end{aligned}
$$
The problem is immediately clear. It is very unlikely that a single line can pass through all four of these points. Data is noisy. Instead of looking for a perfect solution, we look for the line that minimizes the overall error.
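"Overall error" can be measured in several ways; the standard choice for linear regression, and the one minimized by the normal equations mentioned at the end of this section, is the sum of squared errors. A small sketch, using the arrays defined above:

```python
import numpy as np

sizes = np.array([1500.0, 2000.0, 1200.0, 1800.0])   # sq. ft.
prices = np.array([300.0, 410.0, 270.0, 350.0])      # $1000s

def sum_of_squared_errors(w1, w0):
    """Total squared gap between the line w1*x + w0 and the actual prices."""
    predictions = w1 * sizes + w0
    return np.sum((prices - predictions) ** 2)

# The candidate line with the smaller total error fits the data more closely.
print(sum_of_squared_errors(0.2, 10.0))
print(sum_of_squared_errors(0.15, 50.0))
```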
Finding the single line that best represents the relationship shown by the data points is the goal of linear regression.
This is where linear algebra provides an elegant and powerful way to represent the problem. As we saw in Chapter 4, a system of linear equations can be written in the matrix form Ax=b. Let's assemble our components.
The vector b (or y in this context) contains our target values, the house prices.
$$
y = \begin{bmatrix} 300 \\ 410 \\ 270 \\ 350 \end{bmatrix}
$$

The vector x contains the unknown parameters we are trying to find: the slope w1 and the intercept w0.
$$
x = \begin{bmatrix} w_1 \\ w_0 \end{bmatrix}
$$

The matrix A is the most interesting part. It's often called the "design matrix". Each row corresponds to a data point, and each column corresponds to a feature. Our equations are of the form w1⋅(size) + w0⋅1, so the first column of A will hold the house sizes, and the second column will represent the constant term for the intercept w0. To make the multiplication work, we fill this second column with ones.
$$
A = \begin{bmatrix} 1500 & 1 \\ 2000 & 1 \\ 1200 & 1 \\ 1800 & 1 \end{bmatrix}
$$

Wait, why did we add a column of ones? Let's check what happens when we perform the matrix-vector multiplication Ax:
$$
Ax = \begin{bmatrix} 1500 & 1 \\ 2000 & 1 \\ 1200 & 1 \\ 1800 & 1 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_0 \end{bmatrix}
= \begin{bmatrix} 1500 \cdot w_1 + 1 \cdot w_0 \\ 2000 \cdot w_1 + 1 \cdot w_0 \\ 1200 \cdot w_1 + 1 \cdot w_0 \\ 1800 \cdot w_1 + 1 \cdot w_0 \end{bmatrix}
$$

Note: the entries of x are ordered so that the feature column of A pairs with the slope w1 and the column of ones pairs with the intercept w0, matching the (size) ⋅ w1 + 1 ⋅ w0 form, which is a common convention.
This multiplication perfectly reconstructs our original system of equations. Our entire problem can now be stated as finding the vector x that makes Ax as close as possible to y.
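The same construction is easy to verify in NumPy: build the design matrix from the feature column and a column of ones, then check that a single matrix-vector product yields one prediction per house. The parameter values below are placeholders, not fitted weights.

```python
import numpy as np

sizes = np.array([1500.0, 2000.0, 1200.0, 1800.0])

# Design matrix A: one row per house, the feature column first,
# then a column of ones that multiplies the intercept w0.
A = np.column_stack([sizes, np.ones_like(sizes)])

# A candidate parameter vector x = [w1, w0] (placeholder values).
x = np.array([0.2, 10.0])

# A @ x evaluates every equation of the system in one product:
# row i gives 0.2 * size_i + 10, the predicted price of house i.
print(A @ x)  # [310. 410. 250. 370.]
```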
The full problem is expressed as:
$$
Ax \approx y
$$

This representation is much more than just a notational shortcut. It has several significant advantages:

- The same compact form works for any number of data points and any number of features; each additional feature simply adds a column to A.
- It lets us bring standard matrix operations, such as the transpose and inverse, to bear on finding the best-fit parameters.
We have now framed linear regression as a linear algebra problem. We are looking for the vector x that best solves the equation Ax=y. Since a perfect solution rarely exists, the next step, which we won't solve here, involves using matrix operations like the transpose and inverse to find the vector x that minimizes the error between the predicted values (Ax) and the actual values (y). This method is formally known as solving the "Normal Equations" and is a direct application of the tools you've learned in previous chapters.
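As a preview of that step, here is a minimal sketch of the normal equations in NumPy: it solves (AᵀA)x = Aᵀy for the least-squares parameters. NumPy's `np.linalg.lstsq` performs the same least-squares fit and is generally preferred numerically over forming AᵀA explicitly.

```python
import numpy as np

sizes = np.array([1500.0, 2000.0, 1200.0, 1800.0])
prices = np.array([300.0, 410.0, 270.0, 350.0])
A = np.column_stack([sizes, np.ones_like(sizes)])

# Normal equations: solve (A^T A) x = A^T y for x = [w1, w0].
w1, w0 = np.linalg.solve(A.T @ A, A.T @ prices)
print("slope:", w1, "intercept:", w0)

# The built-in least-squares solver gives the same fit more robustly.
x_best, residuals, rank, _ = np.linalg.lstsq(A, prices, rcond=None)
```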