One of the most common tasks in machine learning is prediction. Given a set of input features, we want to predict an output value. The simplest version of this is linear regression, where we try to find a straight line that best fits our data. At first glance, this might seem like a statistics problem, but its formulation and solution are pure linear algebra.
You might remember the equation for a line from school: y=mx+b. In this equation:

- y is the output value we want to predict,
- x is the input value we base the prediction on,
- m is the slope of the line, and
- b is the y-intercept, the value of y when x=0.
In machine learning, we often use different notation. We might write the equation as y=w1x1+w0, where w1 is the weight (slope) for our feature x1, and w0 is the bias term (intercept). Our goal is to find the optimal values for the weights (w1 and w0) that make the line fit our data as closely as possible.
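Before any fitting happens, the model is just this formula. Here is a minimal Python sketch of it; the function name and the weight values are illustrative placeholders, not fitted parameters.

```python
def predict(x1, w1, w0):
    """Predict one output value from one input feature using a line."""
    return w1 * x1 + w0

# Example: a candidate line with slope 0.2 and intercept 10 (placeholder values).
print(predict(x1=1500, w1=0.2, w0=10))  # 310.0
```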
Let's imagine we have a small dataset for predicting house prices based on their size in square feet.
| Size (sq. ft.) | Price ($1000s) |
|---|---|
| 1500 | 300 |
| 2000 | 410 |
| 1200 | 270 |
| 1800 | 350 |
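To work with this table numerically, we can store each column in a NumPy array. The variable names `sizes` and `prices` are just illustrative.

```python
import numpy as np

# The dataset from the table above.
sizes = np.array([1500.0, 2000.0, 1200.0, 1800.0])   # sq. ft.
prices = np.array([300.0, 410.0, 270.0, 350.0])      # $1000s
```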
If we want our line to pass through every point perfectly, we would need to satisfy a system of equations:
$$
\begin{aligned}
300 &= w_1(1500) + w_0 \\
410 &= w_1(2000) + w_0 \\
270 &= w_1(1200) + w_0 \\
350 &= w_1(1800) + w_0
\end{aligned}
$$
The problem is immediately clear. It is very unlikely that a single line can pass through all four of these points. Data is noisy. Instead of looking for a perfect solution, we look for the line that minimizes the overall error.
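"Overall error" can be measured in several ways; the standard choice for linear regression, and the one minimized by the normal equations mentioned at the end of this section, is the sum of squared errors. A small sketch, using the arrays defined above:

```python
import numpy as np

sizes = np.array([1500.0, 2000.0, 1200.0, 1800.0])   # sq. ft.
prices = np.array([300.0, 410.0, 270.0, 350.0])      # $1000s

def sum_of_squared_errors(w1, w0):
    """Total squared gap between the line w1*x + w0 and the actual prices."""
    predictions = w1 * sizes + w0
    return np.sum((prices - predictions) ** 2)

# The candidate line with the smaller total error fits the data more closely.
print(sum_of_squared_errors(0.2, 10.0))
print(sum_of_squared_errors(0.15, 50.0))
```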
Finding the single line that best represents the relationship shown by the data points is the goal of linear regression.
This is where linear algebra provides an elegant and powerful way to represent the problem. As we saw in Chapter 4, a system of linear equations can be written in the matrix form Ax=b. Let's assemble our components.
The vector b (or y in this context) contains our target values, the house prices.
$$
y = \begin{bmatrix} 300 \\ 410 \\ 270 \\ 350 \end{bmatrix}
$$

The vector x contains the unknown parameters we are trying to find: the slope w1 and the intercept w0.
$$
x = \begin{bmatrix} w_1 \\ w_0 \end{bmatrix}
$$

The matrix A is the most interesting part. It's often called the "design matrix". Each row corresponds to a data point, and each column corresponds to a feature. Our equations are of the form w1⋅(size) + w0⋅1, so the first column of A will hold the house sizes, and the second column will represent the constant term for the intercept w0. To make the multiplication work, we fill this second column with ones.
$$
A = \begin{bmatrix} 1500 & 1 \\ 2000 & 1 \\ 1200 & 1 \\ 1800 & 1 \end{bmatrix}
$$

Wait, why did we add a column of ones? Let's check what happens when we perform the matrix-vector multiplication Ax:
$$
Ax = \begin{bmatrix} 1500 & 1 \\ 2000 & 1 \\ 1200 & 1 \\ 1800 & 1 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_0 \end{bmatrix}
= \begin{bmatrix} 1500 \cdot w_1 + 1 \cdot w_0 \\ 2000 \cdot w_1 + 1 \cdot w_0 \\ 1200 \cdot w_1 + 1 \cdot w_0 \\ 1800 \cdot w_1 + 1 \cdot w_0 \end{bmatrix}
$$

Note: the entries of x are ordered so that the feature column of A pairs with the slope w1 and the column of ones pairs with the intercept w0, matching the (size) ⋅ w1 + 1 ⋅ w0 form, which is a common convention.
This multiplication perfectly reconstructs our original system of equations. Our entire problem can now be stated as finding the vector x that makes Ax as close as possible to y.
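The same construction is easy to verify in NumPy: build the design matrix from the feature column and a column of ones, then check that a single matrix-vector product yields one prediction per house. The parameter values below are placeholders, not fitted weights.

```python
import numpy as np

sizes = np.array([1500.0, 2000.0, 1200.0, 1800.0])

# Design matrix A: one row per house, the feature column first,
# then a column of ones that multiplies the intercept w0.
A = np.column_stack([sizes, np.ones_like(sizes)])

# A candidate parameter vector x = [w1, w0] (placeholder values).
x = np.array([0.2, 10.0])

# A @ x evaluates every equation of the system in one product:
# row i gives 0.2 * size_i + 10, the predicted price of house i.
print(A @ x)  # [310. 410. 250. 370.]
```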
The full problem is expressed as:
$$
Ax \approx y
$$

This representation is much more than just a notational shortcut. It has several significant advantages:

- The same compact form works for any number of data points and any number of features; each additional feature simply adds a column to A.
- It lets us bring standard matrix operations, such as the transpose and inverse, to bear on finding the best-fit parameters.
We have now framed linear regression as a linear algebra problem. We are looking for the vector x that best solves the equation Ax=y. Since a perfect solution rarely exists, the next step, which we won't solve here, involves using matrix operations like the transpose and inverse to find the vector x that minimizes the error between the predicted values (Ax) and the actual values (y). This method is formally known as solving the "Normal Equations" and is a direct application of the tools you've learned in previous chapters.
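As a preview of that step, here is a minimal sketch of the normal equations in NumPy: it solves (AᵀA)x = Aᵀy for the least-squares parameters. NumPy's `np.linalg.lstsq` performs the same least-squares fit and is generally preferred numerically over forming AᵀA explicitly.

```python
import numpy as np

sizes = np.array([1500.0, 2000.0, 1200.0, 1800.0])
prices = np.array([300.0, 410.0, 270.0, 350.0])
A = np.column_stack([sizes, np.ones_like(sizes)])

# Normal equations: solve (A^T A) x = A^T y for x = [w1, w0].
w1, w0 = np.linalg.solve(A.T @ A, A.T @ prices)
print("slope:", w1, "intercept:", w0)

# The built-in least-squares solver gives the same fit more robustly.
x_best, residuals, rank, _ = np.linalg.lstsq(A, prices, rcond=None)
```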