One of the most common tasks in machine learning is prediction. Given a set of input features, we want to predict an output value. The simplest version of this is linear regression, where we try to find a straight line that best fits our data. At first glance, this might seem like a statistics problem, but its formulation and solution are pure linear algebra.
You might remember the equation for a line from school: $y = mx + b$. In this equation:

- $m$ is the slope of the line, controlling how steep it is.
- $b$ is the y-intercept, the value of $y$ when $x = 0$.
In machine learning, we often use different notation. We might write the equation as $y = w_1 x + w_0$, where $w_1$ is the weight (slope) for our feature $x$, and $w_0$ is the bias term (intercept). Our goal is to find the optimal values for the weights ($w_0$ and $w_1$) that make the line fit our data as closely as possible.
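To make the notation concrete, here is a minimal Python sketch of that prediction rule. The weight values used are arbitrary placeholders chosen for illustration, not fitted parameters.

```python
# A minimal sketch of the prediction rule y = w1 * x + w0.
# The weight values below are arbitrary placeholders, not fitted parameters.
def predict(size_sqft, w1, w0):
    """Predict a house price (in $1000s) from its size using a line."""
    return w1 * size_sqft + w0

# With a guessed slope of 0.2 and intercept of 10:
print(predict(1500, w1=0.2, w0=10))  # 310.0
```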
Let's imagine we have a small dataset for predicting house prices based on their size in square feet.
| Size (sq. ft.) | Price ($1000s) |
|---|---|
| 1500 | 300 |
| 2000 | 410 |
| 1200 | 270 |
| 1800 | 350 |
If we want our line to pass through every point perfectly, we would need to satisfy a system of equations:

$$
\begin{aligned}
1500\,w_1 + w_0 &= 300 \\
2000\,w_1 + w_0 &= 410 \\
1200\,w_1 + w_0 &= 270 \\
1800\,w_1 + w_0 &= 350
\end{aligned}
$$
The problem is immediately clear. It is very unlikely that a single line can pass through all four of these points. Data is noisy. Instead of looking for a perfect solution, we look for the line that minimizes the overall error.
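To see this concretely, here is a small sketch that draws a line exactly through the first two data points and checks how it does on the other two. This is only an illustration of the mismatch, not part of the regression procedure itself.

```python
# Check that no single line passes through all four points: fit a line
# exactly through the first two points and evaluate it on the others.
points = [(1500, 300), (2000, 410), (1200, 270), (1800, 350)]

(x1, y1), (x2, y2) = points[0], points[1]
w1 = (y2 - y1) / (x2 - x1)   # slope through the first two points: 0.22
w0 = y1 - w1 * x1            # intercept: -30.0

for size, price in points:
    predicted = w1 * size + w0
    print(size, price, predicted)
    # 1200 sq. ft. is predicted at 234.0 (actual 270)
    # 1800 sq. ft. is predicted at 366.0 (actual 350)
```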
Finding the red line that best represents the relationship shown by the blue data points is the goal of linear regression.
This is where linear algebra provides an elegant and powerful way to represent the problem. As we saw in Chapter 4, a system of linear equations can be written in the matrix form $Ax = b$. Let's assemble our components.
The vector $b$ (or $y$ in this context) contains our target values, the house prices:

$$
b = \begin{bmatrix} 300 \\ 410 \\ 270 \\ 350 \end{bmatrix}
$$
The vector $x$ contains the unknown parameters we are trying to find: the intercept $w_0$ and the slope $w_1$:

$$
x = \begin{bmatrix} w_1 \\ w_0 \end{bmatrix}
$$
The matrix $A$ is the most interesting part. It's often called the "design matrix". Each row corresponds to a data point, and each column corresponds to a feature. Our equations are of the form $w_1 \cdot (\text{size}) + w_0 \cdot 1 = \text{price}$. So, our first column of features will be the house sizes, and our second column will represent the constant term for the intercept $w_0$. To make the math work, we fill this second column with ones:

$$
A = \begin{bmatrix} 1500 & 1 \\ 2000 & 1 \\ 1200 & 1 \\ 1800 & 1 \end{bmatrix}
$$
Wait, why did we add a column of ones? Let's check what happens when we perform the matrix-vector multiplication $Ax$:

$$
Ax = \begin{bmatrix} 1500 & 1 \\ 2000 & 1 \\ 1200 & 1 \\ 1800 & 1 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_0 \end{bmatrix}
= \begin{bmatrix} 1500\,w_1 + w_0 \\ 2000\,w_1 + w_0 \\ 1200\,w_1 + w_0 \\ 1800\,w_1 + w_0 \end{bmatrix}
$$
Note: We have reordered the columns in A and x here to match the (size) * w1 + 1 * w0 format, which is a common convention.
This multiplication perfectly reconstructs our original system of equations. Our entire problem can now be stated as finding the vector $x$ that makes $Ax$ as close as possible to $b$.
The full problem is expressed as:

$$
\begin{bmatrix} 1500 & 1 \\ 2000 & 1 \\ 1200 & 1 \\ 1800 & 1 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_0 \end{bmatrix}
\approx
\begin{bmatrix} 300 \\ 410 \\ 270 \\ 350 \end{bmatrix}
$$
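In code, this setup looks like the following NumPy sketch. The parameter values in `x` are arbitrary guesses, included only to show how the matrix-vector product reproduces every prediction at once.

```python
import numpy as np

# Design matrix A: first column is the feature (size), second column is
# all ones so that the intercept w0 is added to every prediction.
A = np.array([
    [1500.0, 1.0],
    [2000.0, 1.0],
    [1200.0, 1.0],
    [1800.0, 1.0],
])
b = np.array([300.0, 410.0, 270.0, 350.0])  # target prices, in $1000s

# A candidate parameter vector x = [w1, w0]; the values here are arbitrary.
x = np.array([0.2, 10.0])

# A @ x computes w1 * size + w0 for every house at once.
predictions = A @ x
print(predictions)        # [310. 410. 250. 370.]
print(predictions - b)    # residuals: how far each prediction misses its target
```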
This representation is much more than just a notational shortcut. It has several significant advantages:

- **Compactness.** Four equations (or four million) collapse into the single expression $Ax \approx b$.
- **Generality.** Adding more features, such as the number of bedrooms, just means adding more columns to $A$ and more entries to $x$; the formulation itself does not change.
- **Computation.** Once the problem is in matrix form, we can hand it to highly optimized linear algebra routines instead of manipulating equations one at a time.
We have now framed linear regression as a linear algebra problem. We are looking for the vector $x$ that best solves the equation $Ax = b$. Since a perfect solution rarely exists, the next step, which we won't solve here, involves using matrix operations like the transpose and inverse to find the vector $x$ that minimizes the error between the predicted values ($Ax$) and the actual values ($b$). This method is formally known as solving the "Normal Equations" and is a direct application of the tools you've learned in previous chapters.
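As a preview of where this leads, here is a short NumPy sketch of that final step. Treat it as a peek at the mechanics rather than a derivation: it solves the normal equations directly and also uses NumPy's built-in least-squares routine for comparison.

```python
import numpy as np

A = np.array([
    [1500.0, 1.0],
    [2000.0, 1.0],
    [1200.0, 1.0],
    [1800.0, 1.0],
])
b = np.array([300.0, 410.0, 270.0, 350.0])

# Normal equations: solve (A^T A) x = A^T b for the least-squares parameters.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# np.linalg.lstsq solves the same minimization in a more numerically robust way.
x_lstsq, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal)   # [w1, w0] that minimize the squared error
print(x_lstsq)    # agrees with the normal-equations solution
```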