Alright, let's dive into the first major type of supervised learning task: regression. If you recall from Chapter 1, supervised learning involves training a model on data where we already know the "correct" answers. In regression, that correct answer is a continuous numerical value.
Think about predicting quantities, amounts, or sizes. The goal isn't to assign an item to a category (like "spam" or "not spam"), but rather to estimate a specific number along a continuous scale.
A machine learning task is considered a regression problem when the primary objective is to predict a continuous output variable. This output variable is often called the target, dependent variable, or label. The inputs used to make this prediction are called features, independent variables, or predictors.
The core idea is to learn a mapping or relationship between the input features and the continuous output target. We use a dataset containing examples where we have both the input features and the known, correct output value. The machine learning algorithm studies these examples to figure out the underlying pattern, allowing it to predict the output for new, unseen input features.
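To make this concrete, here is a minimal sketch (with made-up numbers) of what such a labeled dataset looks like in Python: each example pairs input features with a known continuous target value.

```python
# Each row of X holds the input features for one example
# (here, hypothetical house sizes in sq ft and bedroom counts).
X = [
    [1200, 2],
    [1800, 3],
    [2400, 4],
]

# y holds the known continuous target for each example
# (hypothetical sale prices in dollars).
y = [250_000, 340_000, 430_000]

# A regression algorithm studies these (X, y) pairs to learn a
# mapping from features to target, so it can predict the target
# for new feature values it has never seen.
```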
Consider these common scenarios where regression is applied:

- Predicting the selling price of a house from its size, location, and age.
- Forecasting tomorrow's temperature from historical weather readings.
- Estimating a person's expected salary from their experience and education.
- Predicting how many units of a product will sell next month.
In all these cases, the variable we want to predict can, in principle, take on any value within a given range.
Imagine you have data relating the size of different houses to their prices. If you plot this data, you might see something like this:
A scatter plot showing hypothetical data points for house size versus price. Larger houses tend to have higher prices, though the relationship is not a perfectly straight line.
In a regression task, our goal is to learn a model, often visualized as a line or curve, that best captures the trend in this data. For instance, we might try to fit a straight line through these points:
The same scatter plot with a potential regression line added. The line represents the model's learned relationship between size and price.
This fitted line represents the model's understanding of the relationship. Once we have this line (or a more complex model), we can use it to predict the price for a new house size we haven't seen before. For example, if someone asks about a 1500 sq ft house, we can find the corresponding point on the line to estimate its price.
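As an illustration, here is a minimal sketch of this fit-and-predict workflow using scikit-learn's LinearRegression. The house sizes and prices below are made-up numbers chosen only to show the mechanics, not real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: house sizes (sq ft) and sale prices ($).
# scikit-learn expects features as a 2D array: one row per example.
sizes = np.array([[850], [1100], [1400], [1750], [2100], [2500]])
prices = np.array([155_000, 195_000, 255_000, 300_000, 360_000, 420_000])

# Fit a straight line through the points.
model = LinearRegression()
model.fit(sizes, prices)

# Use the fitted line to estimate the price of a 1500 sq ft house.
predicted = model.predict(np.array([[1500]]))
print(f"Estimated price for 1500 sq ft: ${predicted[0]:,.0f}")
```

The call to `model.predict` simply evaluates the learned line at the new input, just as we did visually by reading the point off the line.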
Regression fits squarely into the supervised learning framework we discussed earlier:

- Training data: a set of examples, each with input features and a known continuous target value.
- Learning: an algorithm adjusts the model to capture the relationship between features and target.
- Prediction: the trained model outputs an estimated value for new, unseen inputs.
Formally, we are trying to learn a function, let's call it $f$, that takes the input features $X$ and produces an output value $\hat{y}$ (read "y-hat"), which is our prediction for the true value $y$. The goal is to make $\hat{y}$ as close to $y$ as possible, on average. So, we seek:

$$\hat{y} = f(X)$$

such that $\hat{y}$ is a good approximation of the true $y$.
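For a simple linear model, $f$ may be nothing more than a line with a learned slope and intercept. The sketch below uses assumed, made-up values for those parameters purely to show how a prediction $\hat{y}$ is produced and compared against a true $y$; in practice the parameters are learned from training data.

```python
# Assumed parameters for illustration only: in a real workflow these
# would be learned from the training data, not chosen by hand.
w = 150.0      # slope: dollars per sq ft
b = 30_000.0   # intercept: base price in dollars

def f(x):
    """A simple learned mapping: predict price (y-hat) from size."""
    return w * x + b

y_hat = f(1500)       # the model's prediction for a 1500 sq ft house
y_true = 250_000      # hypothetical true price, for comparison
print(f"prediction: {y_hat:,.0f}, error: {y_true - y_hat:,.0f}")
```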
The key takeaway is that regression deals with predicting quantities on a continuous spectrum. It's about answering "how much?" or "how many?" rather than "which category?". This distinguishes it from classification, the other main type of supervised learning, which focuses on assigning discrete labels. In the following sections, we will explore Linear Regression, a foundational algorithm for tackling these kinds of prediction problems.