In supervised machine learning, our goal is often to learn a mapping from input features to an output variable. When the output variable we are trying to predict is a continuous numerical value, we are dealing with a regression problem. This contrasts with classification problems, covered in the next chapter, where the goal is to predict a discrete category or label (like 'spam' vs 'not spam', or 'cat' vs 'dog').
Think of regression as trying to answer "how much?" or "what value?". Here are some typical examples where regression techniques are applied:
In each case, the target variable we aim to predict (price, units sold, kWh, credit score, yield) can take on a wide range of continuous values.
Mathematically, we can frame a regression problem as follows: We have a dataset consisting of n observations. Each observation i has a set of input features (also called independent variables or predictors), denoted as a vector Xi, and a corresponding continuous target variable (or dependent variable, response), denoted as yi.
Our objective is to learn a model, essentially a function f, that can approximate the relationship between the features X and the target y:
y≈f(X)The function f is learned from the training data. Once learned, the model can be used to predict the target value ynew for new, unseen feature inputs Xnew.
For simple cases, especially with only one or two features, we can visualize the relationship between features and the target. If we have a single feature x and a target y, we can create a scatter plot where each point represents an observation (xi,yi). The goal of a regression model, in this visual context, is often to find a line or curve that best fits the pattern of these points.
A scatter plot illustrating a potential relationship between a single feature (Study Hours) and a continuous target variable (Exam Score). A regression model would aim to capture this trend.
The challenge, and the purpose of algorithms like linear regression (which we'll explore next), is to find the specific parameters of the function f that best capture this underlying relationship based on the provided data. Scikit-learn provides efficient implementations of various algorithms designed to solve exactly these types of problems.
© 2025 ApX Machine Learning