A common goal in supervised machine learning is to learn a mapping from input features to an output variable. When the output variable we are trying to predict is a continuous numerical value, we are dealing with a **regression** problem. This contrasts with classification problems, covered in the next chapter, where the goal is to predict a discrete category or label (like 'spam' vs 'not spam', or 'cat' vs 'dog').

Think of regression as trying to answer "how much?" or "what value?". Here are some typical examples where regression techniques are applied:

- **Predicting Housing Prices:** Given features like square footage, number of bedrooms, and location, predict the selling price of a house.
- **Forecasting Demand:** Estimate the number of units of a product that will be sold next month based on past sales data, advertising spend, and seasonality.
- **Estimating Energy Consumption:** Predict a building's electricity usage based on factors like outside temperature, time of day, and occupancy.
- **Assessing Financial Risk:** Calculate a credit score (a continuous value) for a loan applicant based on their financial history.
- **Predicting Crop Yield:** Estimate the yield of a crop based on weather patterns, soil type, and amount of fertilizer used.

In each case, the target variable we aim to predict (price, units sold, kWh, credit score, yield) can take on a wide range of continuous values.

## Core Components of a Regression Problem

Mathematically, we can frame a regression problem as follows. We have a dataset consisting of $n$ observations. Each observation $i$ has a set of input features (also called independent variables or predictors), denoted as a vector $X_i$, and a corresponding continuous target variable (also called the dependent variable or response), denoted as $y_i$.

Our objective is to learn a model, essentially a function $f$, that approximates the relationship between the features $X$ and the target $y$:

$$ y \approx f(X) $$

The function $f$ is learned from the training data. Once learned, the model can be used to predict the target value $y_{\text{new}}$ for new, unseen feature inputs $X_{\text{new}}$.

## Visualizing Regression

For simple cases, especially with only one or two features, we can visualize the relationship between features and the target. If we have a single feature $x$ and a target $y$, we can create a scatter plot where each point represents an observation $(x_i, y_i)$. The goal of a regression model, in this visual context, is often to find a line or curve that best fits the pattern of these points.

*Figure: "Example Regression Data". A scatter plot of a single feature (Study Hours, x-axis) against a continuous target variable (Exam Score, y-axis), illustrating a potential relationship between the two. A regression model would aim to capture this trend.*

The challenge, and the purpose of algorithms like linear regression (which we'll explore next), is to find the specific parameters of the function $f$ that best capture this underlying relationship based on the provided data. Scikit-learn provides efficient implementations of various algorithms designed to solve exactly these types of problems.
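To make the $y \approx f(X)$ workflow concrete, here is a minimal sketch that fits a scikit-learn model to the study-hours data from the scatter plot above. Using `LinearRegression` here is just a preview of the next section; any scikit-learn regressor follows the same `fit`/`predict` pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The study-hours data from the scatter plot above.
# scikit-learn expects X as a 2D array of shape (n_samples, n_features),
# hence the reshape even though there is only one feature.
X = np.array([1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5,
              6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10]).reshape(-1, 1)
y = np.array([45, 50, 55, 58, 62, 65, 68, 70, 75, 78,
              80, 83, 85, 88, 90, 92, 94, 96, 98])

# Learn f from the training data...
model = LinearRegression()
model.fit(X, y)

# ...then predict y_new for new, unseen feature inputs X_new.
X_new = np.array([[4.2], [7.8]])
print(model.predict(X_new))
```

Note the shape convention: even with a single feature, `X` must be two-dimensional (one row per observation, one column per feature), while the target `y` is a one-dimensional array.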
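Continuing the sketch, we can recreate the scatter plot from the figure and overlay the model's fitted line with matplotlib (this assumes the `X`, `y`, and `model` variables defined above):

```python
import matplotlib.pyplot as plt

# Plot the observations, then draw the model's predictions as a line.
plt.scatter(X, y, label="Observations")
plt.plot(X, model.predict(X), color="red", label="Fitted model")
plt.xlabel("Feature (e.g., Study Hours)")
plt.ylabel("Target (e.g., Exam Score)")
plt.title("Example Regression Data")
plt.legend()
plt.show()
```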