Okay, let's put the theory into practice. In this section, we'll walk through the complete process of building, training, and evaluating a simple linear regression model using Scikit-learn. We'll use a real dataset to predict a continuous target variable, applying the concepts and tools you've learned about so far in this chapter.
First, we import the necessary libraries: NumPy and Pandas for numerical work and data handling (Scikit-learn datasets return NumPy arrays inside a Bunch object by default, but loading with as_frame=True gives us Pandas objects), Scikit-learn for the dataset, the model, the splitting function, and the metrics, and Plotly for visualization.
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import plotly.graph_objects as go # For visualization
We'll use the California Housing dataset, a popular dataset for regression tasks available directly within Scikit-learn. The goal is to predict the median house value for California districts, given various features based on census data.
# Load the dataset
california = fetch_california_housing(as_frame=True)
X = california.data
y = california.target
# Display some information about the data
print("Features (X):")
print(X.head())
print("\nTarget (y) - Median House Value:")
print(y.head())
print("\nDataset Description:")
print(california.DESCR[:500] + "...") # Print first 500 chars of description
The output shows the first few rows of our features (like median income, house age, average rooms) and the target variable (median house value). The description provides context about the features and the prediction task.
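It can also be worth a quick look at the data's dimensions and summary statistics before modeling. A minimal sketch using the Pandas objects loaded above:

# Quick sanity checks: shape, summary statistics, missing values
print(f"Shape of X: {X.shape}")
print(X.describe())
print(X.isna().sum())  # this dataset ships with no missing values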
Before training, it's standard practice to split the dataset into two parts: a training set and a testing set. The model learns patterns from the training set. We then evaluate its performance on the unseen testing set to get an unbiased estimate of how well it generalizes to new data. We use Scikit-learn's train_test_split function for this.
# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")
We set test_size=0.2 to allocate 20% of the data for testing and use random_state for reproducibility, ensuring we get the same split each time the code runs.
Now we instantiate the LinearRegression model and fit it using our training data (X_train and y_train). The fit method is where the model learns the relationship between the features and the target variable, calculating the optimal coefficients for the linear equation.
# Create a Linear Regression model instance
model = LinearRegression()
# Train the model using the training data
model.fit(X_train, y_train)
print("Model training complete.")
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
After fitting, the model has learned the intercept and the coefficients for each feature. These represent the linear relationship identified in the training data.
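Because the coefficient array follows the column order of X, pairing each value with its feature name makes the learned equation easier to read. A small sketch using the Pandas import from earlier:

# Pair each learned coefficient with its feature name for readability
coef_table = pd.Series(model.coef_, index=X.columns, name="coefficient")
print(coef_table.sort_values())

Each coefficient is the model's estimated change in median house value for a one-unit increase in that feature, holding the other features fixed.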
With the model trained, we can now use it to make predictions on new, unseen data. We use the predict method on our test set (X_test).
# Make predictions on the test set
y_pred = model.predict(X_test)
# Display the first 5 predictions and actual values
print("First 5 Predictions:", y_pred[:5])
print("First 5 Actual Values:", y_test[:5].values)
The model outputs an array of predicted median house values based on the features in X_test. We can compare these predictions (y_pred) to the actual known values (y_test).
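Since predict accepts any input with the same eight feature columns, you can also score a single district. A minimal sketch reusing one row of the test set:

# Predict for one district; double brackets keep it a one-row DataFrame
single_district = X_test.iloc[[0]]
print(model.predict(single_district))  # array containing one predicted value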
How good are these predictions? We need quantitative measures to assess the model's performance. We'll use the metrics discussed earlier: Mean Absolute Error (MAE), Mean Squared Error (MSE), and the R-squared (R2) score, along with the Root Mean Squared Error (RMSE), the square root of the MSE. These are calculated by comparing the predictions (y_pred) with the actual values (y_test).
# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R-squared (R2 Score): {r2:.4f}")
Let's interpret these results:
- MAE and RMSE are expressed in the same units as the target, which this dataset records in hundreds of thousands of dollars; an MAE of 0.5, for instance, would mean predictions are off by roughly $50,000 on average.
- MSE squares each error before averaging, so it penalizes large mistakes more heavily than MAE; taking its square root (RMSE) brings the value back to the original scale, making it easier to interpret.
- An R2 score of around 0.59 means the model explains roughly 59% of the variance in median house values: a reasonable fit for a simple linear model, but with clear room for improvement.
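To put these numbers in context, it can help to compare against a trivial baseline that always predicts the mean of the training targets; by construction, such a model scores an R2 near zero. A short sketch using Scikit-learn's DummyRegressor:

from sklearn.dummy import DummyRegressor

# Baseline that always predicts the mean of the training targets
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
print(f"Baseline MAE: {mean_absolute_error(y_test, baseline_pred):.4f}")
print(f"Baseline R2: {r2_score(y_test, baseline_pred):.4f}")  # near 0 by construction

Any useful model should comfortably beat this baseline on both metrics.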
A scatter plot comparing the actual values (y_test) against the predicted values (y_pred) provides a visual assessment of the model's performance. For a good model, we expect the points to cluster closely around the diagonal line where predicted equals actual.
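The plotly.graph_objects module imported at the start can produce such a plot. The snippet below is a minimal sketch; the 500-point sample and the styling choices are illustrative assumptions rather than part of the original workflow:

# Plot a random sample of test points against a dashed y = x reference line
rng = np.random.default_rng(42)
idx = rng.choice(len(y_test), size=500, replace=False)
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=y_test.values[idx], y=y_pred[idx],
    mode="markers", name="Predictions", marker=dict(opacity=0.5)
))
fig.add_trace(go.Scatter(
    x=[y_test.min(), y_test.max()], y=[y_test.min(), y_test.max()],
    mode="lines", name="Perfect prediction (y = x)",
    line=dict(color="red", dash="dash")
))
fig.update_layout(
    xaxis_title="Actual Median House Value",
    yaxis_title="Predicted Median House Value",
)
fig.show()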
Scatter plot comparing predicted median house values (y-axis) against the actual values (x-axis) for a sample of the test data. The dashed red line represents a perfect prediction (y=x). Points closer to this line indicate better predictions.
This visualization confirms the metrics. While there's a clear positive correlation, the points are somewhat scattered around the ideal line, reflecting the calculated R2 score of around 0.59 and the non-zero error metrics.
You have now successfully built, trained, predicted with, and evaluated a linear regression model using Scikit-learn. This practical workflow forms the basis for tackling many regression problems. In later chapters, we'll explore more sophisticated models and techniques for improving performance.