This section walks through the complete process of building, training, and evaluating a linear regression model with Scikit-learn, using a real dataset to predict a continuous target variable.

Setting the Stage: Imports and Data

First, we import the necessary libraries: NumPy for numerical operations, Pandas for data manipulation (Scikit-learn datasets often return NumPy arrays or Bunch objects, but passing `as_frame=True` gives us Pandas objects), Scikit-learn for the dataset, model, splitting function, and metrics, and Plotly for visualization.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import plotly.graph_objects as go  # For visualization
```

We'll use the California Housing dataset, a popular dataset for regression tasks that Scikit-learn can fetch directly (it is downloaded and cached on first use). The goal is to predict the median house value for California districts, given various features based on census data.

```python
# Load the dataset
california = fetch_california_housing(as_frame=True)
X = california.data
y = california.target

# Display some information about the data
print("Features (X):")
print(X.head())
print("\nTarget (y) - Median House Value:")
print(y.head())
print("\nDataset Description:")
print(california.DESCR[:500] + "...")  # Print first 500 chars of description
```

The output shows the first few rows of our features (like median income, house age, average rooms) and the target variable (median house value). The description provides context about the features and the prediction task.

Splitting Data for Reliable Evaluation

Before training, it's standard practice to split the dataset into two parts: a training set and a testing set. The model learns patterns from the training set. We then evaluate its performance on the unseen testing set to get an unbiased estimate of how well it generalizes to new data.
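
To make the idea of a held-out split concrete, here is a minimal, library-free sketch of a seeded shuffle-and-slice split. It is not how Scikit-learn implements its splitting function; the helper name `simple_split` and the toy data are hypothetical, chosen only for illustration.

```python
import random

def simple_split(rows, test_fraction=0.2, seed=42):
    """Hypothetical helper: shuffle indices with a seeded RNG, then slice."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same ordering
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

toy_rows = list(range(10))  # toy stand-in for 10 samples
train_a, test_a = simple_split(toy_rows)
train_b, test_b = simple_split(toy_rows)

print(len(train_a), len(test_a))                  # 8 2
print(train_a == train_b and test_a == test_b)    # True: the seed fixes the split
```

Every sample ends up in exactly one of the two sets, and repeating the call with the same seed reproduces the same partition, which is the property a fixed random seed is meant to provide.
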
Scikit-learn's `train_test_split` function performs the split:

```python
# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")
```

We set `test_size=0.2` to allocate 20% of the data for testing and fix `random_state` for reproducibility, ensuring we get the same split each time the code runs.

Training the Linear Regression Model

Now we instantiate the `LinearRegression` model and fit it using our training data (`X_train` and `y_train`). The `fit` method is where the model learns the relationship between the features and the target variable, calculating the coefficients of the linear equation that minimize the sum of squared errors on the training data.

```python
# Create a Linear Regression model instance
model = LinearRegression()

# Train the model using the training data
model.fit(X_train, y_train)

print("Model training complete.")
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
```

After fitting, the model has learned the intercept and one coefficient per feature. These represent the linear relationship identified in the training data.

Making Predictions

With the model trained, we can now use it to make predictions on new, unseen data. We call the `predict` method on our test set (`X_test`).

```python
# Make predictions on the test set
y_pred = model.predict(X_test)

# Display the first 5 predictions and actual values
print("First 5 Predictions:", y_pred[:5])
print("First 5 Actual Values:", y_test[:5].values)
```

The model outputs an array of predicted median house values based on the features in `X_test`. We can compare these predictions (`y_pred`) to the actual known values (`y_test`).

Evaluating Model Performance

How good are these predictions? We need quantitative measures to assess the model's performance.
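
To show what such measures compute, the sketch below evaluates mean absolute error, mean squared error, root mean squared error, and $R^2$ from their standard definitions in plain Python. The helper name `regression_metrics` and the four toy values are hypothetical, chosen only so the arithmetic is easy to follow; they are not taken from the housing data.

```python
import math

def regression_metrics(y_true, y_hat):
    """Hypothetical helper: MAE, MSE, RMSE, and R^2 from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_hat)]
    mae = sum(abs(e) for e in errors) / n        # mean of absolute errors
    mse = sum(e * e for e in errors) / n         # mean of squared errors
    rmse = math.sqrt(mse)                        # back in the target's units
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2

# Toy values for illustration only (not from the housing data)
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 5.0]

toy_mae, toy_mse, toy_rmse, toy_r2 = regression_metrics(toy_true, toy_pred)
print(f"MAE={toy_mae:.4f} MSE={toy_mse:.4f} RMSE={toy_rmse:.4f} R2={toy_r2:.4f}")
# MAE=0.5000 MSE=0.3750 RMSE=0.6124 R2=0.7000
```

In this toy case the RMSE (about 0.61) is larger than the MAE (0.50) because the single error of 1.0 carries more weight once it is squared.
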
We'll use the metrics discussed earlier: Mean Absolute Error (MAE), Mean Squared Error (MSE), and the R-squared ($R^2$) score, along with the Root Mean Squared Error (RMSE) derived from the MSE. These are calculated by comparing the predictions (`y_pred`) with the actual values (`y_test`).

```python
# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # Calculate Root Mean Squared Error
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R-squared (R2 Score): {r2:.4f}")
```

Let's interpret these results:

- MAE: On average, the model's predictions are off by approximately 0.53 units of median house value. The target is expressed in units of \$100,000, so this is about \$53,000.
- MSE/RMSE: The RMSE is about 0.75 units (roughly \$75,000). Like MAE, it measures prediction error, but MSE (and thus RMSE) penalizes larger errors more heavily due to the squaring.
- $R^2$ score: An $R^2$ of about 0.58 indicates that approximately 58% of the variance in the median house values in the test set is explained by our model based on the features. An $R^2$ of 1 would be a perfect fit.

Visualizing Predictions vs. Actual Values

A scatter plot comparing the actual values (`y_test`) against the predicted values (`y_pred`) provides a visual assessment of the model's performance. For a good model, we expect the points to cluster closely around the diagonal line where predicted equals actual.

[Figure: Actual vs. Predicted Median House Values (x-axis: Actual Values; y-axis: Predicted Values)]

Scatter plot comparing predicted median house values (y-axis) against the actual values (x-axis) for a sample of the test data. The dashed red line represents a perfect prediction (y = x). Points closer to this line indicate better predictions.

This visualization is consistent with the metrics. While there's a clear positive correlation, the points are somewhat scattered around the ideal line, reflecting the calculated $R^2$ score of around 0.58 and the non-zero error metrics.

You have now built and trained a linear regression model with Scikit-learn, used it to make predictions, and evaluated the results. This practical workflow forms the basis for tackling many regression problems. In later chapters, we'll explore more sophisticated models and techniques for improving performance.
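
As a closing illustration of what a fitted linear model represents, the sketch below reproduces a prediction by hand: the intercept plus the sum of each coefficient times its feature value. The helper name `predict_linear` is hypothetical, and the intercept, coefficients, and feature values are made-up numbers for a two-feature model, not the values learned from the housing data.

```python
def predict_linear(intercept, coefficients, features):
    """Hypothetical helper: intercept + sum(coefficient_i * feature_i)."""
    if len(coefficients) != len(features):
        raise ValueError("Need exactly one coefficient per feature")
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Made-up parameters for a two-feature model (illustration only)
toy_intercept = 0.5
toy_coefficients = [0.4, -0.1]
toy_features = [3.0, 2.0]

toy_prediction = predict_linear(toy_intercept, toy_coefficients, toy_features)
print(round(toy_prediction, 4))  # evaluates 0.5 + 0.4*3.0 + (-0.1)*2.0
```

For a fitted Scikit-learn `LinearRegression`, the analogous quantities are `model.intercept_` and `model.coef_`, so the output of `model.predict` for a single row should agree with this sum up to floating-point rounding.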