Now that you understand the basic idea behind linear regression, let's see how to put it into practice using Scikit-learn. Scikit-learn provides a consistent and straightforward interface for using various machine learning models, including linear regression. The primary tool for this is the LinearRegression estimator found within the sklearn.linear_model module.
Scikit-learn's design revolves around the concept of "Estimators". An estimator is any object that learns from data; it can be a classification, regression, or clustering algorithm, or a transformer that extracts useful features. All estimators follow a consistent pattern:

1. Import the estimator class (here, LinearRegression).
2. Instantiate the class, setting any hyperparameters you need (LinearRegression has few).
3. Arrange your data into a feature matrix X and a target vector y. X is typically a 2D NumPy array or Pandas DataFrame (shape: [n_samples, n_features]), and y is a 1D NumPy array or Pandas Series (shape: [n_samples]).
4. Fit the model to your data by calling the .fit(X, y) method. This is the step where the model learns from the data. For linear regression, fit calculates the optimal coefficients (β1, ..., βp) and the intercept (β0) by minimizing the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.
5. Make predictions on new data by calling the .predict(X_new) method. X_new should have the same number of features as the training data X.

Fitting a LinearRegression Model
Let's walk through a simple example. We'll generate some synthetic data that follows a roughly linear pattern and then fit a LinearRegression
model to it.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# 1. Generate some sample data
# Make results reproducible
np.random.seed(42)
X = 2 * np.random.rand(100, 1) # Feature (needs to be 2D for Scikit-learn)
y = 4 + 3 * X + np.random.randn(100, 1) # Target variable with some noise
# 2. Import the estimator
# Already done above: from sklearn.linear_model import LinearRegression
# 3. Instantiate the estimator
model = LinearRegression()
# 4. Fit the model to the data
# Scikit-learn expects X (features) and y (target)
model.fit(X, y)
# 5. Inspect the learned parameters
# The intercept (beta_0) is stored in .intercept_
# The coefficients (beta_1, ..., beta_p) are stored in .coef_
print(f"Intercept (beta_0): {model.intercept_[0]:.4f}")
print(f"Coefficient (beta_1): {model.coef_[0][0]:.4f}")
# 6. Make predictions on new data
# Let's predict values for X = 0 and X = 2
X_new = np.array([[0], [2]]) # New data points must be 2D array
y_pred = model.predict(X_new)
print(f"\nPredictions for X_new = [[0], [2]]:")
print(y_pred)
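# Optional sanity check (an addition to this walkthrough, not required):
# predict() simply evaluates the learned linear function, so the same
# values can be computed directly from the fitted parameters.
manual_pred = model.intercept_ + X_new @ model.coef_.T
print(manual_pred)  # matches y_pred above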
# Optional: Visualize the results
plt.figure(figsize=(8, 5))
plt.scatter(X, y, alpha=0.7, label='Original Data')
plt.plot(X_new, y_pred, "r-", linewidth=2, label='Fitted Regression Line')
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.title("Linear Regression Fit")
plt.legend()
plt.grid(True)
plt.show()
Running this code will:

1. Generate 100 sample points where the underlying relationship between the feature and the target y is approximately 4 + 3x.
2. Instantiate a LinearRegression model.
3. Train the model with model.fit(X, y). During this step, Scikit-learn applies the ordinary least squares method to find the line that best fits the data points by minimizing the sum of the squared differences between the actual y values and the predicted values (ŷ) from the line.
4. Print the learned intercept (model.intercept_) and coefficient (model.coef_). You should see values close to the original parameters (4 and 3) used to generate the data. The slight difference is due to the random noise we added.
5. Use model.predict() to calculate the predicted y values for two new input values (0 and 2).

[Figure: Scatter plot of the generated data points along with the linear regression line fitted by Scikit-learn.]
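Because fit solves an ordinary least squares problem, you can cross-check the learned parameters by solving that problem directly with NumPy. The following is a minimal sketch, assuming the X and y arrays from the example above are still in scope; it uses np.linalg.lstsq, NumPy's least-squares solver.

# Prepend a column of ones so the intercept is estimated as the first parameter
X_b = np.c_[np.ones((len(X), 1)), X]
# Solve min ||X_b @ theta - y||^2 directly
theta, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)
print(theta)  # first entry close to model.intercept_, second close to model.coef_

Both entries should match the intercept and coefficient reported by Scikit-learn, since LinearRegression solves the same least squares problem internally.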
The process shown above works exactly the same way for multiple linear regression, where you have more than one input feature. The only difference is that your X matrix will have more than one column (one column per feature); Scikit-learn's LinearRegression handles this automatically.
# Example with 2 features
X_multi = 2 * np.random.rand(100, 2) # Now X has 2 columns
# y = 4 + 3*X_1 + 5*X_2 + noise
y_multi = 4 + 3 * X_multi[:, 0] + 5 * X_multi[:, 1] + np.random.randn(100)
y_multi = y_multi.reshape(-1, 1) # Reshape y into a column vector so the indexing in the print statements below works
multi_model = LinearRegression()
multi_model.fit(X_multi, y_multi)
print(f"\nMultiple Regression:")
print(f"Intercept (beta_0): {multi_model.intercept_[0]:.4f}")
print(f"Coefficients (beta_1, beta_2): {multi_model.coef_[0]}")
# Prediction requires input with 2 features
X_multi_new = np.array([[0, 0], [2, 3]]) # Predict for [X1=0, X2=0] and [X1=2, X2=3]
y_multi_pred = multi_model.predict(X_multi_new)
print(f"\nPredictions for X_multi_new:")
print(y_multi_pred)
The .coef_ attribute will now contain an array with multiple values, one coefficient for each feature column in X_multi. The interpretation is similar: each coefficient represents the change in the target variable y for a one-unit change in the corresponding feature, assuming all other features are held constant.
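With several features, it is easy to lose track of which coefficient belongs to which column. A small sketch, assuming the multi_model fitted above and two hypothetical feature names of your choosing:

# Pair each learned coefficient with a feature name (names here are hypothetical)
feature_names = ["feature_1", "feature_2"]
for name, coef in zip(feature_names, multi_model.coef_[0]):
    print(f"{name}: {coef:.4f}")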
This consistent API makes it simple to apply linear regression, whether you have one feature or many. After fitting the model, the next important steps involve understanding what the learned coefficients mean and evaluating how well the model actually performs, which we will cover in the following sections.