Having explored the theory behind linear regression, including the model structure ($y = \beta_0 + \beta_1 x + \epsilon$), least squares estimation, coefficient interpretation, and evaluation metrics, we now turn to implementing these models in Python. Two popular libraries dominate this space: statsmodels and scikit-learn. Each has its strengths, and understanding both provides a versatile toolkit for regression tasks.
statsmodels is often favored for its emphasis on statistical inference and detailed model analysis, mirroring outputs commonly found in statistical software like R. scikit-learn is part of a broader machine learning ecosystem, offering a consistent API for various algorithms, making it suitable for building predictive pipelines.
statsmodels provides comprehensive tools for estimating many different statistical models, as well as for conducting statistical tests and data exploration. Its formula.api submodule allows specifying models using a string-based formula, similar to R, which can be quite intuitive.
Let's illustrate with a simple linear regression example. Assume we have a pandas DataFrame df with columns 'TargetVariable' (y) and 'PredictorVariable' (x).
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
# Sample Data (replace with your actual data)
data = {'PredictorVariable': np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
        'TargetVariable': np.array([2.5, 3.1, 4.5, 5.2, 6.8, 7.1, 8.5, 9.2, 10.1, 11.5])}
df = pd.DataFrame(data)
# Define the model using the formula syntax
# 'TargetVariable ~ PredictorVariable' means model TargetVariable as a function of PredictorVariable
# statsmodels automatically includes an intercept
model_formula = 'TargetVariable ~ PredictorVariable'
model = smf.ols(formula=model_formula, data=df)
# Fit the model to the data
results = model.fit()
# Print the comprehensive summary
print(results.summary())
The results.summary() method provides a rich output, containing several important pieces of information:
- coef: The estimated values for the intercept ($\hat{\beta}_0$) and the coefficient for PredictorVariable ($\hat{\beta}_1$).
- std err: The standard error of the estimates, indicating their variability.
- t: The t-statistic, used to test if each coefficient is significantly different from zero.
- P>|t|: The p-value associated with the t-statistic. A small p-value (typically < 0.05) suggests the coefficient is statistically significant.
- [0.025 0.975]: The 95% confidence interval for the coefficient.

This detailed summary is a major advantage of statsmodels when your primary goal is understanding the statistical properties and significance of the relationships within your data.
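The values shown in the summary table can also be read directly from the fitted results object, which is convenient when you want to use them in further code instead of reading printed output. The sketch below refits the same sample data and collects the columns into one DataFrame; the names coef_table and conf_int are illustrative choices, not part of statsmodels.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same sample data as in the example above
df = pd.DataFrame({
    "PredictorVariable": np.arange(1, 11),
    "TargetVariable": [2.5, 3.1, 4.5, 5.2, 6.8, 7.1, 8.5, 9.2, 10.1, 11.5],
})
results = smf.ols("TargetVariable ~ PredictorVariable", data=df).fit()

# conf_int() returns a DataFrame whose columns 0 and 1 hold the lower and upper bounds
conf_int = results.conf_int(alpha=0.05)

# Rebuild the coefficient table from the individual result attributes
coef_table = pd.DataFrame({
    "coef": results.params,      # estimated intercept and slope
    "std err": results.bse,      # standard errors of the estimates
    "t": results.tvalues,        # t-statistics (coef / std err)
    "P>|t|": results.pvalues,    # two-sided p-values
    "[0.025": conf_int[0],       # lower bound of the 95% interval
    "0.975]": conf_int[1],       # upper bound of the 95% interval
})

print(coef_table.round(4))
print(f"R-squared: {results.rsquared:.4f}")
```

Each attribute is a pandas Series indexed by term name ('Intercept' and 'PredictorVariable' here), so a single value can be looked up with, for example, results.pvalues['PredictorVariable'].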
scikit-learn provides a more streamlined interface, consistent across different machine learning models, which is advantageous when integrating regression into larger workflows or comparing it with other predictive algorithms.
Here's how to perform simple linear regression using scikit-learn:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Sample Data (same as before)
data = {'PredictorVariable': np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
        'TargetVariable': np.array([2.5, 3.1, 4.5, 5.2, 6.8, 7.1, 8.5, 9.2, 10.1, 11.5])}
df = pd.DataFrame(data)
# Prepare the data
# X needs to be a 2D array (or DataFrame); y is a 1D array (or Series)
X = df[['PredictorVariable']] # Note the double brackets to keep it as a DataFrame
y = df['TargetVariable']
# Initialize the model
sk_model = LinearRegression()
# Fit the model
sk_model.fit(X, y)
# Get the estimated coefficients
intercept = sk_model.intercept_ # Beta_0
coefficient = sk_model.coef_[0] # Beta_1 (it's an array)
print(f"Intercept (beta_0): {intercept:.4f}")
print(f"Coefficient (beta_1): {coefficient:.4f}")
# Make predictions
y_pred = sk_model.predict(X)
# Evaluate the model
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R-squared (R2): {r2:.4f}")
Key points about the scikit-learn approach:
- scikit-learn generally expects the features (X) to be a 2D array-like structure (e.g., a DataFrame or a NumPy array of shape (n_samples, n_features)) and the target (y) to be a 1D array-like structure (e.g., a pandas Series or a NumPy array of shape (n_samples,)).
- The fit() method trains the model, and predict() generates predictions on new data. Model parameters like coefficients are stored as attributes of the fitted model object (e.g., sk_model.intercept_, sk_model.coef_).
- Evaluation metrics such as MSE and R-squared are computed with functions from the sklearn.metrics module, which require the true target values and the model's predictions.
- scikit-learn provides less built-in statistical summary output compared to statsmodels. Its strength lies in its consistent API, ease of integration into machine learning pipelines (including data preprocessing, cross-validation, and model selection), and support for a vast range of algorithms beyond regression.
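To make these points concrete, here is a minimal sketch that reshapes a single feature into the required 2D form, chains a preprocessing step with the regression in a pipeline, cross-validates it, and predicts for new inputs. The choice of StandardScaler, five shuffled folds, random_state=0, and the new x values 11 and 12 are illustrative assumptions rather than recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same sample values as before, held in plain NumPy arrays
x = np.arange(1, 11, dtype=float)
y = np.array([2.5, 3.1, 4.5, 5.2, 6.8, 7.1, 8.5, 9.2, 10.1, 11.5])

# A single feature must still be 2D: shape (n_samples, 1), not (n_samples,)
X = x.reshape(-1, 1)
print(X.shape, y.shape)  # (10, 1) (10,)

# Chain preprocessing and regression into a single estimator
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# 5-fold cross-validation; scikit-learn reports negated MSE so that higher is better
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(pipeline, X, y, cv=cv, scoring="neg_mean_squared_error")
cv_mse = -neg_mse
print(f"Cross-validated MSE per fold: {np.round(cv_mse, 4)}")

# Fit on all the data, then predict for predictor values not seen in training
pipeline.fit(X, y)
X_new = np.array([[11.0], [12.0]])
y_new = pipeline.predict(X_new)
print(f"Predictions for x = 11 and 12: {np.round(y_new, 4)}")
```

Scaling does not change the predictions of an ordinary least squares fit; it appears here only to show how a preprocessing step slots into the same fit() and predict() calls.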
Both libraries easily accommodate multiple predictor variables.
Statsmodels: Simply add more variables to the formula string.
# Assuming df has columns 'Predictor1', 'Predictor2', 'TargetVariable'
model_formula_multi = 'TargetVariable ~ Predictor1 + Predictor2'
multi_model_sm = smf.ols(formula=model_formula_multi, data=df).fit()
# print(multi_model_sm.summary()) # Summary includes coefficients for Predictor1 and Predictor2
Scikit-learn: Include more columns in the feature DataFrame X.
# Assuming df has columns 'Predictor1', 'Predictor2', 'TargetVariable'
X_multi = df[['Predictor1', 'Predictor2']]
y = df['TargetVariable']
multi_model_sk = LinearRegression()
multi_model_sk.fit(X_multi, y)
# Coefficients will be an array with values for Predictor1 and Predictor2
# print(multi_model_sk.coef_)
# print(multi_model_sk.intercept_)
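The two snippets above assume a DataFrame that already contains Predictor1 and Predictor2. The following self-contained sketch generates synthetic data so that both versions can be run end to end and their estimates compared. The data-generating coefficients (3.0, 2.0, -1.5), the noise level, the sample size, and the seed are arbitrary values chosen for illustration, and df_multi, sm_fit, and sk_fit are hypothetical names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only: true intercept 3.0, slopes 2.0 and -1.5
rng = np.random.default_rng(42)
n = 200
df_multi = pd.DataFrame({
    "Predictor1": rng.normal(size=n),
    "Predictor2": rng.normal(size=n),
})
noise = rng.normal(scale=0.5, size=n)
df_multi["TargetVariable"] = (
    3.0 + 2.0 * df_multi["Predictor1"] - 1.5 * df_multi["Predictor2"] + noise
)

# statsmodels: predictors are added to the formula string
sm_fit = smf.ols("TargetVariable ~ Predictor1 + Predictor2", data=df_multi).fit()

# scikit-learn: predictors are added as columns of the feature DataFrame
X_multi = df_multi[["Predictor1", "Predictor2"]]
sk_fit = LinearRegression().fit(X_multi, df_multi["TargetVariable"])

print(sm_fit.params)
print(sk_fit.intercept_, sk_fit.coef_)

# Both libraries solve the same least squares problem, so the estimates
# should agree up to floating-point precision
sm_coefs = sm_fit.params[["Predictor1", "Predictor2"]].to_numpy()
print(np.allclose(sm_coefs, sk_fit.coef_))
print(np.isclose(sm_fit.params["Intercept"], sk_fit.intercept_))
```

Note that the order of sk_fit.coef_ follows the column order of X_multi, whereas sm_fit.params is labeled by term name, which is why the statsmodels coefficients are selected by name before comparing.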
- Use statsmodels when your focus is on statistical inference, hypothesis testing for coefficients, understanding confidence intervals, and getting detailed diagnostic summaries. It excels at explaining the relationships within your current data.
- Use scikit-learn when your primary goal is prediction, integrating regression into a machine learning pipeline, performing cross-validation, or comparing regression with other predictive models using a consistent interface.

Familiarity with both libraries provides flexibility. You might use statsmodels for initial exploratory analysis and model understanding, then switch to scikit-learn for building a deployable predictive model. The hands-on practical section that follows will give you direct experience applying these tools to a dataset.
© 2025 ApX Machine Learning