Linear regression rests on a few fundamental concepts: the model structure (y = β₀ + β₁x + ε), least squares estimation, coefficient interpretation, and evaluation metrics. Implementing these models in Python is common practice, and two popular libraries dominate the area: statsmodels and scikit-learn. Each has its strengths, and understanding both provides a versatile toolkit for regression tasks.
statsmodels is often favored for its emphasis on statistical inference and detailed model analysis, mirroring outputs commonly found in statistical software like R. scikit-learn is part of a broader machine learning ecosystem, offering a consistent API for various algorithms, making it suitable for building predictive pipelines.
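Before turning to either library, it helps to see what least squares actually computes. Below is a minimal sketch of the closed-form estimates in plain NumPy: the slope is β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)², and the intercept is β̂₀ = ȳ − β̂₁x̄ (the data values here are arbitrary, for illustration only).
import numpy as np
# Toy data (arbitrary values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
# Closed-form least squares estimates for y = beta_0 + beta_1 * x
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()
print(f"beta_0 = {beta_0:.4f}, beta_1 = {beta_1:.4f}")
Both libraries below arrive at exactly these estimates; they differ in what they report around them.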
statsmodels provides comprehensive tools for estimating many different statistical models, as well as for conducting statistical tests and data exploration. Its formula.api submodule allows specifying models using a string-based formula, similar to R, which can be quite intuitive.
Let's illustrate with a simple linear regression example. Assume we have a pandas DataFrame df with columns 'TargetVariable' (y) and 'PredictorVariable' (x).
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
# Sample Data (replace with your actual data)
data = {'PredictorVariable': np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
'TargetVariable': np.array([2.5, 3.1, 4.5, 5.2, 6.8, 7.1, 8.5, 9.2, 10.1, 11.5])}
df = pd.DataFrame(data)
# Define the model using the formula syntax
# 'TargetVariable ~ PredictorVariable' means model TargetVariable as a function of PredictorVariable
# statsmodels automatically includes an intercept
model_formula = 'TargetVariable ~ PredictorVariable'
model = smf.ols(formula=model_formula, data=df)
# Fit the model to the data
results = model.fit()
# Print the comprehensive summary
print(results.summary())
The results.summary() method provides a rich output, containing several important pieces of information:
- coef: The estimated values for the intercept (β̂₀) and the coefficient for PredictorVariable (β̂₁).
- std err: The standard error of each estimate, indicating its variability.
- t: The t-statistic, used to test whether each coefficient is significantly different from zero.
- P>|t|: The p-value associated with the t-statistic. A small p-value (typically < 0.05) suggests the coefficient is statistically significant.
- [0.025 0.975]: The 95% confidence interval for the coefficient.
This detailed summary is a major advantage of statsmodels when your primary goal is understanding the statistical properties and significance of the relationships within your data.
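If you need individual quantities programmatically rather than as a printed table, the fitted results object exposes them as attributes. A short sketch, continuing from the results object above (attribute names as in the statsmodels API):
# Access individual quantities from the fitted results object
print(results.params)      # estimated coefficients (Intercept, PredictorVariable)
print(results.bse)         # standard errors of the estimates
print(results.pvalues)     # p-values for each coefficient
print(results.conf_int())  # 95% confidence intervals
print(results.rsquared)    # R-squared of the fit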
scikit-learn provides a more streamlined interface, consistent across different machine learning models, which is advantageous when integrating regression into larger workflows or comparing it with other predictive algorithms.
Here's how to perform simple linear regression using scikit-learn:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Sample Data (same as before)
data = {'PredictorVariable': np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
'TargetVariable': np.array([2.5, 3.1, 4.5, 5.2, 6.8, 7.1, 8.5, 9.2, 10.1, 11.5])}
df = pd.DataFrame(data)
# Prepare the data
# X needs to be a 2D array (or DataFrame); y is a 1D array (or Series)
X = df[['PredictorVariable']] # Note the double brackets to keep it as a DataFrame
y = df['TargetVariable']
# Initialize the model
sk_model = LinearRegression()
# Fit the model
sk_model.fit(X, y)
# Get the estimated coefficients
intercept = sk_model.intercept_ # Beta_0
coefficient = sk_model.coef_[0] # Beta_1 (coef_ is an array; take its first element)
print(f"Intercept (beta_0): {intercept:.4f}")
print(f"Coefficient (beta_1): {coefficient:.4f}")
# Make predictions
y_pred = sk_model.predict(X)
# Evaluate the model
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R-squared (R2): {r2:.4f}")
Important points about the scikit-learn approach:
- scikit-learn generally expects the features (X) to be a 2D array-like structure (e.g., a DataFrame or a NumPy array of shape (n_samples, n_features)) and the target (y) to be a 1D array-like structure (e.g., a pandas Series or a NumPy array of shape (n_samples,)).
- The fit() method trains the model, and predict() generates predictions on new data. Model parameters such as coefficients are stored as attributes of the fitted model object (e.g., sk_model.intercept_, sk_model.coef_).
- Evaluation metrics come from the sklearn.metrics module and require the true target values and the model's predictions.
- scikit-learn provides less built-in statistical summary output than statsmodels. Its strength lies in its consistent API, ease of integration into machine learning pipelines (including data preprocessing, cross-validation, and model selection; see the sketch below), and support for a wide range of algorithms.
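To make the pipeline point concrete, here is a minimal sketch (the pipeline composition is illustrative; it reuses X and y from the example above) that standardizes the feature and reports cross-validated R-squared scores:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
# Chain preprocessing and regression into a single estimator
pipeline = make_pipeline(StandardScaler(), LinearRegression())
# 5-fold cross-validated R-squared, computed on held-out folds
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
print(f"Cross-validated R2 scores: {cv_scores}")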
Both libraries easily accommodate multiple predictor variables.
statsmodels: Simply add more variables to the formula string.
# Assuming df has columns 'Predictor1', 'Predictor2', 'TargetVariable'
model_formula_multi = 'TargetVariable ~ Predictor1 + Predictor2'
multi_model_sm = smf.ols(formula=model_formula_multi, data=df).fit()
# print(multi_model_sm.summary()) # Summary includes coefficients for Predictor1 and Predictor2
scikit-learn: Include more columns in the feature DataFrame X.
# Assuming df has columns 'Predictor1', 'Predictor2', 'TargetVariable'
X_multi = df[['Predictor1', 'Predictor2']]
y = df['TargetVariable']
multi_model_sk = LinearRegression()
multi_model_sk.fit(X_multi, y)
# Coefficients will be an array with values for Predictor1 and Predictor2
# print(multi_model_sk.coef_)
# print(multi_model_sk.intercept_)
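Since the Predictor1 and Predictor2 columns above are hypothetical, here is a self-contained sketch with synthetic data (column names and coefficients chosen arbitrarily) showing that the two libraries recover the same multi-predictor estimates:
# Synthetic data with two predictors (arbitrary true coefficients)
rng = np.random.default_rng(0)
df_multi = pd.DataFrame({'Predictor1': rng.normal(size=50),
                         'Predictor2': rng.normal(size=50)})
df_multi['TargetVariable'] = (1.0 + 2.0 * df_multi['Predictor1']
                              - 0.5 * df_multi['Predictor2']
                              + rng.normal(scale=0.1, size=50))
# Fit with both libraries
sm_fit = smf.ols('TargetVariable ~ Predictor1 + Predictor2', data=df_multi).fit()
sk_fit = LinearRegression().fit(df_multi[['Predictor1', 'Predictor2']],
                                df_multi['TargetVariable'])
# Both report the same estimates, up to numerical precision
print(sm_fit.params.values)             # [intercept, coef_1, coef_2]
print(sk_fit.intercept_, sk_fit.coef_)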
In practice, the choice comes down to your goal:
- Use statsmodels when your focus is statistical inference, hypothesis testing for coefficients, understanding confidence intervals, and detailed diagnostic summaries. It excels at explaining the relationships within your current data.
- Use scikit-learn when your primary goal is prediction, integrating regression into a machine learning pipeline, performing cross-validation, or comparing regression with other predictive models using a consistent interface.
Familiarity with both libraries provides flexibility. You might use statsmodels for initial exploratory analysis and model understanding, then switch to scikit-learn for building a deployable predictive model. The hands-on practical section that follows will give you direct experience applying these tools to a dataset.