Alright, let's put the theory of linear regression into practice. In the previous sections, we discussed the simple linear regression model y = β0 + β1x + ϵ, how to estimate the coefficients β0 (intercept) and β1 (slope) using the method of least squares, and how to evaluate the model using metrics like R² and Mean Squared Error (MSE). Now, we'll walk through fitting such a model to data and interpreting the results using Python.
We'll use common Python libraries: Pandas for data handling, Matplotlib/Seaborn or Plotly for visualization, and Statsmodels and Scikit-learn for building the regression model itself. Statsmodels often provides more detailed statistical summaries useful for inference, while Scikit-learn is widely used in machine learning pipelines for prediction tasks. We'll look at both.
First, ensure you have the necessary libraries installed. If not, you can typically install them using pip:
pip install pandas numpy statsmodels scikit-learn plotly
Now, let's import them and create some sample data. Imagine we have data tracking advertising spending (in thousands of dollars) and corresponding sales (in thousands of units).
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
# Generate synthetic data representing advertising spend and sales
np.random.seed(42) # for reproducible results
advertising_spend = np.random.rand(50) * 10 # Spending from 0 to 10 thousand dollars
# Sales = roughly 50 + 3*Spend + some noise
sales = 50 + 3 * advertising_spend + np.random.randn(50) * 5
# Create a Pandas DataFrame
data = pd.DataFrame({'AdvertisingSpend': advertising_spend, 'Sales': sales})
print(data.head())
# Output:
# AdvertisingSpend Sales
# 0 3.745401 61.800130
# 1 9.507143 78.739061
# 2 7.319939 70.486096
# 3 5.986585 68.583330
# 4 1.560186 54.389472
Before fitting a model, it's always a good idea to visualize the relationship between the variables. A scatter plot is perfect for this.
# Create scatter plot using Plotly Express
fig_scatter = px.scatter(data, x='AdvertisingSpend', y='Sales',
                         title='Sales vs. Advertising Spend',
                         labels={'AdvertisingSpend': 'Advertising Spend ($ Thousands)', 'Sales': 'Sales ($ Thousands)'},
                         template='plotly_white') # Use a clean template
# Enhance layout for web display
fig_scatter.update_layout(
    title_x=0.5, # Center title
    margin=dict(l=40, r=40, t=50, b=40), # Adjust margins
    width=600, # Set width
    height=400 # Set height
)
# Show plot (in a notebook/script) or convert to JSON for web embedding
# fig_scatter.show()
# For web embedding:
scatter_json = fig_scatter.to_json(pretty=False)
{"layout": {"template": {"layout": {"font": {"color": "#2a3f5f"}, "hoverlabel": {"align": "left"}, "hovermode": "closest", "paper_bgcolor": "white", "plot_bgcolor": "white", "colorway": ["#636efa", "#ef553b", "#00cc96", "#ab63fa", "#ffa15a", "#19d3f3", "#ff6692", "#b6e880", "#ff97ff", "#fecb52"], "colorscale": {"sequential": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}, "xaxis": {"gridcolor": "#EBF0F8", "linecolor": "#EBF0F8", "ticks": "", "title": {"standoff": 15}, "zerolinecolor": "#EBF0F8", "automargin": true}, "yaxis": {"gridcolor": "#EBF0F8", "linecolor": "#EBF0F8", "ticks": "", "title": {"standoff": 15}, "zerolinecolor": "#EBF0F8", "automargin": true}, "coloraxis": {"colorbar": {"ticks": ""}}, "shapedefaults": {"line": {"color": "#2a3f5f"}}, "annotationdefaults": {"arrowhead": 0, "arrowwidth": 1}, "bargap": 0.2}}, "data": [{"type": "scattergl"}]}, "title": {"text": "Sales vs. Advertising Spend", "x": 0.5}, "xaxis": {"title": {"text": "Advertising Spend ($ Thousands)"}, "anchor": "y", "domain": [0.0, 1.0]}, "yaxis": {"title": {"text": "Sales ($ Thousands)"}, "anchor": "x", "domain": [0.0, 1.0]}, "margin": {"l": 40, "r": 40, "t": 50, "b": 40}, "width": 600, "height": 400}, "data": [{"type": "scatter", "x": [3.745401188473625, 9.50714306409916, 7.319939418114051, 5.986584841970366, 1.5601864044243652, 1.5599452033620265, 0.5808361216819946, 8.66176145774935, 6.011150117432088, 7.080725777960457, 0.20584494295802448, 9.69909852161994, 8.324426408004217, 2.1233911067827616, 1.8182496720710063, 1.834045098534338, 3.042422429595377, 5.247564316322378, 4.3194501864211575, 2.912291401980419, 6.118528947223795, 1.3949386065204183, 2.9214464853521816, 3.66361835606538, 4.56069983752664, 7.85175961393095, 1.9967378215835904, 5.14234438405513, 8.504421146417164, 2.088767706800416, 1.3031570018846792, 8.12168728231102, 5.16167519491869, 8.87785748876001, 9.873558174800375, 4.633151506086349, 0.7190915198954935, 4.140022554841212, 2.918760687989861, 3.142989801916041, 3.868788534230207, 0.4470353984896789, 7.116790414754402, 8.990911162659834, 7.007119440238783, 5.624138381999047, 6.76067084021565, 3.883892685767925, 3.58645010809966, 9.085955020434004], "y": [61.80012969975683, 78.73906091351787, 70.48609571799993, 68.58333009576938, 54.38947220216001, 53.7772878291532, 50.13939085263499, 80.08221261610163, 64.49121061642705, 72.60906729040437, 49.45487334140803, 80.4395260200674, 75.03156091896634, 55.58780964481221, 54.354578935524774, 55.95421309121784, 54.6820349382979, 69.66634592969637, 62.38520226483979, 61.8559534886734, 67.94800296679476, 53.22476663315619, 60.03752608566032, 64.63280671927132, 59.689060281378535, 78.02727248338617, 49.72821196973287, 64.9080900854904, 73.9345471433459, 53.21330679777131, 53.73463325876348, 78.11090977187613, 67.0268285091743, 74.76335420341864, 78.98471012903892, 63.50039929373398, 52.82209933476746, 66.4780023052012, 56.45234436878183, 60.94211159472224, 61.87195122034747, 55.83363154871973, 71.0769800280368, 79.50963290837518, 69.58522348638993, 65.04416704668687, 70.5609155313511, 58.211290148369636, 59.78560343219218, 76.05471456487646], "mode": "markers", "marker": {"symbol": "circle", "size": 6}}]}
Scatter plot showing the relationship between Advertising Spend and Sales. A positive linear trend appears visible.
The scatter plot suggests a positive linear relationship: as advertising spend increases, sales tend to increase as well. This visual confirmation supports the use of a linear regression model.
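To put a number on this visual impression, we can also compute the Pearson correlation between the two columns before fitting anything. This is a small optional check using the standard Pandas corr method; for this synthetic data the value comes out around 0.94, consistent with the R² of roughly 0.89 reported below.
# Optional check: quantify the linear association seen in the scatter plot
correlation = data['AdvertisingSpend'].corr(data['Sales']) # Pearson correlation by default
print(f"Pearson correlation: {correlation:.3f}")
# Values close to +1 indicate a strong positive linear relationship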
Statsmodels provides a class, OLS (Ordinary Least Squares), that we can use. It requires us to explicitly add a constant term (the intercept β0) to our predictor variable(s).
# Prepare the data for Statsmodels
X = data['AdvertisingSpend']
y = data['Sales']
X = sm.add_constant(X) # Add an intercept term to the predictor
# Fit the OLS model
model_sm = sm.OLS(y, X)
results_sm = model_sm.fit()
# Print the model summary
print(results_sm.summary())
This summary output is rich in information:
OLS Regression Results
==============================================================================
Dep. Variable: Sales R-squared: 0.891
Model: OLS Adj. R-squared: 0.888
Method: Least Squares F-statistic: 391.6
Date: Wed, 15 May 2024 Prob (F-statistic): 1.33e-24
Time: 12:00:00 Log-Likelihood: -130.91
No. Observations: 50 AIC: 265.8
Df Residuals: 48 BIC: 269.6
Df Model: 1
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const 50.9489 0.865 58.928 0.000 49.211 52.687
AdvertisingSpend 2.9094 0.147 19.790 0.000 2.614 3.205
==============================================================================
Omnibus: 0.513 Durbin-Watson: 2.205
Prob(Omnibus): 0.774 Jarque-Bera (JB): 0.621
Skew: 0.178 Prob(JB): 0.733
Kurtosis: 2.595 Cond. No. 11.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpreting the Statsmodels Summary:

- Dep. Variable: Confirms 'Sales' is our dependent variable.
- Model: OLS (Ordinary Least Squares).
- R-squared: 0.891. This means approximately 89.1% of the variance in Sales can be explained by AdvertisingSpend using this linear model. This is a good fit for our synthetic data.
- Adj. R-squared: 0.888. Similar to R-squared but adjusted for the number of predictors. Useful when comparing models with different numbers of predictors.
- F-statistic: 391.6. Tests the overall significance of the model.
- Prob (F-statistic): 1.33e-24. This is the p-value associated with the F-statistic. A very small value (typically < 0.05) indicates that the model as a whole is statistically significant. Our model is highly significant.
- coef:
  - const: 50.95. This is the estimated intercept (β̂0). It suggests that if advertising spend were zero, the expected sales would be approximately 50.95 thousand units.
  - AdvertisingSpend: 2.91. This is the estimated slope (β̂1). It indicates that for each additional thousand dollars spent on advertising, sales are expected to increase by approximately 2.91 thousand units.
- std err: Standard errors of the coefficient estimates, measuring their precision.
- t: t-statistic values for each coefficient, testing the null hypothesis that the coefficient is zero.
- P>|t|: p-values associated with the t-statistics. The very small p-values (0.000) for both the constant and AdvertisingSpend suggest that both are statistically significant predictors in the model.
- [0.025 0.975]: 95% confidence intervals for the coefficients. We are 95% confident that the true intercept lies between 49.21 and 52.69, and the true slope for AdvertisingSpend lies between 2.61 and 3.21.

The bottom part of the summary includes diagnostic tests related to the model assumptions (like normality and independence of residuals), which we'll touch upon later.
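If you need these numbers programmatically rather than reading them off the printed table, the fitted results object exposes them directly. The attributes used below (params, pvalues, conf_int, rsquared) are standard parts of the Statsmodels results API.
# Pull key quantities out of the fitted results object
print(results_sm.params) # Estimated coefficients: const and AdvertisingSpend
print(results_sm.pvalues) # p-values for each coefficient
print(results_sm.conf_int()) # 95% confidence intervals (the default level)
print(f"R-squared: {results_sm.rsquared:.3f}")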
Now let's perform the same task using Scikit-learn, which is geared more toward prediction tasks than detailed statistical inference.
# Prepare data for Scikit-learn
# X needs to be a 2D array (or DataFrame)
X_sk = data[['AdvertisingSpend']] # Note the double brackets
y_sk = data['Sales']
# Initialize and fit the model
model_sk = LinearRegression()
model_sk.fit(X_sk, y_sk)
# Get the coefficients
intercept = model_sk.intercept_ # Beta_0
coefficient = model_sk.coef_[0] # Beta_1
print(f"Scikit-learn Intercept (beta_0): {intercept:.4f}")
print(f"Scikit-learn Coefficient (beta_1): {coefficient:.4f}")
# Output:
# Scikit-learn Intercept (beta_0): 50.9489
# Scikit-learn Coefficient (beta_1): 2.9094
# Make predictions on the training data
y_pred_sk = model_sk.predict(X_sk)
# Calculate evaluation metrics
mse = mean_squared_error(y_sk, y_pred_sk)
r2 = r2_score(y_sk, y_pred_sk) # Same as model_sk.score(X_sk, y_sk)
print(f"Scikit-learn Mean Squared Error (MSE): {mse:.4f}")
print(f"Scikit-learn R-squared (R2): {r2:.4f}")
# Output:
# Scikit-learn Mean Squared Error (MSE): 24.2842
# Scikit-learn R-squared (R2): 0.8906
You'll notice the coefficients (β̂0 ≈ 50.95, β̂1 ≈ 2.91) and the R² value (0.891) are essentially identical to those obtained from Statsmodels. Scikit-learn provides easy access to the coefficients and common evaluation metrics like MSE and R², but doesn't automatically generate the detailed statistical summary that Statsmodels does.
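With the coefficients in hand, predicting sales for a new spending level is just a matter of applying the line ŷ = β̂0 + β̂1·x. As a quick sketch, assume a hypothetical new spend of 7.5 thousand dollars; with the coefficients above this works out to roughly 50.95 + 2.91 × 7.5 ≈ 72.8 thousand units.
# Predict sales for a hypothetical new advertising spend of 7.5 thousand dollars
new_spend = pd.DataFrame({'AdvertisingSpend': [7.5]})
pred_manual = intercept + coefficient * 7.5 # Apply the line equation directly
pred_sklearn = model_sk.predict(new_spend)[0] # Same result via Scikit-learn
print(f"Predicted sales at a 7.5k spend: {pred_sklearn:.2f} thousand units")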
Let's visualize how well our fitted line represents the data. We can use the coefficients we found to plot the regression line on top of the scatter plot.
# Create the scatter plot again
fig_line = px.scatter(data, x='AdvertisingSpend', y='Sales',
                      title='Sales vs. Advertising Spend with Fitted Line',
                      labels={'AdvertisingSpend': 'Advertising Spend ($ Thousands)', 'Sales': 'Sales ($ Thousands)'},
                      template='plotly_white')
# Add the regression line
# Use coefficients from either model (they are the same)
# Line equation: y = intercept + coefficient * x
fig_line.add_trace(go.Scatter(x=data['AdvertisingSpend'], y=y_pred_sk, # Use predictions as y values for the line
                              mode='lines',
                              name='Fitted Line',
                              line=dict(color='#fa5252', width=2))) # Use a red color from the palette
# Enhance layout
fig_line.update_layout(
    title_x=0.5,
    margin=dict(l=40, r=40, t=50, b=40),
    width=600,
    height=400,
    showlegend=True
)
# Convert to JSON for web embedding
line_json = fig_line.to_json(pretty=False)
{"layout": {"template": {"layout": {"font": {"color": "#2a3f5f"}, "hoverlabel": {"align": "left"}, "hovermode": "closest", "paper_bgcolor": "white", "plot_bgcolor": "white", "colorway": ["#636efa", "#ef553b", "#00cc96", "#ab63fa", "#ffa15a", "#19d3f3", "#ff6692", "#b6e880", "#ff97ff", "#fecb52"], "colorscale": {"sequential": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}, "xaxis": {"gridcolor": "#EBF0F8", "linecolor": "#EBF0F8", "ticks": "", "title": {"standoff": 15}, "zerolinecolor": "#EBF0F8", "automargin": true}, "yaxis": {"gridcolor": "#EBF0F8", "linecolor": "#EBF0F8", "ticks": "", "title": {"standoff": 15}, "zerolinecolor": "#EBF0F8", "automargin": true}, "coloraxis": {"colorbar": {"ticks": ""}}, "shapedefaults": {"line": {"color": "#2a3f5f"}}, "annotationdefaults": {"arrowhead": 0, "arrowwidth": 1}, "bargap": 0.2}}, "data": [{"type": "scattergl"}]}, "title": {"text": "Sales vs. Advertising Spend with Fitted Line", "x": 0.5}, "xaxis": {"title": {"text": "Advertising Spend ($ Thousands)"}, "anchor": "y", "domain": [0.0, 1.0]}, "yaxis": {"title": {"text": "Sales ($ Thousands)"}, "anchor": "x", "domain": [0.0, 1.0]}, "margin": {"l": 40, "r": 40, "t": 50, "b": 40}, "width": 600, "height": 400, "showlegend": true}, "data": [{"type": "scatter", "x": [3.745401188473625, 9.50714306409916, 7.319939418114051, 5.986584841970366, 1.5601864044243652, 1.5599452033620265, 0.5808361216819946, 8.66176145774935, 6.011150117432088, 7.080725777960457, 0.20584494295802448, 9.69909852161994, 8.324426408004217, 2.1233911067827616, 1.8182496720710063, 1.834045098534338, 3.042422429595377, 5.247564316322378, 4.3194501864211575, 2.912291401980419, 6.118528947223795, 1.3949386065204183, 2.9214464853521816, 3.66361835606538, 4.56069983752664, 7.85175961393095, 1.9967378215835904, 5.14234438405513, 8.504421146417164, 2.088767706800416, 1.3031570018846792, 8.12168728231102, 5.16167519491869, 8.87785748876001, 9.873558174800375, 4.633151506086349, 0.7190915198954935, 4.140022554841212, 2.918760687989861, 3.142989801916041, 3.868788534230207, 0.4470353984896789, 7.116790414754402, 8.990911162659834, 7.007119440238783, 5.624138381999047, 6.76067084021565, 3.883892685767925, 3.58645010809966, 9.085955020434004], "y": [61.80012969975683, 78.73906091351787, 70.48609571799993, 68.58333009576938, 54.38947220216001, 53.7772878291532, 50.13939085263499, 80.08221261610163, 64.49121061642705, 72.60906729040437, 49.45487334140803, 80.4395260200674, 75.03156091896634, 55.58780964481221, 54.354578935524774, 55.95421309121784, 54.6820349382979, 69.66634592969637, 62.38520226483979, 61.8559534886734, 67.94800296679476, 53.22476663315619, 60.03752608566032, 64.63280671927132, 59.689060281378535, 78.02727248338617, 49.72821196973287, 64.9080900854904, 73.9345471433459, 53.21330679777131, 53.73463325876348, 78.11090977187613, 67.0268285091743, 74.76335420341864, 78.98471012903892, 63.50039929373398, 52.82209933476746, 66.4780023052012, 56.45234436878183, 60.94211159472224, 61.87195122034747, 55.83363154871973, 71.0769800280368, 79.50963290837518, 69.58522348638993, 65.04416704668687, 70.5609155313511, 58.211290148369636, 59.78560343219218, 76.05471456487646], "mode": "markers", "marker": {"symbol": "circle", "size": 6}, "name": "Sales"}, {"type": "scatter", "x": [3.745401188473625, 
9.50714306409916, 7.319939418114051, 5.986584841970366, 1.5601864044243652, 1.5599452033620265, 0.5808361216819946, 8.66176145774935, 6.011150117432088, 7.080725777960457, 0.20584494295802448, 9.69909852161994, 8.324426408004217, 2.1233911067827616, 1.8182496720710063, 1.834045098534338, 3.042422429595377, 5.247564316322378, 4.3194501864211575, 2.912291401980419, 6.118528947223795, 1.3949386065204183, 2.9214464853521816, 3.66361835606538, 4.56069983752664, 7.85175961393095, 1.9967378215835904, 5.14234438405513, 8.504421146417164, 2.088767706800416, 1.3031570018846792, 8.12168728231102, 5.16167519491869, 8.87785748876001, 9.873558174800375, 4.633151506086349, 0.7190915198954935, 4.140022554841212, 2.918760687989861, 3.142989801916041, 3.868788534230207, 0.4470353984896789, 7.116790414754402, 8.990911162659834, 7.007119440238783, 5.624138381999047, 6.76067084021565, 3.883892685767925, 3.58645010809966, 9.085955020434004], "y": [61.85135113945768, 78.66111389286195, 72.22367396958712, 68.34536419720863, 55.48622553168913, 55.47956737456606, 52.63624878441603, 76.19520694074588, 68.41821113724004, 71.53464346313656, 51.54927987360797, 79.22077031648876, 75.19940379418376, 57.123833531469575, 56.23593427727789, 56.28226815266064, 59.78356344453186, 66.19082202430424, 63.48383083595413, 59.403221284528345, 68.7342961570318, 54.99994883531345, 59.42984618399993, 61.5901398129431, 64.1912202726002, 73.80305988816436, 56.75294356504344, 65.87749209909856, 75.71359066509951, 56.99831046545852, 54.73139371256026, 74.60601503929507, 65.9329888924196, 76.81768932432615, 79.72698837323453, 64.4043782109791, 53.03570568773341, 62.97517962249437, 59.4196840711268, 60.08552933527426, 62.17983480020782, 52.24702042271916, 71.6396573068655, 77.14719120888663, 71.3161019887193, 67.41130045823364, 70.64629440655784, 62.2315176255097, 61.36550976062345, 77.4195263461566}]}]}
Scatter plot of Sales vs. Advertising Spend with the OLS regression line overlaid. The line appears to capture the central trend of the data well.
As mentioned earlier, linear regression relies on several assumptions for the results (especially the p-values and confidence intervals) to be reliable. These include linearity of the relationship, independence of the errors, constant error variance (homoscedasticity), and approximate normality of the error terms.
A common way to check homoscedasticity (constant variance) and linearity visually is by plotting the residuals (the differences y − ŷ between actual and predicted values) against the fitted values (ŷ).
# Calculate residuals (using Statsmodels results)
residuals = results_sm.resid
fitted_values = results_sm.fittedvalues
# Create residual plot using Plotly
fig_resid = go.Figure()
fig_resid.add_trace(go.Scatter(x=fitted_values, y=residuals, mode='markers',
                               marker=dict(color='#1c7ed6', size=6), # Blue color
                               name='Residuals'))
# Add a horizontal line at zero
fig_resid.add_hline(y=0, line_width=2, line_dash="dash", line_color="#868e96") # Gray dash line
fig_resid.update_layout(
    title='Residuals vs. Fitted Values',
    xaxis_title='Fitted Values (Predicted Sales)',
    yaxis_title='Residuals',
    template='plotly_white',
    title_x=0.5,
    margin=dict(l=40, r=40, t=50, b=40),
    width=600,
    height=400,
    showlegend=False
)
# Convert to JSON for web embedding
resid_json = fig_resid.to_json(pretty=False)
{"layout": {"template": {"layout": {"font": {"color": "#2a3f5f"}, "hoverlabel": {"align": "left"}, "hovermode": "closest", "paper_bgcolor": "white", "plot_bgcolor": "white", "colorway": ["#636efa", "#ef553b", "#00cc96", "#ab63fa", "#ffa15a", "#19d3f3", "#ff6692", "#b6e880", "#ff97ff", "#fecb52"], "colorscale": {"sequential": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}, "xaxis": {"gridcolor": "#EBF0F8", "linecolor": "#EBF0F8", "ticks": "", "title": {"standoff": 15}, "zerolinecolor": "#EBF0F8", "automargin": true}, "yaxis": {"gridcolor": "#EBF0F8", "linecolor": "#EBF0F8", "ticks": "", "title": {"standoff": 15}, "zerolinecolor": "#EBF0F8", "automargin": true}, "coloraxis": {"colorbar": {"ticks": ""}}, "shapedefaults": {"line": {"color": "#2a3f5f"}}, "annotationdefaults": {"arrowhead": 0, "arrowwidth": 1}, "bargap": 0.2}}, "data": [{"type": "scattergl"}]}, "title": {"text": "Residuals vs. Fitted Values", "x": 0.5}, "xaxis": {"title": {"text": "Fitted Values (Predicted Sales)"}, "anchor": "y", "domain": [0.0, 1.0]}, "yaxis": {"title": {"text": "Residuals"}, "anchor": "x", "domain": [0.0, 1.0]}, "shapes": [{"type": "line", "y0": 0, "y1": 0, "xref": "paper", "x0": 0, "x1": 1, "line": {"width": 2, "dash": "dash", "color": "#868e96"}}], "margin": {"l": 40, "r": 40, "t": 50, "b": 40}, "width": 600, "height": 400, "showlegend": false}, "data": [{"type": "scatter", "x": [61.85135113945768, 78.66111389286195, 72.22367396958712, 68.34536419720863, 55.48622553168913, 55.47956737456606, 52.63624878441603, 76.19520694074588, 68.41821113724004, 71.53464346313656, 51.54927987360797, 79.22077031648876, 75.19940379418376, 57.123833531469575, 56.23593427727789, 56.28226815266064, 59.78356344453186, 66.19082202430424, 63.48383083595413, 59.403221284528345, 68.7342961570318, 54.99994883531345, 59.42984618399993, 61.5901398129431, 64.1912202726002, 73.80305988816436, 56.75294356504344, 65.87749209909856, 75.71359066509951, 56.99831046545852, 54.73139371256026, 74.60601503929507, 65.9329888924196, 76.81768932432615, 79.72698837323453, 64.4043782109791, 53.03570568773341, 62.97517962249437, 59.4196840711268, 60.08552933527426, 62.17983480020782, 52.24702042271916, 71.6396573068655, 77.14719120888663, 71.3161019887193, 67.41130045823364, 70.64629440655784, 62.2315176255097, 61.36550976062345, 77.4195263461566], "y": [-0.05122143970085383, 0.07794702065592098, -1.7375782515871908, 0.23796589856074146, -1.0967533295291218, -1.702280175412857, -2.496857931781039, 3.887005675355747, -3.9270005208129914, 1.0744238272678156, -2.094406532199939, 1.2187557035786475, -0.1678428752174259, -1.5360238866573665, -1.8813553417531166, -0.3280550614427996, -5.10152850623396, 3.4755239053921316, -1.0986285711143368, 2.452732204145054, -0.7862931902370428, -1.775182202157264, 0.6076798996603935, 3.0426668963282184, -4.502160001221672, 4.224212595197305, -7.024731595310572, -0.9693960136081207, -1.779043521753595, -3.78500366768721, -0.996760453796781, 3.504894732581057, 1.093839616754701, -2.0543351209075143, -0.7422782441956112, -0.9039790002451247, -0.2136063529659469, 3.502822682706834, -2.967339702344969, 0.8565822594479808, -0.3078835798603523, 3.5866111259998894, -0.5626772788286977, 2.3624416994885507, -1.7308785023293724, -2.367133411546776, -0.08537887519999984, 
-4.020227477140065, -1.5799063284312698, -1.3648117812801389], "mode": "markers", "marker": {"color": "#1c7ed6", "size": 6}, "name": "Residuals"}]}
Plot of residuals (actual Sales - predicted Sales) versus the fitted (predicted) Sales values. Ideally, points should scatter randomly around the horizontal line at zero with no discernible pattern.
In an ideal residual plot, the points should be randomly scattered around the horizontal line at zero, showing no clear pattern (like a curve or a funnel shape). Our plot looks reasonably random, suggesting the linearity and homoscedasticity assumptions might hold. A funnel shape (variance increasing or decreasing with fitted values) would indicate heteroscedasticity. A curved pattern would suggest the linear model might not be the best fit.
Formal tests and other plots (like Q-Q plots for normality) are often used for a more rigorous assessment of these assumptions.
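As a brief illustration of what such checks can look like (a sketch, not a required step, and assuming Matplotlib is installed for the Q-Q plot), the Breusch-Pagan test for heteroscedasticity and a Q-Q plot of the residuals are both available through Statsmodels:
# Illustrative assumption checks using Statsmodels (optional)
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test: the null hypothesis is constant error variance (homoscedasticity)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(results_sm.resid, results_sm.model.exog)
print(f"Breusch-Pagan p-value: {bp_pvalue:.3f}") # A large p-value gives no evidence of heteroscedasticity

# Q-Q plot: residuals should fall close to the reference line if they are roughly normal
sm.qqplot(results_sm.resid, line='45', fit=True)
plt.show()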
In this hands-on section, we've walked through the process of:

- Creating and visualizing a small dataset of advertising spend and sales.
- Using Statsmodels (sm.OLS) and Scikit-learn (LinearRegression) to estimate the regression coefficients using ordinary least squares.
- Interpreting the estimated intercept, slope, R², and MSE, along with the detailed statistical summary from Statsmodels.
- Overlaying the fitted regression line on the data and examining a residual plot to check the model assumptions.

This practical application demonstrates how to translate the theoretical concepts of simple linear regression into actionable code and analysis. Understanding these steps is fundamental before moving on to more complex models like multiple linear regression, where we use more than one predictor variable (see the sketch below for a preview).
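As a small preview of that next step, the same Statsmodels workflow carries over unchanged when there is more than one predictor. The sketch below adds a purely hypothetical 'PromoSpend' column just to show the mechanics; it is not part of the dataset used above.
# Sketch only: the same OLS workflow with two predictors (hypothetical 'PromoSpend' column)
data_multi = data.copy()
data_multi['PromoSpend'] = np.random.rand(50) * 5 # Made-up second predictor for illustration
X_multi = sm.add_constant(data_multi[['AdvertisingSpend', 'PromoSpend']])
results_multi = sm.OLS(data_multi['Sales'], X_multi).fit()
print(results_multi.params) # Now reports an intercept plus one slope per predictor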