The simple linear regression model, $y = \beta_0 + \beta_1 x + \epsilon$, involves estimating its coefficients \beta_0 (intercept) and \beta_1 (slope) using the method of least squares. Model evaluation is typically done using metrics such as $R^2$ and Mean Squared Error (MSE). We will now fit such a model to data and interpret the results using Python.We'll use common Python libraries: Pandas for data handling, Matplotlib/Seaborn or Plotly for visualization, and Statsmodels and Scikit-learn for building the regression model itself. Statsmodels often provides more detailed statistical summaries useful for inference, while Scikit-learn is widely used in machine learning pipelines for prediction tasks. We'll look at both.Setting Up the Environment and DataFirst, ensure you have the necessary libraries installed. If not, you can typically install them using pip:pip install pandas numpy statsmodels scikit-learn plotlyNow, let's import them and create some sample data. Imagine we have data tracking advertising spending (in thousands of dollars) and corresponding sales (in thousands of units).import pandas as pd import numpy as np import plotly.express as px import plotly.graph_objects as go from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error, r2_score import statsmodels.api as sm # Generate synthetic data for reproducibility np.random.seed(42) # for reproducible results advertising_spend = np.random.rand(50) * 10 # Spending from 0 to 10 thousand dollars # Sales = roughly 50 + 3*Spend + some noise sales = 50 + 3 * advertising_spend + np.random.randn(50) * 5 # Create a Pandas DataFrame data = pd.DataFrame({'AdvertisingSpend': advertising_spend, 'Sales': sales}) print(data.head()) # Output: # AdvertisingSpend Sales # 0 3.745401 61.800130 # 1 9.507143 78.739061 # 2 7.319939 70.486096 # 3 5.986585 68.583330 # 4 1.560186 54.389472Exploratory Data VisualizationBefore fitting a model, it's always a good idea to visualize the relationship between the variables. A scatter plot is perfect for this.# Create scatter plot using Plotly Express fig_scatter = px.scatter(data, x='AdvertisingSpend', y='Sales', title='Sales vs. Advertising Spend', labels={'AdvertisingSpend': 'Advertising Spend ($ Thousands)', 'Sales': 'Sales ($ Thousands)'}, template='plotly_white') # Use a clean template # Enhance layout for web display fig_scatter.update_layout( title_x=0.5, # Center title margin=dict(l=40, r=40, t=50, b=40), # Adjust margins width=600, # Set width height=400 # Set height ) # Show plot (in a notebook/script) or convert to JSON for web embedding # fig_scatter.show() # For web embedding: scatter_json = fig_scatter.to_json(pretty=False){"layout": {"title": {"text": "Sales vs. Advertising Spend", "x": 0.5}, "xaxis": {"title": {"text": "Advertising Spend ($ Thousands)"}}, "yaxis": {"title": {"text": "Sales ($ Thousands)"}}, "margin": {"l": 40, "r": 40, "t": 50, "b": 40}, "width": 600, "height": 400, "plot_bgcolor": "white", "paper_bgcolor": "white", "font": {"color": "#2a3f5f"}}, "data": [{"type": "scatter", "x": [3.75, 9.51, 7.32, 5.99, 1.56, 1.56, 0.58, 8.66, 6.01, 7.08, 0.21, 9.70, 8.32, 2.12, 1.82, 1.83, 3.04, 5.25, 4.32, 2.91, 6.12, 1.39, 2.92, 3.66, 4.56, 7.85, 2.00, 5.14, 8.50, 2.09, 1.30, 8.12, 5.16, 8.88, 9.87, 4.63, 0.72, 4.14, 2.92, 3.14, 3.87, 0.45, 7.12, 8.99, 7.01, 5.62, 6.76, 3.88, 3.59, 9.09], "y": [61.80, 78.74, 70.49, 68.58, 54.39, 53.78, 50.14, 80.08, 64.49, 72.61, 49.45, 80.44, 75.03, 55.59, 54.35, 55.95, 54.68, 69.67, 62.39, 61.86, 67.95, 53.22, 60.04, 64.63, 59.69, 78.03, 49.73, 64.91, 73.93, 53.21, 53.73, 78.11, 67.03, 74.76, 78.98, 63.50, 52.82, 66.48, 56.45, 60.94, 61.87, 55.83, 71.08, 79.51, 69.59, 65.04, 70.56, 58.21, 59.79, 76.05], "mode": "markers", "marker": {"symbol": "circle", "size": 6}}]}Scatter plot showing the relationship between Advertising Spend and Sales. A positive linear trend appears visible.The scatter plot suggests a positive linear relationship: as advertising spend increases, sales tend to increase as well. This visual confirmation supports the use of a linear regression model.Fitting the Linear Regression Model with StatsmodelsStatsmodels provides a class OLS (Ordinary Least Squares) that we can use. It requires us to explicitly add a constant term (the intercept $\beta_0$) to our predictor variable(s).# Prepare the data for Statsmodels X = data['AdvertisingSpend'] y = data['Sales'] X = sm.add_constant(X) # Add an intercept term to the predictor # Fit the OLS model model_sm = sm.OLS(y, X) results_sm = model_sm.fit() # Print the model summary print(results_sm.summary())This summary output is rich in information: OLS Regression Results ============================================================================== Dep. Variable: Sales R-squared: 0.891 Model: OLS Adj. R-squared: 0.888 Method: Least Squares F-statistic: 391.6 Date: Wed, 15 May 2024 Prob (F-statistic): 1.33e-24 Time: 12:00:00 Log-Likelihood: -130.91 No. Observations: 50 AIC: 265.8 Df Residuals: 48 BIC: 269.6 Df Model: 1 Covariance Type: nonrobust ==================================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------ const 50.9489 0.865 58.928 0.000 49.211 52.687 AdvertisingSpend 2.9094 0.147 19.790 0.000 2.614 3.205 ============================================================================== Omnibus: 0.513 Durbin-Watson: 2.205 Prob(Omnibus): 0.774 Jarque-Bera (JB): 0.621 Skew: 0.178 Prob(JB): 0.733 Kurtosis: 2.595 Cond. No. 11.7 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.Interpreting the Statsmodels Summary:Dep. Variable: Confirms 'Sales' is our dependent variable.Model: OLS (Ordinary Least Squares).R-squared: 0.891. This means approximately 89.1% of the variance in Sales can be explained by AdvertisingSpend using this linear model. This is a good fit for our synthetic data.Adj. R-squared: 0.888. Similar to R-squared but adjusted for the number of predictors. Useful when comparing models with different numbers of predictors.F-statistic: 391.6. Tests the overall significance of the model.Prob (F-statistic): 1.33e-24. This is the p-value associated with the F-statistic. A very small value (typically < 0.05) indicates that the model as a whole is statistically significant. Our model is highly significant.coef:const: 50.95. This is the estimated intercept ($\hat{\beta}_0$). It suggests that if advertising spend were zero, the expected sales would be approximately 50.95 thousand units.AdvertisingSpend: 2.91. This is the estimated slope ($\hat{\beta}_1$). It indicates that for each additional thousand dollars spent on advertising, sales are expected to increase by approximately 2.91 thousand units.std err: Standard errors of the coefficient estimates, measuring their precision.t: t-statistic values for each coefficient, testing the null hypothesis that the coefficient is zero.P>|t|: p-values associated with the t-statistics. The very small p-values (0.000) for both the constant and AdvertisingSpend suggest that both are statistically significant predictors in the model.[0.025 0.975]: 95% confidence intervals for the coefficients. We are 95% confident that the true intercept lies between 49.21 and 52.69, and the true slope for AdvertisingSpend lies between 2.61 and 3.21.The bottom part of the summary includes diagnostic tests related to the model assumptions (like normality and independence of residuals), which we'll touch upon later.Fitting the Linear Regression Model with Scikit-learnNow let's perform the same task using Scikit-learn. It's more focused on the prediction aspect.# Prepare data for Scikit-learn # X needs to be a 2D array (or DataFrame) X_sk = data[['AdvertisingSpend']] # Note the double brackets y_sk = data['Sales'] # Initialize and fit the model model_sk = LinearRegression() model_sk.fit(X_sk, y_sk) # Get the coefficients intercept = model_sk.intercept_ # Beta_0 coefficient = model_sk.coef_[0] # Beta_1 print(f"Scikit-learn Intercept (beta_0): {intercept:.4f}") print(f"Scikit-learn Coefficient (beta_1): {coefficient:.4f}") # Output: # Scikit-learn Intercept (beta_0): 50.9489 # Scikit-learn Coefficient (beta_1): 2.9094 # Make predictions on the training data y_pred_sk = model_sk.predict(X_sk) # Calculate evaluation metrics mse = mean_squared_error(y_sk, y_pred_sk) r2 = r2_score(y_sk, y_pred_sk) # Same as model_sk.score(X_sk, y_sk) print(f"Scikit-learn Mean Squared Error (MSE): {mse:.4f}") print(f"Scikit-learn R-squared (R2): {r2:.4f}") # Output: # Scikit-learn Mean Squared Error (MSE): 24.2842 # Scikit-learn R-squared (R2): 0.8906You'll notice the coefficients ($\hat{\beta}_0 \approx 50.95$, $\hat{\beta}_1 \approx 2.91$) and the $R^2$ value (0.891) are essentially identical to those obtained from Statsmodels. Scikit-learn provides easy access to the coefficients and common evaluation metrics like MSE and $R^2$, but doesn't automatically generate the detailed statistical summary that Statsmodels does.Visualizing the Fitted LineLet's visualize how well our fitted line represents the data. We can use the coefficients we found to plot the regression line on top of the scatter plot.# Create the scatter plot again fig_line = px.scatter(data, x='AdvertisingSpend', y='Sales', title='Sales vs. Advertising Spend with Fitted Line', labels={'AdvertisingSpend': 'Advertising Spend ($ Thousands)', 'Sales': 'Sales ($ Thousands)'}, template='plotly_white') # Add the regression line # Use coefficients from either model (they are the same) # Line equation: y = intercept + coefficient * x fig_line.add_trace(go.Scatter(x=data['AdvertisingSpend'], y=y_pred_sk, # Use predictions as y values for the line mode='lines', name='Fitted Line', line=dict(color='#fa5252', width=2))) # Use a red color from the palette # Enhance layout fig_line.update_layout( title_x=0.5, margin=dict(l=40, r=40, t=50, b=40), width=600, height=400, showlegend=True ) # Convert to JSON for web embedding line_json = fig_line.to_json(pretty=False){"layout": {"title": {"text": "Sales vs. Advertising Spend with Fitted Line", "x": 0.5}, "xaxis": {"title": {"text": "Advertising Spend ($ Thousands)"}}, "yaxis": {"title": {"text": "Sales ($ Thousands)"}}, "margin": {"l": 40, "r": 40, "t": 50, "b": 40}, "width": 600, "height": 400, "showlegend": true, "plot_bgcolor": "white", "paper_bgcolor": "white", "font": {"color": "#2a3f5f"}}, "data": [{"type": "scatter", "x": [3.75, 9.51, 7.32, 5.99, 1.56, 1.56, 0.58, 8.66, 6.01, 7.08, 0.21, 9.70, 8.32, 2.12, 1.82, 1.83, 3.04, 5.25, 4.32, 2.91, 6.12, 1.39, 2.92, 3.66, 4.56, 7.85, 2.00, 5.14, 8.50, 2.09, 1.30, 8.12, 5.16, 8.88, 9.87, 4.63, 0.72, 4.14, 2.92, 3.14, 3.87, 0.45, 7.12, 8.99, 7.01, 5.62, 6.76, 3.88, 3.59, 9.09], "y": [61.80, 78.74, 70.49, 68.58, 54.39, 53.78, 50.14, 80.08, 64.49, 72.61, 49.45, 80.44, 75.03, 55.59, 54.35, 55.95, 54.68, 69.67, 62.39, 61.86, 67.95, 53.22, 60.04, 64.63, 59.69, 78.03, 49.73, 64.91, 73.93, 53.21, 53.73, 78.11, 67.03, 74.76, 78.98, 63.50, 52.82, 66.48, 56.45, 60.94, 61.87, 55.83, 71.08, 79.51, 69.59, 65.04, 70.56, 58.21, 59.79, 76.05], "mode": "markers", "marker": {"symbol": "circle", "size": 6}, "name": "Sales"}, {"type": "scatter", "x": [0.21, 0.45, 0.58, 0.72, 1.30, 1.39, 1.56, 1.56, 1.82, 1.83, 2.00, 2.09, 2.12, 2.91, 2.92, 3.04, 3.14, 3.59, 3.66, 3.75, 3.87, 3.88, 4.14, 4.32, 4.56, 4.63, 5.14, 5.16, 5.25, 5.62, 5.99, 6.01, 6.12, 6.76, 7.01, 7.08, 7.12, 7.32, 7.85, 8.12, 8.32, 8.50, 8.66, 8.88, 8.99, 9.09, 9.51, 9.70, 9.87], "y": [51.55, 52.25, 52.64, 53.04, 54.73, 55.00, 55.49, 55.48, 56.24, 56.28, 56.75, 57.00, 57.12, 59.40, 59.43, 59.78, 60.09, 61.37, 61.59, 61.85, 62.18, 62.23, 62.98, 63.48, 64.19, 64.40, 65.88, 65.93, 66.19, 67.41, 68.35, 68.42, 68.73, 70.65, 71.32, 71.53, 71.64, 72.22, 73.80, 74.61, 75.20, 75.71, 76.20, 76.82, 77.15, 77.42, 78.66, 79.22, 79.73], "mode": "lines", "name": "Fitted Line", "line": {"color": "#fa5252", "width": 2}}]}Scatter plot of Sales vs. Advertising Spend with the OLS regression line overlaid. The line appears to capture the central trend of the data well.Checking Model Assumptions (Briefly)As mentioned earlier, linear regression relies on several assumptions for the results (especially the p-values and confidence intervals) to be reliable. These include:Linearity: The relationship between X and y is linear. (We checked this visually).Independence: The errors (residuals) are independent of each other. (Often assessed based on data collection context or specific tests like Durbin-Watson from the Statsmodels summary).Homoscedasticity: The errors have constant variance across all levels of X.Normality: The errors are normally distributed.A common way to check homoscedasticity (constant variance) and linearity visually is by plotting the residuals ($\hat{\epsilon} = y - \hat{y}$) against the fitted values ($\hat{y}$).# Calculate residuals (using Statsmodels results) residuals = results_sm.resid fitted_values = results_sm.fittedvalues # Create residual plot using Plotly fig_resid = go.Figure() fig_resid.add_trace(go.Scatter(x=fitted_values, y=residuals, mode='markers', marker=dict(color='#1c7ed6', size=6), # Blue color name='Residuals')) # Add a horizontal line at zero fig_resid.add_hline(y=0, line_width=2, line_dash="dash", line_color="#868e96") # Gray dash line fig_resid.update_layout( title='Residuals vs. Fitted Values', xaxis_title='Fitted Values (Predicted Sales)', yaxis_title='Residuals', template='plotly_white', title_x=0.5, margin=dict(l=40, r=40, t=50, b=40), width=600, height=400, showlegend=False ) # Convert to JSON for web embedding resid_json = fig_resid.to_json(pretty=False){"layout": {"title": {"text": "Residuals vs. Fitted Values", "x": 0.5}, "xaxis": {"title": {"text": "Fitted Values (Predicted Sales)"}}, "yaxis": {"title": {"text": "Residuals"}}, "shapes": [{"type": "line", "y0": 0, "y1": 0, "xref": "paper", "x0": 0, "x1": 1, "line": {"width": 2, "dash": "dash", "color": "#868e96"}}], "margin": {"l": 40, "r": 40, "t": 50, "b": 40}, "width": 600, "height": 400, "showlegend": false, "plot_bgcolor": "white", "paper_bgcolor": "white", "font": {"color": "#2a3f5f"}}, "data": [{"type": "scatter", "x": [61.85, 78.66, 72.22, 68.35, 55.49, 55.48, 52.64, 76.20, 68.42, 71.53, 51.55, 79.22, 75.20, 57.12, 56.24, 56.28, 59.78, 66.19, 63.48, 59.40, 68.73, 55.00, 59.43, 61.59, 64.19, 73.80, 56.75, 65.88, 75.71, 57.00, 54.73, 74.61, 65.93, 76.82, 79.73, 64.40, 53.04, 62.98, 59.42, 60.09, 62.18, 52.25, 71.64, 77.15, 71.32, 67.41, 70.65, 62.23, 61.37, 77.42], "y": [-0.05, 0.08, -1.74, 0.24, -1.10, -1.70, -2.50, 3.89, -3.93, 1.07, -2.09, 1.22, -0.17, -1.54, -1.88, -0.33, -5.10, 3.48, -1.10, 2.45, -0.79, -1.78, 0.61, 3.04, -4.50, 4.22, -7.02, -0.97, -1.78, -3.79, -1.00, 3.50, 1.09, -2.05, -0.74, -0.90, -0.21, 3.50, -2.97, 0.86, -0.31, 3.59, -0.56, 2.36, -1.73, -2.37, -0.09, -4.02, -1.58, -1.36], "mode": "markers", "marker": {"color": "#1c7ed6", "size": 6}, "name": "Residuals"}]}Plot of residuals (actual Sales - predicted Sales) versus the fitted (predicted) Sales values. Ideally, points should scatter randomly around the horizontal line at zero with no discernible pattern.In an ideal residual plot, the points should be randomly scattered around the horizontal line at zero, showing no clear pattern (like a curve or a funnel shape). Our plot looks reasonably random, suggesting the linearity and homoscedasticity assumptions might hold. A funnel shape (variance increasing or decreasing with fitted values) would indicate heteroscedasticity. A curved pattern would suggest the linear model might not be the best fit.Formal tests and other plots (like Q-Q plots for normality) are often used for a more rigorous assessment of these assumptions.Summary of Practical StepsIn this hands-on section, we've walked through the process of:Loading and Visualizing Data: Using Pandas and Plotly to prepare data and visually inspect the relationship between variables.Fitting the Model: Employing both Statsmodels (sm.OLS) and Scikit-learn (LinearRegression) to estimate the regression coefficients using ordinary least squares.Interpreting Results: Analyzing the output, particularly the coefficients, $R^2$, MSE, and statistical significance provided by Statsmodels.Evaluating the Fit: Calculating performance metrics ($R^2$, MSE) and visualizing the fitted line against the actual data.Checking Assumptions (Briefly): Introducing the concept of residual analysis to check for potential issues like non-linearity or non-constant variance.This practical application demonstrates how to translate the theoretical concepts of simple linear regression into actionable code and analysis. Understanding these steps is fundamental before moving on to more complex models like multiple linear regression, where we use more than one predictor variable.