Linear and Logistic Regression represent foundational algorithms in supervised machine learning, widely used for regression and classification tasks, respectively. Their simplicity, interpretability, and computational efficiency make them excellent starting points for many modeling problems. We will use the popular scikit-learn library in Python to implement these models.
Linear Regression aims to model the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation to the observed data. The goal is to find the best-fitting straight line (or hyperplane in higher dimensions) through the data points. The equation for a simple linear regression (one feature) is $y = \beta_0 + \beta_1 x$, where $y$ is the target, $x$ is the feature, $\beta_0$ is the intercept, and $\beta_1$ is the coefficient for the feature. For multiple features, this extends to:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
Scikit-learn provides the LinearRegression class within the sklearn.linear_model module. Let's walk through a basic implementation.
First, we need some data. We can generate synthetic regression data using scikit-learn's make_regression
function.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import make_regression
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=42)
# Convert to DataFrame for easier handling (optional)
X_df = pd.DataFrame(X, columns=['Feature'])
y_s = pd.Series(y, name='Target')
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"Training set shape: X={X_train.shape}, y={y_train.shape}")
print(f"Testing set shape: X={X_test.shape}, y={y_test.shape}")
Now, we instantiate the LinearRegression model and train it using the fit method on our training data.
# Initialize the Linear Regression model
lr_model = LinearRegression()
# Train the model
lr_model.fit(X_train, y_train)
# Print the learned coefficients
print(f"Intercept (beta_0): {lr_model.intercept_:.2f}")
print(f"Coefficient (beta_1): {lr_model.coef_[0]:.2f}")
With the model trained, we can make predictions on new, unseen data (our test set) using the predict method.
# Make predictions on the test set
y_pred = lr_model.predict(X_test)
Finally, we evaluate the model's performance. Common metrics for regression include Mean Squared Error (MSE) and the R-squared (R2) score. MSE measures the average squared difference between the actual and predicted values (lower is better), while R2 represents the proportion of the variance in the dependent variable that is predictable from the independent variables (closer to 1 is better).
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2 Score): {r2:.2f}")
We can visualize the results to see how well the regression line fits the test data.
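The figure below was produced with Plotly; a minimal sketch that reproduces it (assuming Plotly is installed) could look like this.
import plotly.graph_objects as go
# Sort test points by feature value so the fitted line renders cleanly
order = X_test.flatten().argsort()
fig = go.Figure()
fig.add_trace(go.Scatter(x=X_test.flatten(), y=y_test, mode='markers', name='Actual Test Data'))
fig.add_trace(go.Scatter(x=X_test.flatten()[order], y=y_pred[order], mode='lines', name='Regression Line'))
fig.update_layout(title='Linear Regression Fit', xaxis_title='Feature', yaxis_title='Target', template='plotly_white')
fig.show()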
{"layout": {"title": "Linear Regression Fit", "xaxis": {"title": "Feature"}, "yaxis": {"title": "Target"}, "legend": {"title": {"text": "Data"}}, "template": "plotly_white"}, "data": [{"type": "scatter", "x": X_test.flatten().tolist(), "y": y_test.tolist(), "mode": "markers", "name": "Actual Test Data", "marker": {"color": "#339af0", "size": 8}}, {"type": "scatter", "x": X_test.flatten().tolist(), "y": y_pred.tolist(), "mode": "lines", "name": "Regression Line", "line": {"color": "#f03e3e", "width": 3}}]}
Scatter plot showing the actual data points from the test set and the fitted linear regression line.
Despite its name, Logistic Regression is used for classification tasks, typically binary classification (predicting one of two outcomes). It models the probability that an input belongs to a particular class using the logistic function, also known as the sigmoid function. The sigmoid function maps any real-valued number into a value between 0 and 1.
The formula for the probability of the positive class (often denoted as 1) is:
$P(y=1 \mid X) = \dfrac{1}{1 + e^{-z}}$, where $z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$. The output $P(y=1 \mid X)$ is the estimated probability. A threshold (commonly 0.5) is then used to convert this probability into a class prediction (e.g., if the probability is greater than 0.5, predict class 1; otherwise predict class 0).
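To make this mapping concrete, here is a small NumPy sketch of the sigmoid and the 0.5 threshold rule; the z values are arbitrary illustrations.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])        # example linear combinations z = beta_0 + beta_1*x_1 + ...
probs = sigmoid(z)                    # mapped into (0, 1)
preds = (probs > 0.5).astype(int)     # apply the 0.5 threshold
print(probs.round(3), preds)          # [0.119 0.5   0.953] [0 0 1]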
Scikit-learn provides the LogisticRegression class, also in sklearn.linear_model. Let's implement it.
We'll start by generating synthetic classification data using make_classification.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import make_classification
import plotly.graph_objects as go
# Generate synthetic classification data
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0,
n_clusters_per_class=1, flip_y=0.1, random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"Training set shape: X={X_train.shape}, y={y_train.shape}")
print(f"Testing set shape: X={X_test.shape}, y={y_test.shape}")
Next, instantiate and train the LogisticRegression model.
# Initialize the Logistic Regression model
log_reg_model = LogisticRegression(random_state=42)
# Train the model
log_reg_model.fit(X_train, y_train)
# Print the learned coefficients and intercept
print(f"Intercept: {log_reg_model.intercept_[0]:.2f}")
print(f"Coefficients: {log_reg_model.coef_[0][0]:.2f}, {log_reg_model.coef_[0][1]:.2f}")
Make predictions on the test set. The predict method outputs the predicted class labels directly, while predict_proba gives the probability estimates for each class.
# Make predictions
y_pred_class = log_reg_model.predict(X_test)
# Predict probabilities
y_pred_proba = log_reg_model.predict_proba(X_test)
# Display the first 5 predictions and their probabilities
print("First 5 Predicted Classes:", y_pred_class[:5])
print("First 5 Predicted Probabilities (Class 0, Class 1):\n", y_pred_proba[:5].round(3))
Evaluate the model using classification metrics. Accuracy is a common starting point, but the confusion matrix provides a more detailed view of performance, showing true positives, true negatives, false positives, and false negatives. The classification_report function provides precision, recall, and F1-score, which will be discussed in more detail later.
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred_class)
conf_matrix = confusion_matrix(y_test, y_pred_class)
class_report = classification_report(y_test, y_pred_class)
print(f"Accuracy: {accuracy:.2f}")
print("\nConfusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)
To visualize the core component of logistic regression, the sigmoid function, we can plot it.
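A minimal sketch of the plotting code (again assuming Plotly) could look like this.
import numpy as np
import plotly.graph_objects as go

z = np.linspace(-10, 10, 100)
fig = go.Figure(go.Scatter(x=z, y=1 / (1 + np.exp(-z)), mode='lines', name='Sigmoid'))
fig.update_layout(title='Sigmoid (Logistic) Function', xaxis_title='z (Linear Input)',
                  yaxis_title='Sigmoid(z) (Probability)', template='plotly_white')
fig.show()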
{"layout": {"title": "Sigmoid (Logistic) Function", "xaxis": {"title": "z (Linear Input)", "range": [-10, 10]}, "yaxis": {"title": "Sigmoid(z) (Probability)", "range": [-0.1, 1.1]}, "template": "plotly_white"}, "data": [{"type": "scatter", "x": np.linspace(-10, 10, 100).tolist(), "y": (1 / (1 + np.exp(-np.linspace(-10, 10, 100)))).tolist(), "mode": "lines", "name": "Sigmoid", "line": {"color": "#7048e8", "width": 3}}]}
The sigmoid function transforms a linear combination of inputs (z) into a probability value between 0 and 1.
Linear models can sometimes overfit, especially when you have many features or highly correlated features. Regularization is a technique used to prevent overfitting by adding a penalty term to the model's loss function. This penalty discourages overly complex models with large coefficient values.
The two most common types of regularization are:
- L1 Regularization (Lasso): adds a penalty proportional to the absolute values of the coefficients. It can shrink some coefficients exactly to zero, effectively performing feature selection. For regression, it is implemented in the Lasso class.
- L2 Regularization (Ridge): adds a penalty proportional to the squared values of the coefficients. It shrinks coefficients toward zero without eliminating them entirely. For regression, it is implemented in the Ridge class.
Logistic Regression in scikit-learn incorporates regularization by default (penalty='l2'). You can change the penalty type ('l1', 'elasticnet', 'none') and control its strength using the C parameter, which is the inverse of the regularization strength; smaller C values mean stronger regularization.
# Example: Logistic Regression with L1 penalty
log_reg_l1 = LogisticRegression(penalty='l1', C=0.5, solver='liblinear', random_state=42)
log_reg_l1.fit(X_train, y_train)
print(f"L1 Coefficients: {log_reg_l1.coef_[0][0]:.2f}, {log_reg_l1.coef_[0][1]:.2f}")
# Example: Ridge Regression (Linear Regression with L2 penalty)
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
# Regenerate regression data, since X_train/y_train now hold the classification data
X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=15, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)
ridge_model = Ridge(alpha=1.0)  # alpha is the regularization strength
ridge_model.fit(X_train_reg, y_train_reg)
print(f"Ridge Coefficients: {ridge_model.coef_}")
(Note: the Ridge example regenerates regression data as X_train_reg and y_train_reg, because the earlier X_train and y_train were replaced by the classification data.)
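The Lasso class mentioned above follows the same pattern; here is a minimal sketch on the regenerated regression data (the alpha value is just an illustration).
from sklearn.linear_model import Lasso
# L1-regularized linear regression; larger alpha pushes more coefficients toward exactly zero
lasso_model = Lasso(alpha=0.5)
lasso_model.fit(X_train_reg, y_train_reg)
print(f"Lasso Coefficients: {lasso_model.coef_}")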
Strengths:
- Simple, fast to train, and computationally efficient, even on large datasets.
- Highly interpretable: the learned coefficients indicate the direction and relative magnitude of each feature's contribution.
- Provide solid baselines before moving to more complex models.
Weaknesses:
- Assume a linear relationship between the features and the target (or the log-odds, for Logistic Regression), so they can underfit complex, non-linear patterns.
- Sensitive to outliers and to highly correlated features, which is one motivation for regularization.
Linear and Logistic Regression are fundamental tools in a data scientist's toolkit. Mastering their implementation and understanding their characteristics provides a solid foundation before moving on to the more complex tree-based and ensemble methods discussed next.