While the name "Logistic Regression" includes "Regression," it's actually one of the most fundamental and widely used algorithms for classification tasks. Building on our understanding of linear models from the previous chapter, Logistic Regression adapts the linear approach to predict probabilities for categorical outcomes. Instead of predicting a continuous value, it estimates the probability that an instance belongs to a particular class.
Recall that a standard linear regression model predicts an output $\hat{y}$ using a linear combination of the input features $x$:
$$\hat{y} = w^T x + b$$
The output $\hat{y}$ can range from $-\infty$ to $+\infty$. This is suitable for regression, but for classification we need an output that represents a probability, constrained between 0 and 1. How can we map the potentially unbounded output of the linear equation to this range?
This is where the sigmoid function, also known as the logistic function, comes into play. It takes any real-valued number z and squashes it into the range (0, 1). The sigmoid function is defined as:
$$g(z) = \frac{1}{1 + e^{-z}}$$
Here, $z$ is typically the output of the linear part of the model, so $z = w^T x + b$. The function $g(z)$ gives us the estimated probability $P(y = 1 \mid x; w, b)$ that the output $y$ is 1, given the input features $x$ and the learned parameters $w$ (weights) and $b$ (bias).
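To make this concrete, here is a minimal NumPy sketch (the parameter values and names are illustrative, not taken from any library) that passes the linear score through the sigmoid to obtain a probability:

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters and a single feature vector
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b   # linear score: w^T x + b
p = sigmoid(z)         # estimated P(y = 1 | x)
print(f"z = {z:.2f}, P(y=1|x) = {p:.3f}")
```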
Let's visualize the sigmoid function:
The sigmoid function maps any input value $z$ to an output between 0 and 1. As $z$ becomes large and positive, $g(z)$ approaches 1; as $z$ becomes large and negative, $g(z)$ approaches 0. When $z = 0$, $g(z) = 0.5$.
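As a quick sketch (assuming matplotlib is available in your environment), the curve can be drawn directly from its definition:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, linestyle="--", linewidth=0.8)  # g(0) = 0.5
plt.axvline(0.0, linestyle="--", linewidth=0.8)
plt.xlabel("z")
plt.ylabel("g(z)")
plt.title("The sigmoid (logistic) function")
plt.show()
```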
The output of the sigmoid function gives us a probability. To make a concrete class prediction (e.g., 0 or 1 for binary classification), we need a decision threshold. A common choice is 0.5.
Looking at the sigmoid function plot, we see that $g(z) \geq 0.5$ exactly when $z \geq 0$. Since $z = w^T x + b$, our decision rule becomes: predict $y = 1$ if $w^T x + b \geq 0$, and predict $y = 0$ otherwise.
The equation $w^T x + b = 0$ defines the decision boundary. For Logistic Regression with this linear form of $z$, the decision boundary is a line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions) that separates the two classes.
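Here is a minimal sketch of this thresholding step (the parameter values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Return class labels (0 or 1) for each row of X."""
    probs = sigmoid(X @ w + b)           # P(y = 1 | x) for each sample
    return (probs >= threshold).astype(int)

# In 2D the boundary is the line w[0]*x1 + w[1]*x2 + b = 0
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[3.0, 1.0],   # z =  1.5 -> class 1
              [0.0, 2.0]])  # z = -3.5 -> class 0
print(predict(X, w, b))     # [1 0]
```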
An example scatter plot showing two classes of data points and a linear decision boundary learned by Logistic Regression. Points on one side are classified as Class 0, and points on the other side as Class 1.
Just like linear regression needs a cost function to measure how well the line fits the data, logistic regression needs one too. However, using the Mean Squared Error (MSE) from linear regression with the sigmoid function results in a non-convex cost function with many local minima. This makes it difficult for optimization algorithms like gradient descent to find the global minimum reliably.
Instead, Logistic Regression uses a cost function called Log Loss (also known as Binary Cross-Entropy). For a single training example $(x^{(i)}, y^{(i)})$, where $y^{(i)}$ is the true label (0 or 1) and $h(x^{(i)}) = g(w^T x^{(i)} + b)$ is the predicted probability for class 1, the cost is:
$$\text{Cost}(h(x^{(i)}), y^{(i)}) = -\left[\, y^{(i)} \log(h(x^{(i)})) + (1 - y^{(i)}) \log(1 - h(x^{(i)})) \,\right]$$
The total cost function $J(w, b)$ over all $m$ training examples is the average of these individual costs:
$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}(h(x^{(i)}), y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log(h(x^{(i)})) + (1 - y^{(i)}) \log(1 - h(x^{(i)})) \,\right]$$
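As a small sketch (an illustrative helper, not a library function), the average log loss can be computed directly from the predicted probabilities:

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy over m examples.

    y_true holds 0/1 labels; y_prob holds predicted P(y = 1 | x).
    Probabilities are clipped away from 0 and 1 to avoid log(0).
    """
    p = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.6, 0.3])
print(f"Log loss: {log_loss(y_true, y_prob):.4f}")
```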
Let's analyze this cost. When the true label is $y^{(i)} = 1$, only the first term remains: the cost is $-\log(h(x^{(i)}))$, which is close to 0 when the predicted probability is near 1 and grows without bound as the prediction approaches 0. When $y^{(i)} = 0$, only the second term remains and penalizes predictions near 1 in the same way. For example, predicting 0.9 for a true positive costs about 0.105, while predicting 0.1 costs about 2.303. In short, confident correct predictions cost almost nothing, while confident wrong predictions are punished heavily.
This Log Loss function is convex, making it suitable for optimization algorithms like gradient descent to find the optimal parameters w and b.
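For intuition about how gradient descent exploits this convexity, here is a minimal (purely illustrative) batch gradient descent sketch; for the log loss, the gradients work out to $\frac{1}{m} X^T (h - y)$ for $w$ and the mean of $(h - y)$ for $b$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the log loss (illustrative sketch)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        h = sigmoid(X @ w + b)         # predicted probabilities
        grad_w = X.T @ (h - y) / m     # dJ/dw
        grad_b = np.mean(h - y)        # dJ/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny illustrative dataset: the class flips as the feature passes ~2
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = fit_logistic(X, y)
print("Learned w, b:", w, b)
```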
Scikit-learn provides a straightforward implementation of Logistic Regression in the sklearn.linear_model.LogisticRegression class.
Here's a basic example of how to use it:
# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification # For generating sample data
# Generate some synthetic classification data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0,
n_clusters_per_class=1, random_state=42, flip_y=0.1)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 1. Instantiate the LogisticRegression model
# Common parameters:
# - C: Inverse of regularization strength (smaller C means stronger regularization). Default is 1.0.
# - penalty: Type of regularization ('l1', 'l2', 'elasticnet', 'none'). Default is 'l2'.
# - solver: Algorithm to use for optimization. Default often 'lbfgs'.
# Different solvers support different penalties.
log_reg = LogisticRegression(solver='liblinear', random_state=42) # liblinear is good for smaller datasets
# 2. Train (fit) the model on the training data
log_reg.fit(X_train, y_train)
# 3. Make predictions on the test data
y_pred = log_reg.predict(X_test)
# 4. Predict probabilities on the test data
# Returns an array of shape (n_samples, n_classes)
# Each row sums to 1. Columns correspond to classes sorted numerically.
y_pred_proba = log_reg.predict_proba(X_test)
# Evaluate the model (we'll cover metrics in detail later)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Coefficients (w): {log_reg.coef_}")
print(f"Model Intercept (b): {log_reg.intercept_}")
print(f"Sample Predicted Probabilities (first 5):\n{y_pred_proba[:5]}")
print(f"Sample Predictions (first 5): {y_pred[:5]}")
print(f"Test Accuracy: {accuracy:.4f}")
Key points about the Scikit-learn implementation:
- Regularization: By default, LogisticRegression applies L2 regularization (controlled by the C parameter) to prevent overfitting, especially when dealing with many features. Lower values of C increase the regularization strength.
- Solvers: Several optimization algorithms (solver) are available ('liblinear', 'lbfgs', 'sag', 'saga', 'newton-cg'). Some solvers are faster for large datasets, while others support different regularization types.
- predict_proba: This method is useful when you need the actual probability estimates, not just the final class prediction. It returns an array where each row corresponds to a sample and each column corresponds to a class, containing the probability of that sample belonging to that class.

What if you have more than two classes (e.g., {Cat, Dog, Bird})? Logistic Regression can be extended to handle multi-class problems using two main strategies:

- One-vs-Rest (OvR): The strategy LogisticRegression uses with the 'liblinear' solver. For a problem with K classes, K separate binary logistic regression classifiers are trained. The first classifier predicts Class 1 vs. {All Other Classes}, the second predicts Class 2 vs. {All Other Classes}, and so on. When predicting for a new instance, all K classifiers provide a probability, and the class corresponding to the classifier with the highest probability is chosen.
- Multinomial (Softmax): A single model is trained that outputs a probability distribution over all K classes. Use multi_class='multinomial' along with a compatible solver like 'lbfgs' or 'newton-cg' (see the sketch after this list).

For many practical purposes, the OvR strategy often works well.
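As a rough sketch (the dataset and parameter choices below are illustrative), fitting a multi-class model looks almost identical to the binary case:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 3-class data (illustrative)
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# With the 'lbfgs' solver, recent scikit-learn versions fit a multinomial
# (softmax) model for multi-class targets automatically; older versions
# accept multi_class='multinomial' explicitly.
clf = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test[:3])   # shape (3, 3): one column per class
print(np.round(proba, 3))
print("Predicted classes:", clf.predict(X_test[:3]))
print("Test accuracy:", clf.score(X_test, y_test))
```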
Strengths:

- Simple, fast to train, and efficient at prediction time, even on fairly large datasets.
- Produces probability estimates rather than just hard class labels.
- Highly interpretable: each coefficient describes how a feature shifts the log-odds of the positive class.
- Built-in regularization (L1/L2) helps control overfitting.

Weaknesses:

- Assumes a linear decision boundary in the original feature space, so it struggles with strongly non-linear class structure unless features are transformed or engineered.
- Highly correlated features (multicollinearity) can make the coefficient estimates unstable and hard to interpret.
- Often outperformed by more flexible models (e.g., tree ensembles or neural networks) on complex tasks.
Logistic Regression is a powerful yet simple classification algorithm that forms a cornerstone of many machine learning applications. Understanding its mechanics, including the sigmoid function, decision boundary, and cost function, provides a solid foundation before moving on to other classification techniques like K-Nearest Neighbors and Support Vector Machines, which we will discuss next.