As discussed in the chapter introduction, managing preprocessing steps and model training separately can lead to repetitive code and potential errors, particularly concerning data leakage during evaluation. Scikit-learn provides an elegant solution: the Pipeline object found in the sklearn.pipeline module.
A Pipeline allows you to sequentially chain multiple data transformation steps (like scaling or encoding) and a final prediction step (an estimator such as a classifier or regressor) into a single Scikit-learn object. This composite object behaves like a standard Scikit-learn estimator, with fit, predict, and potentially transform methods.
You create a Pipeline by providing a list of steps. Each step is defined as a tuple containing:
1. A name: a string you choose to identify the step (e.g., 'scaler', 'classifier').
2. An estimator instance: the transformer or model object for that step (e.g., StandardScaler(), LogisticRegression()).
All steps except the last one must be transformers (i.e., they must have a transform method). The last step can be any estimator (transformer, classifier, regressor, etc.).
Let's illustrate with a common workflow: scaling numerical features and then training a Logistic Regression model.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
# 1. Generate sample data
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# 2. Define the steps for the pipeline
steps = [
('scaler', StandardScaler()), # Step 1: Scale data
('classifier', LogisticRegression()) # Step 2: Classify
]
# 3. Create the Pipeline object
pipe = Pipeline(steps=steps)
# 4. The pipeline object now acts like a single estimator
print(pipe)
# Fit the entire pipeline on the training data:
# StandardScaler runs fit_transform, then LogisticRegression is fit
pipe.fit(X_train, y_train)
# Make predictions on the test data
# StandardScaler transforms, LogisticRegression predicts
y_pred = pipe.predict(X_test)
# Evaluate (using accuracy for simplicity here)
accuracy = pipe.score(X_test, y_test)
print(f"\nPipeline Accuracy: {accuracy:.4f}")
Pipeline(steps=[('scaler', StandardScaler()),
('classifier', LogisticRegression())])
Pipeline Accuracy: 0.9200
In this example:
1. We define two steps: 'scaler' using StandardScaler and 'classifier' using LogisticRegression.
2. We pass this list of steps to the Pipeline constructor.
3. pipe.fit(X_train, y_train) first calls fit_transform on the StandardScaler using X_train and y_train (the scaler ignores y_train). The transformed X_train is then passed to the fit method of the LogisticRegression model along with y_train.
4. pipe.predict(X_test) first calls transform on the already fitted StandardScaler using X_test. The transformed X_test is then passed to the predict method of the fitted LogisticRegression model.
This ensures that the StandardScaler is fitted only on the training data, and both the training and test data are transformed using the same fitted scaler, preventing data leakage.
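One way to see this in practice is to inspect the fitted scaler after calling fit. The sketch below is a quick check, assuming the pipe object from the example above has already been fitted; it uses the Pipeline's named_steps attribute to access the step by the name we assigned.
# Access the fitted scaler by the step name we chose ('scaler')
fitted_scaler = pipe.named_steps['scaler']
# These statistics were learned from X_train only; pipe.predict(X_test)
# reuses them, so the test set never influences the scaling
print("Mean learned from X_train: ", fitted_scaler.mean_)
print("Scale learned from X_train:", fitted_scaler.scale_)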
make_pipeline
Scikit-learn also offers a helper function, make_pipeline, which simplifies pipeline creation by automatically generating names for the steps. Each name is derived from the lowercased class name of the corresponding component.
from sklearn.pipeline import make_pipeline
# Create the same pipeline using make_pipeline
# Names will be 'standardscaler' and 'logisticregression'
simple_pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(simple_pipe)
# You can use it exactly like the previous pipeline
simple_pipe.fit(X_train, y_train)
accuracy_simple = simple_pipe.score(X_test, y_test)
print(f"\nmake_pipeline Accuracy: {accuracy_simple:.4f}")
Pipeline(steps=[('standardscaler', StandardScaler()),
('logisticregression', LogisticRegression())])
make_pipeline Accuracy: 0.9200
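If you want to confirm the auto-generated names, a quick check (assuming the simple_pipe object defined above) is to list the pipeline's named_steps:
# The step names are simply the lowercased class names
print(list(simple_pipe.named_steps))
# ['standardscaler', 'logisticregression']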
While make_pipeline is convenient for simple, linear pipelines, explicitly naming steps with the Pipeline constructor is often preferred when you need finer control, especially for accessing specific steps later or performing hyperparameter tuning, as we'll see shortly.
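As a brief preview of that finer control, step names become prefixes in hyperparameter grids via the step__parameter naming convention. The sketch below tunes the regularization strength C of the step we named 'classifier'; the candidate values are arbitrary illustrative choices, not recommendations.
from sklearn.model_selection import GridSearchCV
# The 'classifier__C' key routes C to the step named 'classifier'
param_grid = {'classifier__C': [0.01, 0.1, 1.0, 10.0]}  # illustrative values only
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)  # scaling is refit inside each CV fold, avoiding leakage
print("Best C:", grid.best_params_['classifier__C'])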
Conceptually, the data flows through the pipeline steps sequentially:
Data flows into the first step (the scaler). During fitting, the pipeline calls fit_transform on it; during prediction or transformation, it calls transform. The output feeds into the next step (the estimator), which calls fit or predict accordingly, producing the final output.
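To make this flow concrete, here is a rough manual equivalent of what pipe.fit and pipe.predict do behind the scenes. This is a simplified sketch, not the actual Pipeline implementation, and it reuses the objects and data from the example above.
# Roughly what pipe.fit(X_train, y_train) does:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # first step: fit_transform
clf = LogisticRegression()
clf.fit(X_train_scaled, y_train)                 # final step: fit
# Roughly what pipe.predict(X_test) does:
X_test_scaled = scaler.transform(X_test)         # transform only, no refitting
y_pred_manual = clf.predict(X_test_scaled)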
Creating even these simple pipelines encapsulates your preprocessing and modeling logic into a single object. This simplifies your code, makes your workflow more reproducible, and, critically, sets the stage for correctly applying techniques like cross-validation and hyperparameter tuning without accidental data leakage, which we will explore next.