Managing preprocessing steps and model training separately can lead to repetitive code and potential errors, particularly concerning data leakage during evaluation. Scikit-learn provides an elegant solution: the Pipeline object found in the sklearn.pipeline module.
A Pipeline allows you to sequentially chain multiple data transformation steps (like scaling or encoding) and a final prediction step (an estimator like a classifier or regressor) into a single Scikit-learn object. This composite object behaves like a standard Scikit-learn estimator, having fit, predict, and potentially transform methods.
You create a Pipeline by providing a list of steps. Each step is defined as a tuple containing:
1. A unique name for the step, given as a string (e.g., 'scaler', 'classifier').
2. The estimator instance itself (e.g., StandardScaler(), LogisticRegression()).
All steps except the last one must be transformers (i.e., they must have a transform method). The last step can be any estimator (transformer, classifier, regressor, etc.).
Let's illustrate with a common workflow: scaling numerical features and then training a Logistic Regression model.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
# 1. Generate sample data
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# 2. Define the steps for the pipeline
steps = [
('scaler', StandardScaler()), # Step 1: Scale data
('classifier', LogisticRegression()) # Step 2: Classify
]
# 3. Create the Pipeline object
pipe = Pipeline(steps=steps)
# 4. The pipeline object now acts like a single estimator
print(pipe)
# Fit the entire pipeline on the training data
# StandardScaler is fit_transformed, LogisticRegression is fit
pipe.fit(X_train, y_train)
# Make predictions on the test data
# StandardScaler transforms, LogisticRegression predicts
y_pred = pipe.predict(X_test)
# Evaluate (using accuracy for simplicity here)
accuracy = pipe.score(X_test, y_test)
print(f"\nPipeline Accuracy: {accuracy:.4f}")
Pipeline(steps=[('scaler', StandardScaler()),
('classifier', LogisticRegression())])
Pipeline Accuracy: 0.9200
In this example:
1. We defined two named steps: 'scaler' using StandardScaler and 'classifier' using LogisticRegression.
2. We passed this list of steps to the Pipeline constructor.
3. Calling pipe.fit(X_train, y_train) first calls fit_transform on the StandardScaler using X_train and y_train. The transformed X_train is then passed to the fit method of the LogisticRegression model along with y_train.
4. Calling pipe.predict(X_test) first calls transform on the already fitted StandardScaler using X_test. The transformed X_test is then passed to the predict method of the fitted LogisticRegression model.
This ensures that the StandardScaler is fitted only on the training data, and both the training and test data are transformed using the same fitted scaler, preventing data leakage, as the short manual sketch below illustrates.
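To make these mechanics concrete, here is a minimal sketch of the equivalent manual workflow the pipeline runs for you; the names manual_scaler, manual_clf, and manual_pred are purely illustrative and not part of the scikit-learn API.
# Manual equivalent of pipe.fit and pipe.predict
manual_scaler = StandardScaler()
manual_clf = LogisticRegression()
# fit: the scaler is fit_transformed on the training data only
X_train_scaled = manual_scaler.fit_transform(X_train)
manual_clf.fit(X_train_scaled, y_train)
# predict: the already fitted scaler only transforms the test data
X_test_scaled = manual_scaler.transform(X_test)
manual_pred = manual_clf.predict(X_test_scaled)
# These predictions should match the pipeline's predictions
print(np.array_equal(manual_pred, pipe.predict(X_test)))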
make_pipeline
Scikit-learn also offers a helper function, make_pipeline, which simplifies the creation process by automatically generating names for the steps. The names are derived from the lowercase class name of each component.
from sklearn.pipeline import make_pipeline
# Create the same pipeline using make_pipeline
# Names will be 'standardscaler' and 'logisticregression'
simple_pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(simple_pipe)
# You can use it exactly like the previous pipeline
simple_pipe.fit(X_train, y_train)
accuracy_simple = simple_pipe.score(X_test, y_test)
print(f"\nmake_pipeline Accuracy: {accuracy_simple:.4f}")
Pipeline(steps=[('standardscaler', StandardScaler()),
('logisticregression', LogisticRegression())])
make_pipeline Accuracy: 0.9200
While make_pipeline is convenient for simple, linear pipelines, explicitly naming steps using the Pipeline constructor is often preferred when you need finer control, especially when accessing specific steps later or performing hyperparameter tuning, as we'll see shortly.
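For example, here is a brief sketch of both conveniences using the pipe object fitted above; the particular parameter shown (classifier__C) is chosen purely for illustration:
# Access a fitted step by its name
fitted_clf = pipe.named_steps['classifier']
print(fitted_clf.coef_.shape)  # one coefficient per (scaled) feature
# Step names also prefix hyperparameter names, using the '<step>__<parameter>' convention
pipe.set_params(classifier__C=0.5)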
The data flows through the pipeline steps sequentially:
Data flows into the first step (the scaler). During fitting, the pipeline calls fit_transform on it; during prediction or transformation, it calls transform. The output feeds into the next step (the estimator), which calls fit or predict accordingly, producing the final output.
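As a small illustration of this flow, you can inspect the intermediate output that the 'scaler' step hands to the 'classifier' step, again using the fitted pipe from above:
# Intermediate data produced by the first step for the test set
X_test_after_scaler = pipe.named_steps['scaler'].transform(X_test)
print(X_test_after_scaler.mean(axis=0))  # roughly zero-centered, since the scaler was fit on X_train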
Creating even these simple pipelines encapsulates your preprocessing and modeling logic into a single object. This simplifies your code, makes your workflow more reproducible, and, critically, sets the stage for correctly applying techniques like cross-validation and hyperparameter tuning without accidental data leakage, which we will explore next.