Okay, let's put theory into practice. In this section, we will build a complete machine learning workflow using Pipeline and ColumnTransformer. We'll handle feature preprocessing (imputation, scaling, encoding) and model training within a single pipeline object, and then use GridSearchCV to find the best hyperparameters for this combined workflow. This approach exemplifies how to create cleaner, more reliable machine learning systems.
We'll use a dataset that requires different preprocessing steps for different types of features, making it ideal for demonstrating ColumnTransformer. Let's work with a simplified version of the popular Titanic dataset.
Imagine we want to predict passenger survival on the Titanic based on features like age, passenger class, sex, and embarkation point. This dataset contains numerical features (Age, Fare), categorical features (Pclass, Sex, Embarked), and often includes missing values, requiring careful preprocessing.
First, let's prepare a sample dataset and perform the initial train-test split. For simplicity, we'll create a small Pandas DataFrame. In a real project, you would load this data from a file or database.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Sample Data (Simplified Titanic)
data = {
    'Pclass': [3, 1, 3, 1, 2, 3, 1, 3, 2, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male', 'male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 32.0],
    'Fare': [7.25, 71.28, 7.92, 53.1, 8.05, 8.45, 51.86, 21.07, 13.00, 7.89],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'C', 'S'],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 0] # Target variable
}
df = pd.DataFrame(data)
# Separate features (X) and target (y)
X = df.drop('Survived', axis=1)
y = df['Survived']
# Identify feature types
numerical_features = ['Age', 'Fare']
categorical_features = ['Pclass', 'Sex', 'Embarked'] # Pclass treated as categorical here
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Training Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
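Before choosing preprocessing steps, it can help to check which columns actually contain missing values and which ones pandas treats as numeric. The following is a minimal sketch on a hypothetical four-row frame (`df_check` and its values are made up for illustration and only mirror the tutorial's column layout).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same columns as the tutorial's features
df_check = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing entries per column to see where imputation is needed
missing_counts = df_check.isna().sum()
print(missing_counts)

# List columns by dtype; Pclass is stored as an integer even though
# the tutorial chooses to treat it as a categorical feature
numeric_cols = df_check.select_dtypes(include="number").columns.tolist()
other_cols = df_check.select_dtypes(exclude="number").columns.tolist()
print("Numeric dtype:", numeric_cols)
print("Non-numeric dtype:", other_cols)
```

Note that dtype alone does not decide the feature type: Pclass appears among the numeric columns here, which is why the feature lists are written out by hand.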
We need different preprocessing for numerical and categorical features:
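- Numerical features (Age, Fare):
  - Impute missing values with the median.
  - Scale the values using standardization (StandardScaler).
- Categorical features (Pclass, Sex, Embarked):
  - Impute missing values with the most frequent category.
  - Apply one-hot encoding.
We can define a small pipeline for each feature type using Pipeline.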
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Preprocessing pipeline for numerical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# Preprocessing pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')) # Ignore categories in test set not seen in training
])
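To make the effect of these two small pipelines concrete, here is a sketch on toy arrays (the arrays `toy_numeric`, `seen`, and `unseen` are invented for illustration). It shows median imputation followed by standardization, and what handle_unknown='ignore' does with a category that was not present during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy numerical column with one missing value (illustrative data only)
toy_numeric = np.array([[1.0], [np.nan], [3.0], [5.0]])

num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
scaled = num_pipe.fit_transform(toy_numeric)

# The imputer learned the median of the observed values (1, 3, 5)
learned_median = num_pipe.named_steps["imputer"].statistics_[0]
print("Learned median:", learned_median)
print("Imputed and scaled column:", scaled.ravel())

# Toy categorical column: fit on 'S' and 'C', then transform an unseen 'Q'
seen = np.array([["S"], ["C"]], dtype=object)
unseen = np.array([["Q"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(seen)

# With handle_unknown='ignore', an unseen category becomes an all-zero row
unseen_row = encoder.transform(unseen).toarray()
print("Encoding of unseen category:", unseen_row)
```

The all-zero row is the reason 'ignore' is convenient for small datasets, where a category may appear only in the test split.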
Now, we use ColumnTransformer to apply the correct transformer pipeline to the corresponding columns.
from sklearn.compose import ColumnTransformer
# Create the preprocessor object using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough' # Keep other columns (if any) - not strictly needed here
)
The ColumnTransformer takes a list of tuples. Each tuple contains:
- A name for the transformer (e.g., 'num', 'cat').
- The transformer object (our numerical_transformer and categorical_transformer pipelines).
- The list of columns to which the transformer is applied.
Let's combine the preprocessor with a classifier, for example, LogisticRegression, into a final Pipeline.
from sklearn.linear_model import LogisticRegression
# Create the full pipeline including preprocessing and a classifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42)) # Using liblinear for simplicity
])
# Display the pipeline structure (optional)
from sklearn import set_config
set_config(display='diagram') # In a notebook, displaying the object itself renders an HTML diagram
print(model_pipeline) # print() always shows the plain-text representation
This model_pipeline object now encapsulates our entire workflow: imputing, scaling, encoding, and classifying.
The real power comes when tuning hyperparameters across the entire pipeline. We can tune parameters of the classifier and potentially parameters within the preprocessing steps. Notice how we specify parameters using the step names followed by double underscores (__).
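If you are unsure which parameter names are valid, get_params() lists every tunable name of a pipeline, and set_params() accepts the same double-underscore names. The following sketch uses a simplified two-step pipeline (`pipe`, with step names 'scaler' and 'classifier' chosen for illustration) rather than the full workflow.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified two-step pipeline used only to illustrate parameter naming
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Nested parameters follow the pattern step_name__parameter_name
nested_names = sorted(name for name in pipe.get_params() if "__" in name)
print(nested_names)

# The same names work with set_params, which is what GridSearchCV relies on
pipe.set_params(classifier__C=0.5)
updated_c = pipe.named_steps["classifier"].C
print("classifier C is now:", updated_c)
```

A misspelled name in a parameter grid raises an error at fit time, so checking against get_params() first can save a search run.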
Let's define a parameter grid to search over:
- The imputation strategy for the numerical features (preprocessor__num__imputer__strategy).
- The C parameter (inverse of regularization strength) for LogisticRegression (classifier__C).
from sklearn.model_selection import GridSearchCV
# Define the parameter grid to search
# Note the naming convention: step_name__parameter_name
# For nested pipelines: outerstep_name__innerstep_name__parameter_name
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10.0],
    # We could also tune categorical imputer strategy or OneHotEncoder params if needed
    # 'preprocessor__cat__imputer__strategy': ['most_frequent', 'constant'],
    # 'preprocessor__cat__onehot__handle_unknown': ['ignore', 'error'] # Usually 'ignore' is safer
}
# Create the GridSearchCV object
grid_search = GridSearchCV(model_pipeline, param_grid, cv=3, scoring='accuracy') # Using 3-fold CV for small dataset
# Fit GridSearchCV on the training data
# This will apply preprocessing *within* each CV fold correctly
grid_search.fit(X_train, y_train)
print("\nGrid Search Results:")
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best cross-validation accuracy score: {grid_search.best_score_:.4f}")
When grid_search.fit() runs, it performs cross-validation. For each split within the cross-validation, it fits the entire pipeline (including the specified preprocessor parameters for that iteration) on the training fold and evaluates it on the validation fold. This ensures that preprocessing steps like imputation and scaling are learned only from the training portion of each fold, preventing data leakage.
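The per-fold behaviour described above can be sketched as an explicit loop. This is not how GridSearchCV is implemented internally, only an illustration of the idea for a single parameter setting, using synthetic data (`X_demo`, `y_demo`) and a simplified scaler-plus-classifier pipeline as assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, deterministic data used only for this illustration
X_demo = np.column_stack([np.linspace(-3, 3, 30), np.tile([0.0, 1.0, 2.0], 10)])
y_demo = (X_demo[:, 0] > 0).astype(int)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

cv = StratifiedKFold(n_splits=3)
fold_scores = []
fold_scaler_means = []

for train_idx, val_idx in cv.split(X_demo, y_demo):
    # A fresh, unfitted copy of the whole pipeline for every fold
    fold_pipe = clone(pipe)
    # The scaler statistics are learned from the training fold only
    fold_pipe.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scaler_means.append(fold_pipe.named_steps["scaler"].mean_.copy())
    # The validation fold is transformed with those training-fold statistics
    fold_scores.append(fold_pipe.score(X_demo[val_idx], y_demo[val_idx]))

# cross_val_score performs the same clone / fit / score cycle
reference_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)

print("Manual fold scores:", fold_scores)
print("cross_val_score:   ", reference_scores)
```

Scaling the full dataset once before splitting would instead let validation rows influence the scaler's mean and variance, which is the leakage the pipeline avoids.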
Finally, let's evaluate the performance of the best pipeline found by GridSearchCV on our held-out test set. GridSearchCV automatically refits the best model found on the entire training set, so grid_search.best_estimator_ is ready to use.
from sklearn.metrics import accuracy_score, classification_report
# Get the best pipeline found by GridSearchCV
best_pipeline = grid_search.best_estimator_
# Make predictions on the test set
y_pred = best_pipeline.predict(X_test)
# Evaluate the best pipeline
test_accuracy = accuracy_score(y_test, y_pred)
print(f"\nTest set accuracy of the best pipeline: {test_accuracy:.4f}")
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred))
This practical exercise demonstrates how to construct a comprehensive pipeline using Pipeline and ColumnTransformer, and how to tune its hyperparameters efficiently using GridSearchCV. This approach leads to more organized code, reduces the risk of common errors like data leakage during preprocessing, and simplifies the deployment of your machine learning models.
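One reason a single pipeline object simplifies deployment is that preprocessing and model can be saved and restored together. The sketch below assumes joblib as the persistence tool and a hypothetical file name (`model_pipeline.joblib`); neither is prescribed by this section, and the toy data and simplified pipeline are for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data and a simplified pipeline standing in for the tuned workflow
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe.fit(X_demo, y_demo)

# Save and reload the fitted pipeline (hypothetical file name, temporary directory)
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model_pipeline.joblib")
    joblib.dump(pipe, model_path)
    restored = joblib.load(model_path)

# The restored object applies the same scaling and classifier to raw inputs
same_predictions = np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

In practice, a saved pipeline should be loaded with the same scikit-learn version that created it, and only from sources you trust.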
© 2025 ApX Machine Learning