While Pipeline is excellent for chaining steps sequentially, real-world datasets often contain different types of features (numerical, categorical, text) that require distinct preprocessing techniques. Applying a single scaler to both numerical and categorical columns, for instance, is inappropriate. A simple Pipeline applies each step to the entire output of the previous step, making it unsuitable for applying different transformations to different subsets of the original features.
This is where Scikit-learn's ColumnTransformer comes into play. It allows you to apply different transformers to different columns of your input data in parallel. The results from applying each transformer are then concatenated horizontally to form the final transformed dataset, which can then be passed to the next step in a larger pipeline, typically an estimator.
The core idea is to specify which transformers should be applied to which columns. You provide ColumnTransformer with a list of tuples, where each tuple contains:

- A name: a string identifying the step (e.g., 'numerical_scaling', 'categorical_encoding').
- A transformer: a transformer object (e.g., StandardScaler(), OneHotEncoder()) or even a Pipeline itself.
- The columns: the subset of columns the transformer should be applied to, given as column names, indices, or a callable such as make_column_selector.

Let's illustrate with an example. Imagine a dataset with numerical features (age, income) and a categorical feature (city). We want to scale the numerical features and one-hot encode the categorical feature.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
# Sample Data
data = {
'age': [25, 45, 30, 55, 40, 28],
'income': [50000, 80000, 60000, 120000, 75000, 52000],
'city': ['New York', 'London', 'Paris', 'New York', 'London', 'Paris'],
'target': [0, 1, 0, 1, 1, 0]
}
df = pd.DataFrame(data)
X = df[['age', 'income', 'city']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Define preprocessing steps for different column types
# Use make_column_selector for convenience
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), make_column_selector(dtype_include=np.number)),
('cat', OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object))
],
remainder='passthrough' # Keep other columns (if any)
)
# Create the full pipeline including the preprocessor and a classifier
model_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', LogisticRegression())
])
# Fit the pipeline
model_pipeline.fit(X_train, y_train)
# Make predictions (preprocessing is applied automatically)
predictions = model_pipeline.predict(X_test)
print("Sample Predictions:", predictions)
print("Pipeline Score:", model_pipeline.score(X_test, y_test))
# You can inspect the fitted transformers within the ColumnTransformer
# Access the preprocessor step
ct = model_pipeline.named_steps['preprocessor']
# Access the scaler fitted on numerical columns
scaler = ct.named_transformers_['num']
print("\nFitted Scaler Mean:", scaler.mean_)
# Access the one-hot encoder fitted on categorical columns
encoder = ct.named_transformers_['cat']
print("Fitted Encoder Categories:", encoder.categories_)
In this example:

- We use make_column_selector(dtype_include=np.number) to select all columns with numerical data types (age, income). StandardScaler is applied to these.
- We use make_column_selector(dtype_include=object) to select columns with object data types (city), typically strings in Pandas. OneHotEncoder is applied to this column. handle_unknown='ignore' is a useful parameter that prevents errors if the encoder encounters categories in the test set that weren't seen during training; it simply encodes them as all zeros (see the sketch below).
- The ColumnTransformer applies these transformers to their respective columns in parallel and concatenates their outputs.
- The combined, transformed features are then passed to the LogisticRegression classifier within the main Pipeline.
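To see that "all zeros" behavior in isolation, here is a minimal standalone sketch; the city values are illustrative, and sparse_output=False assumes scikit-learn 1.2+ (older versions use sparse=False):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit on two known cities, then transform a city unseen during fit
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc.fit(pd.DataFrame({'city': ['London', 'Paris']}))
print(enc.transform(pd.DataFrame({'city': ['Tokyo']})))  # [[0. 0.]]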
The remainder Parameter

What happens to columns that are not selected by any of the transformers specified in the transformers list? This is controlled by the remainder parameter of ColumnTransformer:
- remainder='drop' (the default): any columns not explicitly selected by a transformer are dropped from the dataset. Be careful with this default, as you might unintentionally lose features.
- remainder='passthrough': any columns not selected are kept and appended, unchanged, to the output of the specified transformations. This is useful if you have columns that don't require preprocessing or are handled by a later step.
- remainder=<estimator>: you can also pass a transformer (such as SimpleImputer or StandardScaler) to be applied to all remaining columns, as sketched below.

Choosing the correct remainder setting is important for ensuring all necessary features are processed correctly and reach the final estimator.
Besides make_column_selector, you can specify the columns for each transformer in ColumnTransformer using:

- a list of column names, e.g., ['age', 'income'];
- a list of integer positions, e.g., [0, 1];
- a slice, e.g., slice(0, 2);
- a boolean mask, e.g., [True, True, False].

Using make_column_selector is often preferred, as it is less brittle to changes in column order or to column additions and removals than integer indices.
We can visualize the structure created by combining ColumnTransformer and Pipeline:

[Figure: input data is split by the ColumnTransformer ('preprocessor'); numerical columns go to StandardScaler ('num'), categorical columns go to OneHotEncoder ('cat'); their outputs are concatenated and passed to the LogisticRegression classifier ('classifier').]
Using ColumnTransformer within a Pipeline works seamlessly with GridSearchCV. You can tune hyperparameters of both the transformers within the ColumnTransformer and the final estimator. Parameter names are constructed from the step names, separated by double underscores (__).
For example, to tune the handle_unknown parameter of the OneHotEncoder (named 'cat') inside the ColumnTransformer (named 'preprocessor'), the parameter grid key would be 'preprocessor__cat__handle_unknown'. To tune the C parameter of the LogisticRegression (named 'classifier'), the key is 'classifier__C'.
# Example parameter grid for GridSearchCV with the pipeline above
param_grid = {
    'preprocessor__num__with_mean': [True, False],  # Parameter for StandardScaler
    'preprocessor__cat__handle_unknown': ['ignore', 'error'],  # Parameter for OneHotEncoder
    'classifier__C': [0.1, 1.0, 10.0]  # Parameter for LogisticRegression
}
# GridSearchCV would then be applied to 'model_pipeline' using this 'param_grid'
# from sklearn.model_selection import GridSearchCV
# grid_search = GridSearchCV(model_pipeline, param_grid, cv=2)  # cv=2: the toy training set is too small for cv=5
# grid_search.fit(X_train, y_train)
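Once such a search has been fitted, the winning combination and the refit pipeline are available through the standard GridSearchCV attributes:

# print("Best parameters:", grid_search.best_params_)
# best_pipeline = grid_search.best_estimator_  # pipeline refit on the full training set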
ColumnTransformer is a fundamental tool for building practical machine learning pipelines in Scikit-learn, especially when dealing with the heterogeneous datasets typical of real-world problems. It allows for clean, modular, and correct application of preprocessing steps tailored to specific feature types, all while integrating perfectly with Scikit-learn's ecosystem for model evaluation and selection.