Chaining multiple preprocessing and modeling steps is a common practice in machine learning. Datasets frequently contain different types of features, such as numerical, categorical, and text, each requiring distinct preprocessing techniques. For instance, applying a single scaling method to both numerical and categorical columns is inappropriate. When transformations are applied sequentially, each step typically processes the entire output of the preceding operation. This approach becomes unsuitable when different transformations are needed for various subsets of the original features.
This is where Scikit-learn's ColumnTransformer comes into play. It allows you to apply different transformers to different columns of your input data in parallel. The results from applying each transformer are then concatenated horizontally to form the final transformed dataset, which can then be passed to the next step in a larger pipeline, typically an estimator.
The core idea is to specify which transformers should be applied to which columns. You provide ColumnTransformer with a list of tuples, where each tuple contains:
- A name: a string identifying the step (e.g., 'numerical_scaling', 'categorical_encoding').
- A transformer: the transformer object to apply (e.g., StandardScaler(), OneHotEncoder()) or even a Pipeline itself.
- The columns: the columns the transformer should be applied to, given by name, position, or a callable such as make_column_selector.

Let's illustrate with an example. Imagine a dataset with numerical features (age, income) and a categorical feature (city). We want to scale the numerical features and one-hot encode the categorical feature.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
# Sample Data
data = {
    'age': [25, 45, 30, 55, 40, 28],
    'income': [50000, 80000, 60000, 120000, 75000, 52000],
    'city': ['New York', 'London', 'Paris', 'New York', 'London', 'Paris'],
    'target': [0, 1, 0, 1, 1, 0]
}
df = pd.DataFrame(data)
X = df[['age', 'income', 'city']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Define preprocessing steps for different column types
# Use make_column_selector for convenience
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), make_column_selector(dtype_include=np.number)),
        ('cat', OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object))
    ],
    remainder='passthrough'  # Keep other columns (if any)
)
# Create the full pipeline including the preprocessor and a classifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
# Fit the pipeline
model_pipeline.fit(X_train, y_train)
# Make predictions (preprocessing is applied automatically)
predictions = model_pipeline.predict(X_test)
print("Sample Predictions:", predictions)
print("Pipeline Score:", model_pipeline.score(X_test, y_test))
# You can inspect the fitted transformers within the ColumnTransformer
# Access the preprocessor step
ct = model_pipeline.named_steps['preprocessor']
# Access the scaler fitted on numerical columns
scaler = ct.named_transformers_['num']
print("\nFitted Scaler Mean:", scaler.mean_)
# Access the one-hot encoder fitted on categorical columns
encoder = ct.named_transformers_['cat']
print("Fitted Encoder Categories:", encoder.categories_)
In this example:
- make_column_selector(dtype_include=np.number) selects all columns with numerical data types (age, income); StandardScaler is applied to these.
- make_column_selector(dtype_include=object) selects columns with object data types (city), typically strings in Pandas; OneHotEncoder is applied to this column. handle_unknown='ignore' is a useful parameter that prevents errors if the encoder encounters categories in the test set that weren't seen during training; it simply encodes them as all zeros.
- ColumnTransformer applies these transformers to their respective columns in parallel and concatenates the results.
- The transformed output is then passed to the LogisticRegression classifier within the main Pipeline.

The remainder Parameter

What happens to columns that are not selected by any of the transformers specified in the transformers list? This is controlled by the remainder parameter of ColumnTransformer:
- remainder='drop' (default): any columns not explicitly selected by a transformer are dropped from the dataset. Be careful with this default, as you might unintentionally lose features.
- remainder='passthrough': any columns not selected are kept and appended, unchanged, to the output of the specified transformations. This is useful if you have columns that don't require preprocessing or are handled by a later step.
- remainder=<estimator>: you can also specify a transformer (like SimpleImputer or StandardScaler) to be applied to all remaining columns.

Choosing the correct remainder setting is important for ensuring all necessary features are processed correctly and make it to the final estimator.
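The effect of remainder is easy to verify by comparing output widths. A minimal sketch (the toy columns 'a' and 'extra' here are illustrative, not from the dataset above):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# 'extra' is not selected by any transformer
X = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'extra': [10, 20, 30]})

dropped = ColumnTransformer([('num', StandardScaler(), ['a'])], remainder='drop')
kept = ColumnTransformer([('num', StandardScaler(), ['a'])], remainder='passthrough')

print(dropped.fit_transform(X).shape)  # 'extra' is discarded
print(kept.fit_transform(X).shape)     # 'extra' is appended unchanged
```

With remainder='drop' the output has a single column; with remainder='passthrough' the untouched 'extra' column is appended after the scaled output.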
Besides make_column_selector, you can specify columns for each transformer in ColumnTransformer using:
- A list of column names, e.g. ['age', 'income']
- A list of integer positions, e.g. [0, 1]
- A slice, e.g. slice(0, 2)
- A boolean mask, e.g. [True, True, False]

Using make_column_selector is often preferred, as it is less brittle to changes in column order or to added/removed columns than integer indices.
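These selection styles are interchangeable. The following sketch selects the same two numerical columns by name, by position, and by boolean mask, and checks that the results agree:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({
    'age': [25, 45, 30],
    'income': [50000, 80000, 60000],
    'city': ['New York', 'London', 'Paris'],
})

by_name = ColumnTransformer([('num', StandardScaler(), ['age', 'income'])])
by_index = ColumnTransformer([('num', StandardScaler(), [0, 1])])
by_mask = ColumnTransformer([('num', StandardScaler(), [True, True, False])])

out_name = by_name.fit_transform(X)
out_index = by_index.fit_transform(X)
out_mask = by_mask.fit_transform(X)

# All three selections produce identical scaled output
assert np.allclose(out_name, out_index) and np.allclose(out_name, out_mask)
```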
We can visualize the structure created by combining ColumnTransformer and Pipeline:
Figure: input data is split by the ColumnTransformer ('preprocessor'). Numerical columns go to StandardScaler ('num'), categorical columns go to OneHotEncoder ('cat'). Their outputs are concatenated and then passed to the LogisticRegression classifier ('classifier').
Using ColumnTransformer within a Pipeline works with GridSearchCV. You can tune hyperparameters of both the transformers within ColumnTransformer and the final estimator. Parameter names are constructed using the step names separated by double underscores (__).
For example, to tune the handle_unknown parameter of the OneHotEncoder (named 'cat') inside the ColumnTransformer (named 'preprocessor'), the parameter grid key would be 'preprocessor__cat__handle_unknown'. To tune the C parameter of the LogisticRegression (named 'classifier'), the key is 'classifier__C'.
# Example parameter grid for GridSearchCV with the pipeline above
param_grid = {
    'preprocessor__num__with_mean': [True, False],             # Parameter for StandardScaler
    'preprocessor__cat__handle_unknown': ['ignore', 'error'],  # Parameter for OneHotEncoder
    'classifier__C': [0.1, 1.0, 10.0]                          # Parameter for LogisticRegression
}
# GridSearchCV would then be applied to 'model_pipeline' using this 'param_grid'
# from sklearn.model_selection import GridSearchCV
# grid_search = GridSearchCV(model_pipeline, param_grid, cv=5)
# grid_search.fit(X_train, y_train)
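As a self-contained, runnable sketch of the same idea (re-using the toy data from earlier; cv is reduced to 2 because the dataset is tiny, and the grid is trimmed to parameters that cannot fail on unseen categories):

```python
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame({
    'age': [25, 45, 30, 55, 40, 28],
    'income': [50000, 80000, 60000, 120000, 75000, 52000],
    'city': ['New York', 'London', 'Paris', 'New York', 'London', 'Paris'],
    'target': [0, 1, 0, 1, 1, 0],
})
X, y = df[['age', 'income', 'city']], df['target']

pipe = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), make_column_selector(dtype_include=np.number)),
        ('cat', OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    ])),
    ('classifier', LogisticRegression()),
])

# Step names joined by double underscores address nested parameters
param_grid = {
    'preprocessor__num__with_mean': [True, False],
    'classifier__C': [0.1, 1.0, 10.0],
}

grid = GridSearchCV(pipe, param_grid, cv=2)  # cv=2: only 6 samples
grid.fit(X, y)
print(grid.best_params_)
```

grid.best_params_ reports the winning combination using the same double-underscore keys, confirming that hyperparameters deep inside the ColumnTransformer are tuned alongside the classifier's.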
ColumnTransformer is a fundamental tool for building practical machine learning pipelines in Scikit-learn, especially when dealing with the heterogeneous datasets typical of real-world problems. It allows for clean, modular, and correct application of preprocessing steps tailored to specific feature types, all while integrating seamlessly with Scikit-learn's ecosystem for model evaluation and selection.