Let's solidify the concepts covered in this chapter by building a practical data preparation pipeline. We'll take a sample dataset, apply common preprocessing steps like imputation, scaling, and encoding, and bundle them into a reusable Scikit-learn Pipeline. This approach ensures consistency and prevents data leakage between your training and testing sets.
Imagine we have a dataset containing information about houses, with the goal of predicting their prices. Our dataset might include features like size (square feet), number of bedrooms, location (categorical), and age. It likely also contains missing values.
Let's create a small, representative Pandas DataFrame to work with:
import pandas as pd
import numpy as np
# Sample housing data
data = {
'SquareFeet': [1500, 2100, 1800, np.nan, 2500, 1200, 1900],
'Bedrooms': [3, 4, 3, 3, 5, 2, np.nan],
'Location': ['Urban', 'Suburban', 'Urban', 'Rural', 'Suburban', 'Rural', 'Urban'],
'Age': [5, 10, 8, 25, 1, 15, 7],
'Price': [300000, 450000, 350000, 180000, 550000, 200000, 380000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Separate features (X) and target (y)
X = df.drop('Price', axis=1)
y = df['Price']
Our X DataFrame looks like this:
SquareFeet Bedrooms Location Age
0 1500.0 3.0 Urban 5
1 2100.0 4.0 Suburban 10
2 1800.0 3.0 Urban 8
3 NaN 3.0 Rural 25
4 2500.0 5.0 Suburban 1
5 1200.0 2.0 Rural 15
6 1900.0 NaN Urban 7
Notice the missing values (NaN) in SquareFeet and Bedrooms, and the categorical feature Location.
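As a quick check, you can count the missing values per column with Pandas before deciding how to handle them; this small snippet assumes the X DataFrame defined above.
# Count missing values in each feature column
print("\nMissing values per column:")
print(X.isnull().sum())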
Before any preprocessing, we must split our data into training and testing sets. This prevents information from the test set inadvertently influencing the preprocessing steps learned from the training set (data leakage).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42 # Use random_state for reproducibility
)
print("\nTraining Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
We need different strategies for numerical and categorical columns:
- Numerical features (SquareFeet, Bedrooms, Age): impute missing values (e.g., with the median) and then scale them (e.g., with StandardScaler).
- Categorical features (Location): impute missing values (e.g., with the most frequent value) and then apply one-hot encoding.
Scikit-learn's Pipeline and ColumnTransformer are perfect tools for this. Pipeline chains steps sequentially, while ColumnTransformer applies different transformations to different columns.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Identify numerical and categorical columns
numerical_features = ['SquareFeet', 'Bedrooms', 'Age']
categorical_features = ['Location']
# Create preprocessing pipelines for numerical features
# Step 1: Impute missing values with the median
# Step 2: Scale numerical features
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Create preprocessing pipeline for categorical features
# Step 1: Impute missing values with the most frequent value (if any)
# Step 2: Apply One-Hot Encoding, ignore unknown categories encountered during transform
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
Now, we use ColumnTransformer to apply the correct pipeline to the appropriate columns.
# Create a preprocessor object using ColumnTransformer
# Apply numerical_transformer to numerical_features
# Apply categorical_transformer to categorical_features
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
],
remainder='passthrough' # Keep other columns (if any) - not needed here but good practice
)
print("\nPreprocessor Structure:")
print(preprocessor)
This preprocessor object encapsulates all our data preparation logic. It knows which columns are numerical, which are categorical, and the sequence of steps (impute, then scale/encode) to apply to each group.
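If you want to run the preprocessor on its own, without any model attached, you can fit and transform the training features directly; a minimal sketch, assuming the preprocessor and X_train defined earlier in this section.
# Fit the preprocessor on the training features and transform them in one step
X_train_prepared = preprocessor.fit_transform(X_train)
print("Prepared training features shape:", X_train_prepared.shape)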
While this chapter focuses on data preparation, you often include the actual machine learning model within the same pipeline. This ensures the entire workflow, from raw data to prediction, is streamlined. Let's add a placeholder model (e.g., Linear Regression) to illustrate.
from sklearn.linear_model import LinearRegression
# Create the full pipeline including the preprocessor and a model
# Step 1: Apply the preprocessor
# Step 2: Train a Linear Regression model
full_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('regressor', LinearRegression()) # Example model
])
print("\nFull Pipeline Structure:")
print(full_pipeline)
Now, the magic happens. We can fit the entire pipeline on the training data. Scikit-learn handles applying each step correctly:
1. The SimpleImputer (median) inside numerical_transformer is fitted only on the training data's numerical columns to learn the medians.
2. The StandardScaler is fitted only on the imputed training data's numerical columns to learn the mean and standard deviation.
3. The SimpleImputer (most frequent) inside categorical_transformer is fitted only on the training data's categorical columns.
4. The OneHotEncoder is fitted only on the imputed training data's categorical columns to learn the unique categories.
5. The LinearRegression model is trained using the fully transformed training data.
# Fit the pipeline to the training data
# This applies imputation, scaling, encoding, and model training
full_pipeline.fit(X_train, y_train)
print("\nPipeline fitted successfully on training data.")
# You can now use the fitted pipeline to transform the test data
# and make predictions
X_test_processed = full_pipeline.named_steps['preprocessor'].transform(X_test)
print("\nShape of processed test features:", X_test_processed.shape)
print("Sample processed test data (first row):\n", X_test_processed[0])
# Make predictions on the test set
predictions = full_pipeline.predict(X_test)
print("\nPredictions on test data:", predictions)
Notice that when we call transform (or predict) on the test data (X_test), the pipeline uses the parameters (medians, means, standard deviations, categories) learned only from the training data (X_train). This is exactly what we want to prevent data leakage and get a reliable estimate of model performance.
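You can confirm this by inspecting the fitted transformers inside the pipeline; a short sketch, assuming the fitted full_pipeline from above and using standard fitted attributes such as statistics_, mean_, and categories_.
# Drill into the fitted ColumnTransformer to see the parameters learned from X_train
fitted_preprocessor = full_pipeline.named_steps['preprocessor']
num_pipeline = fitted_preprocessor.named_transformers_['num']
print("Learned medians:", num_pipeline.named_steps['imputer'].statistics_)
print("Learned means:", num_pipeline.named_steps['scaler'].mean_)
cat_pipeline = fitted_preprocessor.named_transformers_['cat']
print("Learned categories:", cat_pipeline.named_steps['onehot'].categories_)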
The output X_test_processed is a NumPy array where missing values have been imputed, numerical features scaled, and categorical features one-hot encoded. It's ready to be fed directly into a machine learning model.
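If you want to know which column each position in that array corresponds to, recent Scikit-learn versions (1.0 and later) can report the generated feature names; a small sketch, assuming the pipeline fitted above.
# Recover the output column names and wrap the processed array in a DataFrame
feature_names = full_pipeline.named_steps['preprocessor'].get_feature_names_out()
X_test_df = pd.DataFrame(X_test_processed, columns=feature_names, index=X_test.index)
print(X_test_df.head())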
We can visualize the structure of our preprocessor or the full_pipeline using Scikit-learn's built-in diagram display.
from sklearn import set_config
# Set display='diagram' to enable graphical representation
set_config(display='diagram')
# Display the preprocessor pipeline
print("\nPreprocessor Diagram:")
display(preprocessor) # In a Jupyter environment, this shows the diagram
# Display the full pipeline
print("\nFull Pipeline Diagram:")
display(full_pipeline) # Shows preprocessor + regressor
# Set display back to default if needed
# set_config(display='text')
If you are in an environment that supports rich display (like Jupyter), you will see interactive diagrams of the pipeline structure.
Figure: Structure of the data preparation steps combined using ColumnTransformer and optionally included within a full modeling Pipeline. Numerical features go through imputation and scaling, while categorical features undergo imputation and one-hot encoding.
This practice exercise demonstrated how to construct a robust data preparation workflow using Scikit-learn's Pipeline and ColumnTransformer. By defining separate steps for different data types and encapsulating them within a single object, you create code that is easier to manage and reuse, and less prone to errors like data leakage. This is a fundamental skill for any machine learning practitioner.
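Because the whole workflow lives in a single object, reusing it later is straightforward; here is a minimal sketch of saving and reloading the fitted pipeline with joblib (the filename is just an example).
import joblib
# Persist the fitted pipeline (preprocessing + model) to disk
joblib.dump(full_pipeline, 'house_price_pipeline.joblib')
# Later, reload it and predict on new raw data in the original column format
loaded_pipeline = joblib.load('house_price_pipeline.joblib')
print(loaded_pipeline.predict(X_test))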