In this section, we build a practical data preparation pipeline. The pipeline processes a sample dataset, applying common preprocessing steps such as imputation, scaling, and encoding, and then bundles those steps into a reusable Scikit-learn `Pipeline`. This approach ensures consistency and prevents data leakage between the training and testing sets.

## Setting the Stage: The Dataset and Goal

Imagine we have a dataset containing information about houses, with the goal of predicting their prices. Our dataset might include features like size (square feet), number of bedrooms, location (categorical), and age. It likely also contains missing values.

Let's create a small, representative Pandas DataFrame to work with:

```python
import pandas as pd
import numpy as np

# Sample housing data
data = {
    'SquareFeet': [1500, 2100, 1800, np.nan, 2500, 1200, 1900],
    'Bedrooms': [3, 4, 3, 3, 5, 2, np.nan],
    'Location': ['Urban', 'Suburban', 'Urban', 'Rural', 'Suburban', 'Rural', 'Urban'],
    'Age': [5, 10, 8, 25, 1, 15, 7],
    'Price': [300000, 450000, 350000, 180000, 550000, 200000, 380000]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Separate features (X) and target (y)
X = df.drop('Price', axis=1)
y = df['Price']
```

Our `X` DataFrame looks like this:

```
   SquareFeet  Bedrooms  Location  Age
0      1500.0       3.0     Urban    5
1      2100.0       4.0  Suburban   10
2      1800.0       3.0     Urban    8
3         NaN       3.0     Rural   25
4      2500.0       5.0  Suburban    1
5      1200.0       2.0     Rural   15
6      1900.0       NaN     Urban    7
```

Notice the missing values (`NaN`) in `SquareFeet` and `Bedrooms`, and the categorical feature `Location`.

## Step 1: Splitting the Data

Before any preprocessing, we must split our data into training and testing sets. This prevents information from the test set from inadvertently influencing the preprocessing steps learned from the training set (data leakage).

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42  # Use random_state for reproducibility
)

print("\nTraining Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
```

## Step 2: Defining Preprocessing Steps for Different Column Types

We need different strategies for numerical and categorical columns:

- **Numerical columns** (`SquareFeet`, `Bedrooms`, `Age`): impute missing values (e.g., using the median), then scale the features (e.g., using `StandardScaler`).
- **Categorical columns** (`Location`): impute missing values (if any, e.g., using the most frequent value), then apply one-hot encoding.

Scikit-learn's `Pipeline` and `ColumnTransformer` are perfect tools for this.
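Before wiring up the transformers, it can help to confirm which columns in the training split actually contain missing values, and to derive the column lists from the dtypes rather than typing them out. Here is a minimal sketch of that check (the `select_dtypes` approach is an alternative to the hard-coded lists used in the next step):

```python
# Count missing values per column, using only the training split
print(X_train.isna().sum())

# Derive the column lists from dtypes instead of hard-coding them
numerical_features = X_train.select_dtypes(include='number').columns.tolist()
categorical_features = X_train.select_dtypes(include='object').columns.tolist()
print("Numerical:", numerical_features)
print("Categorical:", categorical_features)
```

Both approaches produce the same lists on this small dataset; deriving them from dtypes simply scales better to wider tables.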
`Pipeline` chains steps sequentially, while `ColumnTransformer` applies different transformations to different columns.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Identify numerical and categorical columns
numerical_features = ['SquareFeet', 'Bedrooms', 'Age']
categorical_features = ['Location']

# Create the preprocessing pipeline for numerical features
# Step 1: Impute missing values with the median
# Step 2: Scale numerical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Create the preprocessing pipeline for categorical features
# Step 1: Impute missing values with the most frequent value (if any)
# Step 2: Apply one-hot encoding, ignoring unknown categories encountered during transform
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
```

## Step 3: Combining Preprocessing Steps with ColumnTransformer

Now we use `ColumnTransformer` to apply the correct pipeline to the appropriate columns.

```python
# Create a preprocessor object using ColumnTransformer
# Apply numerical_transformer to numerical_features
# Apply categorical_transformer to categorical_features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough'  # Keep other columns (if any) - not needed here but good practice
)

print("\nPreprocessor Structure:")
print(preprocessor)
```

This `preprocessor` object encapsulates all our data preparation logic. It knows which columns are numerical, which are categorical, and the sequence of steps (impute, then scale/encode) to apply to each group.

## Step 4: Building the Full Pipeline (Optional: Including a Model)

While this chapter focuses on data preparation, you will often include the actual machine learning model within the same pipeline. This ensures the entire workflow, from raw data to prediction, is streamlined. Let's add a placeholder model (e.g., linear regression) to illustrate.

```python
from sklearn.linear_model import LinearRegression

# Create the full pipeline including the preprocessor and a model
# Step 1: Apply the preprocessor
# Step 2: Train a Linear Regression model
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())  # Example model
])

print("\nFull Pipeline Structure:")
print(full_pipeline)
```
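Because the model is just another named step, it can be swapped out without touching any of the preprocessing logic. A quick illustrative sketch (`Ridge` here is an arbitrary stand-in, not part of the original workflow, and we restore `LinearRegression` afterwards so the following steps match the text):

```python
from sklearn.linear_model import LinearRegression, Ridge

# Replace the placeholder model; the preprocessing steps are untouched
full_pipeline.set_params(regressor=Ridge(alpha=1.0))  # Ridge is an arbitrary example
print(full_pipeline.named_steps['regressor'])

# Restore the original example model before fitting in the next step
full_pipeline.set_params(regressor=LinearRegression())
```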
## Step 5: Applying the Pipeline to the Data

Now the magic happens. We can fit the entire pipeline on the training data, and Scikit-learn handles applying each step correctly:

1. The `SimpleImputer` (median) inside `numerical_transformer` is fitted only on the training data's numerical columns to learn the medians.
2. The `StandardScaler` is fitted only on the imputed training data's numerical columns to learn the means and standard deviations.
3. The `SimpleImputer` (most frequent) inside `categorical_transformer` is fitted only on the training data's categorical columns.
4. The `OneHotEncoder` is fitted only on the imputed training data's categorical columns to learn the unique categories.
5. Finally, the (optional) `LinearRegression` model is trained using the fully transformed training data.

```python
# Fit the pipeline to the training data
# This applies imputation, scaling, encoding, and model training
full_pipeline.fit(X_train, y_train)
print("\nPipeline fitted successfully on training data.")

# You can now use the fitted pipeline to transform the test data
# and make predictions
X_test_processed = full_pipeline.named_steps['preprocessor'].transform(X_test)
print("\nShape of processed test features:", X_test_processed.shape)
print("Sample processed test data (first row):\n", X_test_processed[0])

# Make predictions on the test set
predictions = full_pipeline.predict(X_test)
print("\nPredictions on test data:", predictions)
```

Notice that when we call `transform` (or `predict`) on the test data (`X_test`), the pipeline uses the parameters (medians, means, standard deviations, categories) learned only from the training data (`X_train`). This is exactly what we want in order to prevent data leakage and get a reliable estimate of model performance.

The output `X_test_processed` is a NumPy array where missing values have been imputed, numerical features scaled, and categorical features one-hot encoded. It's ready to be fed directly into a machine learning model. If you need to interpret its columns, the preprocessor's `get_feature_names_out()` method returns the name of each generated feature.

## Visualizing the Pipeline Structure

Scikit-learn can render its estimators as diagrams, which makes it easy to inspect the structure of our `preprocessor` or the `full_pipeline`.

```python
from sklearn import set_config

# Set display='diagram' to enable the graphical representation
set_config(display='diagram')

# Display the preprocessor pipeline
print("\nPreprocessor Diagram:")
display(preprocessor)  # In a Jupyter environment, this shows the diagram

# Display the full pipeline
print("\nFull Pipeline Diagram:")
display(full_pipeline)  # Shows preprocessor + regressor

# Set display back to default if needed
# set_config(display='text')
```

If you are in an environment that supports rich display (like Jupyter), you will see interactive diagrams.
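If you need that diagram outside a notebook, the same HTML representation can be written to a standalone file. A minimal sketch using `estimator_html_repr` from `sklearn.utils` (the output filename is arbitrary):

```python
from sklearn.utils import estimator_html_repr

# Export the interactive pipeline diagram as a standalone HTML file
# ('full_pipeline.html' is an arbitrary filename for this example)
with open('full_pipeline.html', 'w', encoding='utf-8') as f:
    f.write(estimator_html_repr(full_pipeline))
```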
Here's a static representation using Graphviz syntax:

```dot
digraph G {
    rankdir=LR;
    splines=ortho;
    compound=true;  // Required for lhead edges that target clusters
    node [shape=box, style="rounded,filled", fillcolor="#e9ecef", margin=0.2];
    edge [arrowhead=none, color="#495057"];

    subgraph cluster_num_pipeline {
        label = "Numerical Pipeline";
        style="rounded,filled";
        fillcolor="#a5d8ff";  // Light blue background
        node [fillcolor="#d0bfff"];  // Violet nodes
        num_imputer [label="SimpleImputer\n(strategy='median')"];
        num_scaler [label="StandardScaler"];
        num_imputer -> num_scaler;
    }

    subgraph cluster_cat_pipeline {
        label = "Categorical Pipeline";
        style="rounded,filled";
        fillcolor="#b2f2bb";  // Light green background
        node [fillcolor="#ffec99"];  // Yellow nodes
        cat_imputer [label="SimpleImputer\n(strategy='most_frequent')"];
        cat_encoder [label="OneHotEncoder\n(handle_unknown='ignore')"];
        cat_imputer -> cat_encoder;
    }

    subgraph cluster_col_transformer {
        label = "ColumnTransformer";
        style="rounded";
        node [fillcolor="#ffc9c9"];  // Red node
        col_trans [label="Transformer"];
    }

    subgraph cluster_full_pipeline {
        label = "Full Pipeline";
        style=dashed;
        node [fillcolor="#ffd8a8"];  // Orange node
        model [label="LinearRegression"];
    }

    col_trans -> num_imputer [lhead=cluster_num_pipeline, label="Numerical Features\n['SquareFeet', 'Bedrooms', 'Age']", color="#1c7ed6"];
    col_trans -> cat_imputer [lhead=cluster_cat_pipeline, label="Categorical Features\n['Location']", color="#37b24d"];
    col_trans -> model;  // Connect transformer output to model input
}
```

*Structure of the data preparation steps combined using ColumnTransformer and optionally included within a full modeling Pipeline. Numerical features go through imputation and scaling, while categorical features undergo imputation and one-hot encoding.*

This practice exercise demonstrated how to construct a data preparation workflow using Scikit-learn's `Pipeline` and `ColumnTransformer`. By defining separate steps for different data types and encapsulating them within a single object, you create code that is easier to manage and reuse, and less prone to errors like data leakage. This is a fundamental skill for any machine learning practitioner.
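A closing practical note: because the fitted pipeline is a single object, it can be persisted and reloaded as one artifact, which is a large part of why this packaging pays off. A minimal sketch using joblib (the filename is arbitrary):

```python
import joblib

# Save the fitted pipeline (preprocessing + model) as one artifact
# ('house_price_pipeline.joblib' is an arbitrary filename for this example)
joblib.dump(full_pipeline, 'house_price_pipeline.joblib')

# Later, or in another process: reload and predict on raw, unprocessed rows
loaded_pipeline = joblib.load('house_price_pipeline.joblib')
print(loaded_pipeline.predict(X_test))
```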