In this practical, we build a complete machine learning workflow using `Pipeline` and `ColumnTransformer`. Feature preprocessing (imputation, scaling, encoding) and model training are handled within a single pipeline object, and `GridSearchCV` is then used to find the best hyperparameters for this combined workflow. This approach shows how to create cleaner, more reliable machine learning systems.

We'll use a dataset that requires different preprocessing steps for different types of features, making it ideal for demonstrating `ColumnTransformer`. Let's work with a simplified version of the popular Titanic dataset.

## Setting Up the Scenario

Imagine we want to predict passenger survival on the Titanic based on features like age, passenger class, sex, and embarkation point. This dataset contains numerical features (`Age`, `Fare`), categorical features (`Pclass`, `Sex`, `Embarked`), and often includes missing values, requiring careful preprocessing.

First, let's prepare a sample dataset and perform the initial train-test split. For simplicity, we'll create a small Pandas DataFrame. In a real project, you would load this data from a file or database.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Sample Data (Simplified Titanic)
data = {
    'Pclass': [3, 1, 3, 1, 2, 3, 1, 3, 2, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male', 'male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 32.0],
    'Fare': [7.25, 71.28, 7.92, 53.1, 8.05, 8.45, 51.86, 21.07, 13.00, 7.89],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'C', 'S'],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 0]  # Target variable
}
df = pd.DataFrame(data)

# Separate features (X) and target (y)
X = df.drop('Survived', axis=1)
y = df['Survived']

# Identify feature types
numerical_features = ['Age', 'Fare']
categorical_features = ['Pclass', 'Sex', 'Embarked']  # Pclass treated as categorical here

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
```

## Defining Preprocessing Steps

We need different preprocessing for numerical and categorical features:

**Numerical features (`Age`, `Fare`):**
1. Impute missing values (e.g., using the median).
2. Scale the features (e.g., using `StandardScaler`).

**Categorical features (`Pclass`, `Sex`, `Embarked`):**
1. Impute missing values (e.g., using the most frequent value).
2. Apply one-hot encoding to convert them into numerical format.

We can define small pipelines for each of these steps using `Pipeline`.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Preprocessing pipeline for numerical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # Ignore categories in test set not seen in training
])
```
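Before combining these pipelines, it can help to sanity-check one of them on its own. The short sketch below is optional and not part of the main workflow: it fits the numerical pipeline on the training features only and prints the imputed, standardized `Age`/`Fare` values (the variable name `age_fare_scaled` is just illustrative).

```python
# Optional check: apply the numerical pipeline on its own (illustrative only)
age_fare_scaled = numerical_transformer.fit_transform(X_train[numerical_features])
print(age_fare_scaled[:5])  # imputed and standardized Age/Fare values
```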
## Combining Preprocessing with ColumnTransformer

Now, we use `ColumnTransformer` to apply the correct transformer pipeline to the corresponding columns.

```python
from sklearn.compose import ColumnTransformer

# Create the preprocessor object using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough'  # Keep other columns (if any) - not strictly needed here
)
```

The `ColumnTransformer` takes a list of tuples. Each tuple contains:

1. A name for the transformer step (e.g., `'num'`, `'cat'`).
2. The transformer object (our `numerical_transformer` and `categorical_transformer` pipelines).
3. A list of column names or indices to apply the transformer to.

## Building the Full Pipeline

Let's combine the preprocessor with a classifier, for example `LogisticRegression`, into a final `Pipeline`.

```python
from sklearn.linear_model import LogisticRegression

# Create the full pipeline including preprocessing and a classifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42))  # Using liblinear for simplicity
])

# Display the pipeline structure (optional)
from sklearn import set_config
set_config(display='diagram')  # Activate diagram display if available
print(model_pipeline)
```

This `model_pipeline` object now encapsulates our entire workflow: imputing, scaling, encoding, and classifying.

## Tuning Hyperparameters with GridSearchCV

The real power comes when tuning hyperparameters across the entire pipeline. We can tune parameters of the classifier and potentially parameters within the preprocessing steps. Notice how we specify parameters using the step names followed by double underscores (`__`).

Let's define a parameter grid to search over:

- The imputation strategy for numerical features (`preprocessor__num__imputer__strategy`).
- The `C` parameter (inverse of regularization strength) for `LogisticRegression` (`classifier__C`).

```python
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search
# Note the naming convention: step_name__parameter_name
# For nested pipelines: outerstep_name__innerstep_name__parameter_name
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10.0],
    # We could also tune categorical imputer strategy or OneHotEncoder params if needed
    # 'preprocessor__cat__imputer__strategy': ['most_frequent', 'constant'],
    # 'preprocessor__cat__onehot__handle_unknown': ['ignore', 'error']  # Usually 'ignore' is safer
}

# Create the GridSearchCV object
grid_search = GridSearchCV(model_pipeline, param_grid, cv=3, scoring='accuracy')  # Using 3-fold CV for small dataset

# Fit GridSearchCV on the training data
# This will apply preprocessing *within* each CV fold correctly
grid_search.fit(X_train, y_train)

print("\nGrid Search Results:")
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best cross-validation accuracy score: {grid_search.best_score_:.4f}")
```

When `grid_search.fit()` runs, it performs cross-validation. For each split within the cross-validation, it fits the entire pipeline (including the specified preprocessor parameters for that iteration) on the training fold and evaluates it on the validation fold. This ensures that preprocessing steps like imputation and scaling are learned only from the training portion of each fold, preventing data leakage.
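If you want to see how each parameter combination performed, the fitted `grid_search` object exposes a `cv_results_` dictionary. Here is a minimal sketch of inspecting it, reusing the `pandas` import from earlier; the column selection is just one convenient view.

```python
# Optional: inspect mean CV accuracy for every parameter combination
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score'))
```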
## Evaluating the Best Pipeline

Finally, let's evaluate the performance of the best pipeline found by `GridSearchCV` on our held-out test set. `GridSearchCV` automatically refits the best model on the entire training set, so `grid_search.best_estimator_` is ready to use.

```python
from sklearn.metrics import accuracy_score, classification_report

# Get the best pipeline found by GridSearchCV
best_pipeline = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_pipeline.predict(X_test)

# Evaluate the best pipeline
test_accuracy = accuracy_score(y_test, y_pred)
print(f"\nTest set accuracy of the best pipeline: {test_accuracy:.4f}")

print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred))
```

This practical demonstrates how to construct a comprehensive pipeline using `Pipeline` and `ColumnTransformer`, and how to tune its hyperparameters efficiently with `GridSearchCV`. This approach leads to more organized code, reduces the risk of common errors like data leakage during preprocessing, and simplifies the deployment of your machine learning models.
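Because `best_pipeline` bundles imputation, scaling, encoding, and the classifier in a single fitted object, persisting it for later use is straightforward. Below is a minimal sketch using `joblib`; the file name is just an example.

```python
import joblib

# Save the entire fitted pipeline (preprocessing + classifier) to disk
joblib.dump(best_pipeline, 'titanic_pipeline.joblib')

# Later (e.g., in a serving script), load it and predict on raw feature rows
loaded_pipeline = joblib.load('titanic_pipeline.joblib')
print(loaded_pipeline.predict(X_test.head(3)))
```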