Data preprocessing involves scaling numerical features, encoding categorical ones, and handling missing values. A concrete example will demonstrate how to apply these techniques to a small, representative dataset using Scikit-learn's transformers.

Imagine we have a dataset containing information about employees, including their age, salary, department, and experience level. Our goal is to prepare this data for a potential machine learning model.

First, let's create this sample dataset using Pandas:

```python
import pandas as pd
import numpy as np

# Create sample data
data = {
    'Age': [25, 45, 30, 55, 22, 38, np.nan, 41, 29, 50],
    'Salary': [50000, 80000, 62000, 110000, 45000, 75000, 58000, np.nan, 60000, 98000],
    'Department': ['HR', 'Engineering', 'Sales', 'Engineering', 'Sales', 'HR', 'Sales', 'Engineering', 'HR', 'Sales'],
    'Experience_Level': ['Entry', 'Senior', 'Mid', 'Senior', 'Entry', 'Mid', 'Mid', 'Senior', 'Entry', 'Senior']
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
```

Our DataFrame looks like this:

```
    Age    Salary   Department Experience_Level
0  25.0   50000.0           HR            Entry
1  45.0   80000.0  Engineering           Senior
2  30.0   62000.0        Sales              Mid
3  55.0  110000.0  Engineering           Senior
4  22.0   45000.0        Sales            Entry
5  38.0   75000.0           HR              Mid
6   NaN   58000.0        Sales              Mid
7  41.0       NaN  Engineering           Senior
8  29.0   60000.0           HR            Entry
9  50.0   98000.0        Sales           Senior
```

We can immediately spot several areas needing preprocessing:

- **Missing Values:** The 'Age' and 'Salary' columns have `NaN` entries.
- **Numerical Features:** 'Age' and 'Salary' are numerical and might benefit from scaling, especially if we plan to use algorithms sensitive to feature magnitudes (like KNN or regularized regression).
- **Categorical Features:** 'Department' is nominal (no inherent order), and 'Experience_Level' is ordinal (has a clear order: Entry < Mid < Senior). Both need numerical encoding.

## Handling Missing Values

Let's start by addressing the missing values using imputation. We'll use `SimpleImputer` to fill the missing 'Age' with the mean age and the missing 'Salary' with the median salary (the median is often preferred for potentially skewed data like salaries).

```python
from sklearn.impute import SimpleImputer

# Impute 'Age' with mean
mean_imputer = SimpleImputer(strategy='mean')
# Note: Imputers expect 2D array-like input, hence df[['Age']]
df['Age'] = mean_imputer.fit_transform(df[['Age']])

# Impute 'Salary' with median
median_imputer = SimpleImputer(strategy='median')
df['Salary'] = median_imputer.fit_transform(df[['Salary']])

print("\nDataFrame after Imputation:")
print(df)
```

The output shows the `NaN` values replaced (note that the median of the nine observed salaries is 62000):

```
DataFrame after Imputation:
         Age    Salary   Department Experience_Level
0  25.000000   50000.0           HR            Entry
1  45.000000   80000.0  Engineering           Senior
2  30.000000   62000.0        Sales              Mid
3  55.000000  110000.0  Engineering           Senior
4  22.000000   45000.0        Sales            Entry
5  38.000000   75000.0           HR              Mid
6  37.222222   58000.0        Sales              Mid   # Imputed Age (mean)
7  41.000000   62000.0  Engineering           Senior   # Imputed Salary (median)
8  29.000000   60000.0           HR            Entry
9  50.000000   98000.0        Sales           Senior
```

Notice how `fit_transform` was used: the `fit` step calculates the statistic (mean or median) from the non-missing values in the column, and the `transform` step fills the missing entries using that calculated statistic.
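To make this distinction concrete, here is a small sketch, separate from the exercise above, that fits an imputer on a hypothetical "training" subset of the ages and then transforms an unseen "test" subset:

```python
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# Hypothetical split: treat the first eight ages as "training" data
train_ages = pd.DataFrame({'Age': [25, 45, 30, 55, 22, 38, np.nan, 41]})
test_ages = pd.DataFrame({'Age': [np.nan, 50]})

imputer = SimpleImputer(strategy='mean')
imputer.fit(train_ages)       # fit learns the mean of the non-missing training ages
print(imputer.statistics_)    # -> [36.57142857], the stored statistic

# transform fills the test NaN with the *training* mean, not a test statistic
print(imputer.transform(test_ages))  # -> [[36.57142857], [50.]]
```

Fitting on training data only and then reusing the learned statistic everywhere is exactly how data leakage is avoided in a real workflow.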
## Scaling Numerical Features

Now, let's scale the 'Age' and 'Salary' columns. We'll use `StandardScaler`, which transforms data to have zero mean and unit variance ($z = (x - \mu) / \sigma$).

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
numerical_cols = ['Age', 'Salary']

# Fit and transform the numerical columns
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

print("\nDataFrame after Scaling Numerical Features:")
print(df[numerical_cols].head())  # Show only scaled columns for brevity
```

The output will show the scaled values:

```
DataFrame after Scaling Numerical Features:
        Age    Salary
0 -1.189624 -1.011961
1  0.757033  0.505981
2 -0.702959 -0.404784
3  1.730362  2.023922
4 -1.481622 -1.264952
```

These values now represent standard deviations from the mean for each feature. For instance, the first employee's age is about 1.19 standard deviations below the mean age after imputation.

Let's visualize the 'Salary' distribution before and after scaling:

```python
# Recover the imputed, original-scale salaries for the comparison plot
# Note: transform expects 2D input, hence pd.DataFrame(data)[['Salary']]
original_salary = median_imputer.transform(pd.DataFrame(data)[['Salary']])

# Create plot data
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Histogram(x=original_salary.flatten(), name='Original Salary',
                           marker_color='#1f77b4'))
fig.add_trace(go.Histogram(x=df['Salary'], name='Scaled Salary',
                           marker_color='#ff7f0e'))

# Update layout for clarity
fig.update_layout(
    title_text='Salary Distribution Before and After Standardization',
    xaxis_title_text='Value',
    yaxis_title_text='Count',
    barmode='overlay',  # Overlay histograms
    legend_title_text='Feature'
)
fig.update_traces(opacity=0.7)  # Make overlapping bars visible

# Convert to JSON string for display
import plotly.io as pio
plotly_json = pio.to_json(fig)
print(f"\n```plotly\n{plotly_json}\n```")
```

```plotly
{"layout": {"title": {"text": "Salary Distribution Before and After Standardization"}, "xaxis": {"title": {"text": "Value"}}, "yaxis": {"title": {"text": "Count"}}, "barmode": "overlay", "legend": {"title": {"text": "Feature"}}}, "data": [{"type": "histogram", "x": [50000.0, 80000.0, 62000.0, 110000.0, 45000.0, 75000.0, 58000.0, 62000.0, 60000.0, 98000.0], "name": "Original Salary", "marker": {"color": "#1f77b4"}, "opacity": 0.7}, {"type": "histogram", "x": [-1.011961, 0.505981, -0.404784, 2.023922, -1.264952, 0.252990, -0.607177, -0.404784, -0.505981, 1.416746], "name": "Scaled Salary", "marker": {"color": "#ff7f0e"}, "opacity": 0.7}]}
```

The histograms show the distribution of salaries. The blue bars represent the original salaries (after imputation), and the orange bars show the salaries after standardization. Notice how the shape of the distribution is preserved, but the scale on the x-axis changes significantly, centering around zero for the scaled version.
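Before moving on to the categorical columns, a quick sanity check (reusing `df` and `numerical_cols` from above) confirms the standardization behaved as expected. Note that `StandardScaler` divides by the population standard deviation, so we pass `ddof=0` to Pandas to match:

```python
# Standardized columns should have mean ~0 and population std ~1.
# Pandas' .std() defaults to the sample std (ddof=1), so set ddof=0
# to match the population std that StandardScaler uses internally.
print(df[numerical_cols].mean().round(6))       # ~0.0 for both columns
print(df[numerical_cols].std(ddof=0).round(6))  # ~1.0 for both columns
```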
## Encoding Categorical Features

Next, we handle the categorical columns.

### Nominal Feature: Department

'Department' has no inherent order, so One-Hot Encoding is appropriate. It creates a new binary column for each category.

```python
from sklearn.preprocessing import OneHotEncoder

# Select the categorical column
cat_col = ['Department']
one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # dense array for readability

# Fit and transform
one_hot_encoded = one_hot_encoder.fit_transform(df[cat_col])

# Create a new DataFrame with meaningful column names
one_hot_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out(cat_col))

print("\nOne-Hot Encoded 'Department' (first 5 rows):")
print(one_hot_df.head())

# Drop the original 'Department' column and join the new ones
df = df.drop(cat_col, axis=1)
df = pd.concat([df.reset_index(drop=True), one_hot_df.reset_index(drop=True)], axis=1)
```

The output shows the new binary columns:

```
One-Hot Encoded 'Department' (first 5 rows):
   Department_Engineering  Department_HR  Department_Sales
0                     0.0            1.0               0.0
1                     1.0            0.0               0.0
2                     0.0            0.0               1.0
3                     1.0            0.0               0.0
4                     0.0            0.0               1.0
```

### Ordinal Feature: Experience_Level

'Experience_Level' has a clear order ('Entry' < 'Mid' < 'Senior'), so we can use `OrdinalEncoder` and specify that order explicitly.

```python
from sklearn.preprocessing import OrdinalEncoder

# Define the desired order
exp_order = ['Entry', 'Mid', 'Senior']
ordinal_encoder = OrdinalEncoder(categories=[exp_order])  # Pass the order inside a list

# Fit and transform
df['Experience_Level'] = ordinal_encoder.fit_transform(df[['Experience_Level']])

print("\nDataFrame after Ordinal Encoding 'Experience_Level':")
print(df.head())  # All features are now numerical
```

The final DataFrame after all preprocessing steps looks like this (showing the first 5 rows):

```
DataFrame after Ordinal Encoding 'Experience_Level':
        Age    Salary  Experience_Level  Department_Engineering  Department_HR  Department_Sales
0 -1.189624 -1.011961               0.0                     0.0            1.0               0.0
1  0.757033  0.505981               2.0                     1.0            0.0               0.0
2 -0.702959 -0.404784               1.0                     0.0            0.0               1.0
3  1.730362  2.023922               2.0                     1.0            0.0               0.0
4 -1.481622 -1.264952               0.0                     0.0            0.0               1.0
```

Our data is now fully numerical, scaled, and free of missing values. It's in a much better format for input into many Scikit-learn machine learning models.

This hands-on exercise demonstrates how to apply individual preprocessing steps using Scikit-learn's transformers. Keep in mind the importance of fitting transformers only on your training data in a real ML workflow to avoid data leakage, and then using the same fitted transformers to transform both your training and testing data. In Chapter 6, you'll learn how to use Scikit-learn Pipelines to chain these steps together elegantly and safely.
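As a small preview of that approach, here is one possible sketch of the same steps bundled into a single `ColumnTransformer`; the `X_train` and `X_test` names are hypothetical placeholders for a real train/test split:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# The same per-column steps as above, expressed as one composite transformer
preprocessor = ColumnTransformer(transformers=[
    ('age', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', StandardScaler())]), ['Age']),
    ('salary', Pipeline([('impute', SimpleImputer(strategy='median')),
                         ('scale', StandardScaler())]), ['Salary']),
    ('dept', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), ['Department']),
    ('exp', OrdinalEncoder(categories=[['Entry', 'Mid', 'Senior']]), ['Experience_Level']),
])

# Fit on training data only, then reuse the fitted object (hypothetical split):
# X_train_prepared = preprocessor.fit_transform(X_train)
# X_test_prepared = preprocessor.transform(X_test)
```

Because the imputers, scaler, and encoders live inside one object, a single `fit_transform` on the training set learns every statistic at once, and `transform` applies them consistently to new data.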