Let's put the theory of data preprocessing into practice. In the preceding sections, you learned about the importance of scaling numerical features, encoding categorical ones, and handling missing values. Now, we'll work through a concrete example using Scikit-learn's transformers to apply these techniques to a small, representative dataset.
Imagine we have a dataset containing information about employees, including their age, salary, department, and experience level. Our goal is to prepare this data for a potential machine learning model.
First, let's create this sample dataset using Pandas:
import pandas as pd
import numpy as np
# Create sample data
data = {
'Age': [25, 45, 30, 55, 22, 38, np.nan, 41, 29, 50],
'Salary': [50000, 80000, 62000, 110000, 45000, 75000, 58000, np.nan, 60000, 98000],
'Department': ['HR', 'Engineering', 'Sales', 'Engineering', 'Sales', 'HR', 'Sales', 'Engineering', 'HR', 'Sales'],
'Experience_Level': ['Entry', 'Senior', 'Mid', 'Senior', 'Entry', 'Mid', 'Mid', 'Senior', 'Entry', 'Senior']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Our DataFrame looks like this:
Age Salary Department Experience_Level
0 25.0 50000.0 HR Entry
1 45.0 80000.0 Engineering Senior
2 30.0 62000.0 Sales Mid
3 55.0 110000.0 Engineering Senior
4 22.0 45000.0 Sales Entry
5 38.0 75000.0 HR Mid
6 NaN 58000.0 Sales Mid
7 41.0 NaN Engineering Senior
8 29.0 60000.0 HR Entry
9 50.0 98000.0 Sales Senior
We can immediately spot several areas needing preprocessing: the 'Age' and 'Salary' columns contain NaN entries, the numerical features sit on very different scales, and 'Department' and 'Experience_Level' are text columns that most models can't consume directly.
Let's start by addressing the missing values using imputation. We'll use SimpleImputer to fill the missing 'Age' with the mean age and the missing 'Salary' with the median salary (the median is often preferred for potentially skewed data like salaries).
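To see why the median is the safer choice here, note that a single extreme salary pulls the mean far upward while barely moving the median. Here is a tiny, hypothetical illustration (these numbers are not part of our dataset):
# One extreme value drags the mean upward but barely moves the median
skewed_salaries = np.array([45000, 50000, 58000, 60000, 62000, 1_000_000])
print("Mean:  ", skewed_salaries.mean())      # 212500.0
print("Median:", np.median(skewed_salaries))  # 59000.0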
from sklearn.impute import SimpleImputer
# Impute 'Age' with mean
mean_imputer = SimpleImputer(strategy='mean')
# Note: Imputers expect 2D array-like input, hence df[['Age']]
df['Age'] = mean_imputer.fit_transform(df[['Age']])
# Impute 'Salary' with median
median_imputer = SimpleImputer(strategy='median')
df['Salary'] = median_imputer.fit_transform(df[['Salary']])
print("\nDataFrame after Imputation:")
print(df)
The output shows the NaN values replaced:
DataFrame after Imputation:
Age Salary Department Experience_Level
0 25.000000 50000.0 HR Entry
1 45.000000 80000.0 Engineering Senior
2 30.000000 62000.0 Sales Mid
3 55.000000 110000.0 Engineering Senior
4 22.000000 45000.0 Sales Entry
5 38.000000 75000.0 HR Mid
6 37.222222 58000.0 Sales Mid # Imputed Age (mean)
7 41.000000 62000.0 Engineering Senior # Imputed Salary (median)
8 29.000000 60000.0 HR Entry
9 50.000000 98000.0 Sales Senior
Notice how fit_transform was used. The fit step calculates the statistic (mean or median) from the non-missing values in the column, and the transform step fills the missing entries using that calculated statistic.
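You can see exactly what each imputer learned during fit by inspecting its statistics_ attribute (a quick check on the imputers fitted above):
# Each fitted SimpleImputer stores the value it will substitute for NaN
print("Mean used for 'Age':     ", mean_imputer.statistics_)    # roughly [37.22]
print("Median used for 'Salary':", median_imputer.statistics_)  # [62000.]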
Now, let's scale the 'Age' and 'Salary' columns. We'll use StandardScaler, which transforms data to have zero mean and unit variance: z = (x − μ) / σ.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numerical_cols = ['Age', 'Salary']
# Fit and transform the numerical columns
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
print("\nDataFrame after Scaling Numerical Features:")
print(df[numerical_cols].head()) # Show only scaled columns for brevity
The output will show the scaled values:
DataFrame after Scaling Numerical Features:
Age Salary
0 -1.213538 -1.007992
1 0.770603 0.439769
2 -0.717503 -0.475179
3 1.762674 1.887530
4 -1.510762 -1.248957
These values now represent standard deviations from the mean for each feature. For instance, the first employee's age is about 1.21 standard deviations below the mean age after imputation.
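You can verify this by recomputing the z-score manually from the parameters the scaler learned (a quick sanity check using the scaler fitted above):
# StandardScaler exposes the per-column mean and standard deviation it learned
age_mean, salary_mean = scaler.mean_
age_std, salary_std = scaler.scale_
# Recompute the scaled age for the first employee (original age: 25)
print((25 - age_mean) / age_std)  # should match the first 'Age' value printed above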
Let's visualize the 'Salary' distribution before and after scaling:
# Store the imputed, original-scale salary for the comparison plot
original_salary = median_imputer.transform(pd.DataFrame(data)[['Salary']])  # imputers expect 2D input
# Create plot data
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Histogram(x=original_salary.flatten(), name='Original Salary', marker_color='#1f77b4', opacity=0.75))
fig.add_trace(go.Histogram(x=df['Salary'], name='Scaled Salary', marker_color='#ff7f0e', opacity=0.75))
# Update layout for clarity
fig.update_layout(
title_text='Salary Distribution Before and After Standardization',
xaxis_title_text='Value',
yaxis_title_text='Count',
barmode='overlay', # Overlay histograms
legend_title_text='Feature'
)
fig.update_traces(opacity=0.7) # Adjust transparency
# Convert to JSON string for display
import plotly.io as pio
plotly_json = pio.to_json(fig)
print(f"\n```plotly\n{plotly_json}\n```")
The histograms show the distribution of salaries. The blue bars represent the original salaries (after imputation), and the orange bars show the salaries after standardization. Notice how the shape of the distribution is preserved, but the scale on the x-axis changes significantly, centering around zero for the scaled version.
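If you want to check the "shape is preserved" claim numerically, a shape statistic such as skewness is unchanged by standardization, since standardization is a linear transformation. A brief check (assuming SciPy is available):
from scipy.stats import skew
# Standardization shifts and rescales values but leaves the distribution's shape intact
print("Skewness before scaling:", skew(original_salary.flatten()))
print("Skewness after scaling: ", skew(df['Salary']))  # the two values should match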
Next, we handle the categorical columns.
Nominal Feature: Department
'Department' has no inherent order, so One-Hot Encoding is appropriate. It creates a new binary column for each category.
from sklearn.preprocessing import OneHotEncoder
# Select the categorical column
cat_col = ['Department']
one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore') # dense array for readability
# Fit and transform
one_hot_encoded = one_hot_encoder.fit_transform(df[cat_col])
# Create a new DataFrame with meaningful column names
one_hot_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out(cat_col))
print("\nOne-Hot Encoded 'Department' (first 5 rows):")
print(one_hot_df.head())
# We can drop the original 'Department' column and join the new ones
df = df.drop(cat_col, axis=1)
df = pd.concat([df.reset_index(drop=True), one_hot_df.reset_index(drop=True)], axis=1)
The output shows the new binary columns:
One-Hot Encoded 'Department' (first 5 rows):
Department_Engineering Department_HR Department_Sales
0 0.0 1.0 0.0
1 1.0 0.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
4 0.0 0.0 1.0
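The handle_unknown='ignore' argument we passed becomes relevant when the encoder later meets a category it never saw during fit: instead of raising an error, it encodes that row as all zeros. A small sketch using a hypothetical 'Marketing' department:
# Categories unseen during fit are encoded as all zeros instead of raising an error
new_departments = pd.DataFrame({'Department': ['Marketing', 'HR']})
print(one_hot_encoder.transform(new_departments))
# [[0. 0. 0.]
#  [0. 1. 0.]]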
Ordinal Feature: Experience_Level
'Experience_Level' has an inherent order ('Entry' < 'Mid' < 'Senior'). We can use OrdinalEncoder and specify that order.
from sklearn.preprocessing import OrdinalEncoder
# Define the desired order
exp_order = ['Entry', 'Mid', 'Senior']
ordinal_encoder = OrdinalEncoder(categories=[exp_order]) # Pass order in a list
# Fit and transform
df['Experience_Level'] = ordinal_encoder.fit_transform(df[['Experience_Level']])
print("\nDataFrame after Ordinal Encoding 'Experience_Level':")
print(df.head()) # Show full DataFrame now
The final DataFrame after all preprocessing steps looks like this (showing the first 5 rows):
DataFrame after Ordinal Encoding 'Experience_Level':
Age Salary Experience_Level Department_Engineering Department_HR Department_Sales
0 -1.213538 -1.007992 0.0 0.0 1.0 0.0
1 0.770603 0.439769 2.0 1.0 0.0 0.0
2 -0.717503 -0.475179 1.0 0.0 0.0 1.0
3 1.762674 1.887530 2.0 1.0 0.0 0.0
4 -1.510762 -1.248957 0.0 0.0 0.0 1.0
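The encoder applied the mapping we specified: 'Entry' → 0.0, 'Mid' → 1.0, 'Senior' → 2.0. You can read that mapping back from the fitted encoder:
# categories_ lists the categories in the order that maps to 0, 1, 2, ...
print(ordinal_encoder.categories_)  # [array(['Entry', 'Mid', 'Senior'], dtype=object)]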
Our data is now fully numerical, scaled, and has no missing values. It's in a much better format for input into many Scikit-learn machine learning models.
This hands-on exercise demonstrates how to apply individual preprocessing steps using Scikit-learn's transformers. Keep in mind the importance of fitting transformers only on your training data in a real ML workflow to avoid data leakage, and then using the same fitted transformer to transform both your training and testing data. In Chapter 6, you'll learn how to use Scikit-learn Pipelines to chain these steps together elegantly and safely.
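As a minimal sketch of that train/test discipline (using a hypothetical split of this small DataFrame), the pattern is: fit the transformer on the training portion only, then reuse it to transform both splits:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical split of our preprocessed feature table
X_train, X_test = train_test_split(df, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_scaled = scaler.transform(X_test)        # the same statistics applied to test data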