Now that we've discussed the concepts behind preparing data, scaling numerical features, and encoding categorical ones, let's put theory into practice. In this section, we'll walk through a typical preprocessing workflow in Python, focusing on two popular libraries: pandas for data manipulation and scikit-learn for its preprocessing tools.
Imagine we have a small dataset designed to predict whether a student will pass an exam based on their study hours, their score on a previous related quiz, and the primary method they used to study.
First, let's create a sample dataset using pandas. This mimics how you might load data from a CSV file or database in a real project.
import pandas as pd
import numpy as np
# Create sample data
data = {
    'Study_Hours': [2.5, 5.1, 1.3, 8.7, 4.5, 9.8, 3.3, 6.1, 7.0, 0.5],
    'Previous_Score': [65, 82, 50, 95, 75, 88, 68, 85, 91, 45],
    'Study_Method': ['Solo', 'Group', 'Solo', 'Group', 'Group', 'Group', 'Solo', 'Solo', 'Group', 'Solo'],
    'Passed_Exam': [0, 1, 0, 1, 1, 1, 0, 1, 1, 0]  # Target variable (0=Fail, 1=Pass)
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
This gives us a DataFrame df with our features (Study_Hours, Previous_Score, Study_Method) and a target variable (Passed_Exam) that we want to predict.
Before applying any transformations, we separate our features from the target variable and identify which features are numerical and which are categorical.
# Separate features (X) and target (y)
X = df.drop('Passed_Exam', axis=1)
y = df['Passed_Exam']
# Identify numerical and categorical features
numerical_features = ['Study_Hours', 'Previous_Score']
categorical_features = ['Study_Method']
print("\nFeatures (X):")
print(X.head())
print("\nTarget (y):")
print(y.head())
As discussed in the section on Feature Scaling, numerical features often benefit from scaling, especially for algorithms sensitive to feature magnitudes, such as neural networks. Let's apply Standardization (Z-score scaling) using scikit-learn's StandardScaler. This transforms the data so that it has a mean of 0 and a standard deviation of 1: x′ = (x − μ) / σ.
from sklearn.preprocessing import StandardScaler
# Initialize the scaler
scaler = StandardScaler()
# Fit the scaler to the numerical data and transform it
X_numerical_scaled = scaler.fit_transform(X[numerical_features])
# Convert back to DataFrame for better readability (optional)
X_numerical_scaled_df = pd.DataFrame(X_numerical_scaled, columns=numerical_features, index=X.index)
print("\nNumerical Features Before Scaling:")
print(X[numerical_features].describe())
print("\nNumerical Features After Standardization:")
print(X_numerical_scaled_df.describe())
# Note: Mean is approx 0 and Std Dev is approx 1 after scaling
You'll notice that the mean is very close to zero and the std (standard deviation) is close to one for both Study_Hours and Previous_Score after scaling.
Let's visualize the distribution of Previous_Score before and after scaling using a histogram.

[Figure: Comparison of the distribution for the 'Previous_Score' feature before (blue) and after (orange) applying StandardScaler. The scaled distribution is centered around zero.]
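If you'd like to reproduce a comparison like this yourself, a minimal sketch using matplotlib (an extra dependency not used elsewhere in this section) could look like the following, reusing X and X_numerical_scaled_df from above:

import matplotlib.pyplot as plt

# Side-by-side histograms of the raw and standardized 'Previous_Score' values
fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(10, 4))
ax_before.hist(X['Previous_Score'], bins=5, color='tab:blue')
ax_before.set_title('Previous_Score (before scaling)')
ax_after.hist(X_numerical_scaled_df['Previous_Score'], bins=5, color='tab:orange')
ax_after.set_title('Previous_Score (after StandardScaler)')
plt.tight_layout()
plt.show()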
Alternatively, we could have used MinMaxScaler to scale the data into a specific range, typically [0, 1], using the formula x′ = (x − min(x)) / (max(x) − min(x)). The choice often depends on the specific data and network architecture, but StandardScaler is a common default.
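As a quick illustration, here is the equivalent transformation with MinMaxScaler applied to the same numerical features; everything else about the workflow stays the same:

from sklearn.preprocessing import MinMaxScaler

# Scale each numerical feature into the [0, 1] range
minmax_scaler = MinMaxScaler()
X_numerical_minmax = minmax_scaler.fit_transform(X[numerical_features])
X_numerical_minmax_df = pd.DataFrame(X_numerical_minmax, columns=numerical_features, index=X.index)
print(X_numerical_minmax_df.describe())
# Note: min is 0 and max is 1 for each column after scaling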
Our Study_Method column contains text categories ('Solo', 'Group'). Neural networks require numerical input, so we need to encode this feature. One-hot encoding is a standard technique for nominal categorical variables (those with no inherent order): it creates a new binary (0 or 1) column for each category.
We'll use scikit-learn's OneHotEncoder.
from sklearn.preprocessing import OneHotEncoder
# Initialize the encoder
# handle_unknown='ignore' prevents errors if unseen categories appear later (e.g., in test data)
# sparse_output=False returns a dense numpy array instead of a sparse matrix
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# Fit the encoder to the categorical data and transform it
X_categorical_encoded = encoder.fit_transform(X[categorical_features])
# Get the new feature names generated by the encoder
encoded_feature_names = encoder.get_feature_names_out(categorical_features)
# Convert to DataFrame (optional, for readability)
X_categorical_encoded_df = pd.DataFrame(X_categorical_encoded, columns=encoded_feature_names, index=X.index)
print("\nCategorical Features Before Encoding:")
print(X[categorical_features].head())
print("\nCategorical Features After One-Hot Encoding:")
print(X_categorical_encoded_df.head())
As you can see, the single Study_Method column is replaced by two new columns: Study_Method_Group and Study_Method_Solo. A '1' in a column indicates the presence of that category for the given sample.
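This is also a good place to see handle_unknown='ignore' in action. The 'Online' study method below is hypothetical (it does not appear in our data), so the fitted encoder maps it to all zeros rather than raising an error:

# Transform a category the encoder never saw during fitting
unseen = pd.DataFrame({'Study_Method': ['Online']})
print(encoder.transform(unseen))
# Output: [[0. 0.]] -- both one-hot columns are 0 for the unseen category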
Now, we combine our scaled numerical features and our encoded categorical features into a single array or DataFrame. This combined dataset is what we would typically feed into a neural network.
# Concatenate scaled numerical and encoded categorical features
X_processed = pd.concat([X_numerical_scaled_df, X_categorical_encoded_df], axis=1)
print("\nFinal Processed Features (X_processed):")
print(X_processed.head())
# Convert to NumPy array if needed for specific frameworks
X_processed_np = X_processed.to_numpy()
print("\nFinal Processed Features as NumPy array (first 5 rows):")
print(X_processed_np[:5])
print("\nShape of final processed features:", X_processed_np.shape)
Our final processed feature set X_processed now has 4 columns (2 scaled numerical, 2 one-hot encoded categorical) and is purely numerical, ready for training.
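For a dataset this small, the manual steps above are easy to follow, but scikit-learn's ColumnTransformer can bundle both transformations into a single object, which is convenient on larger projects. Here is a sketch that should produce the same four columns (column order follows the order of the listed transformers):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Apply the appropriate transformer to each group of columns in one step
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features),
])
X_processed_alt = preprocessor.fit_transform(X)
print(X_processed_alt.shape)  # (10, 4), matching X_processed above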
While this section focuses on preprocessing transformations, a standard next step is to split the data into training, validation, and test sets, as discussed previously. This ensures we can train the model, tune its hyperparameters, and evaluate its final performance on unseen data. scikit-learn provides the train_test_split function for this.
from sklearn.model_selection import train_test_split
# Split the *processed* features (X_processed) and the original target (y)
# Typically split into training and a temporary set, then split temp into validation and test
# For simplicity, we'll do a single split into train and test here.
X_train, X_test, y_train, y_test = train_test_split(
    X_processed_np,   # Use the final numpy array
    y.to_numpy(),     # Convert target Series to numpy array
    test_size=0.2,    # Reserve 20% for the test set
    random_state=42   # Ensures reproducibility of the split
)
print("\nShape of Training Features:", X_train.shape)
print("Shape of Test Features:", X_test.shape)
print("Shape of Training Target:", y_train.shape)
print("Shape of Test Target:", y_test.shape)
In this hands-on practical, we took a raw sample dataset and applied essential preprocessing steps:

- StandardScaler to numerical features, to center them around zero with unit variance.
- OneHotEncoder to convert categorical features into a numerical format.

This processed, split data (X_train, y_train, X_test, y_test) is now in the appropriate format to be used for training and evaluating a neural network, which we will cover in subsequent chapters. Mastering these preprocessing steps is fundamental for building effective machine learning models.