Preparing data for neural networks involves scaling numerical features and encoding categorical ones. This section demonstrates a typical preprocessing workflow in Python, using the pandas library for data manipulation and scikit-learn for the preprocessing tools.

Imagine we have a small dataset designed to predict whether a student will pass an exam based on their study hours, their score on a previous related quiz, and the primary method they used to study.

## Setting up the Sample Data

First, let's create a sample dataset using pandas. This mimics how you might load data from a CSV file or database in a real project.

```python
import pandas as pd
import numpy as np

# Create sample data
data = {
    'Study_Hours': [2.5, 5.1, 1.3, 8.7, 4.5, 9.8, 3.3, 6.1, 7.0, 0.5],
    'Previous_Score': [65, 82, 50, 95, 75, 88, 68, 85, 91, 45],
    'Study_Method': ['Solo', 'Group', 'Solo', 'Group', 'Group', 'Group', 'Solo', 'Solo', 'Group', 'Solo'],
    'Passed_Exam': [0, 1, 0, 1, 1, 1, 0, 1, 1, 0]  # Target variable (0=Fail, 1=Pass)
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
```

This gives us a DataFrame `df` with our features (`Study_Hours`, `Previous_Score`, `Study_Method`) and a target variable (`Passed_Exam`) that we want to predict.

## Identifying Feature Types

Before applying any transformations, we separate our features from the target variable and identify which features are numerical and which are categorical.

```python
# Separate features (X) and target (y)
X = df.drop('Passed_Exam', axis=1)
y = df['Passed_Exam']

# Identify numerical and categorical features
numerical_features = ['Study_Hours', 'Previous_Score']
categorical_features = ['Study_Method']

print("\nFeatures (X):")
print(X.head())
print("\nTarget (y):")
print(y.head())
```

## Scaling Numerical Features

As discussed in the section on Feature Scaling, numerical features often benefit from scaling, especially for algorithms sensitive to feature magnitudes, such as neural networks. Let's apply standardization (Z-score scaling) using scikit-learn's `StandardScaler`. This transforms the data so that it has a mean of 0 and a standard deviation of 1 ($x' = \frac{x - \mu}{\sigma}$).

```python
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler to the numerical data and transform it
X_numerical_scaled = scaler.fit_transform(X[numerical_features])

# Convert back to DataFrame for better readability (optional)
X_numerical_scaled_df = pd.DataFrame(X_numerical_scaled,
                                     columns=numerical_features,
                                     index=X.index)

print("\nNumerical Features Before Scaling:")
print(X[numerical_features].describe())
print("\nNumerical Features After Standardization:")
print(X_numerical_scaled_df.describe())
# Note: mean is approximately 0 and std dev is approximately 1 after scaling
```

You'll notice that the mean is very close to zero and the standard deviation is close to one for both `Study_Hours` and `Previous_Score` after scaling.

Let's visualize the distribution of `Previous_Score` before and after scaling using a histogram.

*Figure: Comparison of the distribution of the `Previous_Score` feature before (blue) and after (orange) applying `StandardScaler`. The scaled distribution is centered around zero.*

Alternatively, we could have used `MinMaxScaler` to scale the data into a specific range, typically [0, 1], using the formula $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$. The choice often depends on the specific data and network architecture, but `StandardScaler` is a common default.
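For comparison, here is a minimal sketch of what min-max scaling of the same numerical columns might look like. It reuses the `X` and `numerical_features` variables defined above and is not part of the main walkthrough; only one of the two scalers would be applied in practice.

```python
from sklearn.preprocessing import MinMaxScaler

# Initialize a min-max scaler (the default feature_range is (0, 1))
minmax_scaler = MinMaxScaler()

# Fit to the same numerical columns and transform them into the [0, 1] range
X_numerical_minmax = minmax_scaler.fit_transform(X[numerical_features])

X_numerical_minmax_df = pd.DataFrame(X_numerical_minmax,
                                     columns=numerical_features,
                                     index=X.index)

print("\nNumerical Features After Min-Max Scaling:")
print(X_numerical_minmax_df.describe())
# Note: min is 0 and max is 1 for each column after scaling
```

The rest of this walkthrough continues with the standardized values from `StandardScaler`.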
## Encoding Categorical Features

Our `Study_Method` column contains text categories ('Solo', 'Group'). Neural networks require numerical input, so we need to encode this feature. One-hot encoding is a standard technique for nominal categorical variables (those with no inherent order): it creates a new binary (0 or 1) column for each category.

We'll use scikit-learn's `OneHotEncoder`.

```python
from sklearn.preprocessing import OneHotEncoder

# Initialize the encoder
# handle_unknown='ignore' prevents errors if unseen categories appear later (e.g., in test data)
# sparse_output=False returns a dense NumPy array instead of a sparse matrix
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit the encoder to the categorical data and transform it
X_categorical_encoded = encoder.fit_transform(X[categorical_features])

# Get the new feature names generated by the encoder
encoded_feature_names = encoder.get_feature_names_out(categorical_features)

# Convert to DataFrame (optional, for readability)
X_categorical_encoded_df = pd.DataFrame(X_categorical_encoded,
                                        columns=encoded_feature_names,
                                        index=X.index)

print("\nCategorical Features Before Encoding:")
print(X[categorical_features].head())
print("\nCategorical Features After One-Hot Encoding:")
print(X_categorical_encoded_df.head())
```

As you can see, the single `Study_Method` column is replaced by two new columns: `Study_Method_Group` and `Study_Method_Solo`. A '1' in a column indicates the presence of that category for the given sample.
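One refinement worth knowing about: when a feature has exactly two categories, the two one-hot columns are redundant (`Study_Method_Solo` is always `1 - Study_Method_Group`). `OneHotEncoder` provides a `drop` argument for this; the sketch below is optional and uses the default `handle_unknown='error'` to keep it simple, with the same `X` and `categorical_features` as above.

```python
from sklearn.preprocessing import OneHotEncoder

# drop='if_binary' keeps a single column for any feature with exactly two categories
binary_encoder = OneHotEncoder(sparse_output=False, drop='if_binary')

X_categorical_binary = binary_encoder.fit_transform(X[categorical_features])

# Only one column remains for the binary Study_Method feature
print(binary_encoder.get_feature_names_out(categorical_features))
print(X_categorical_binary[:5])
```

For this walkthrough we keep both columns, matching the output shown above; dropping one is mainly a way to avoid redundant inputs when you have many binary features.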
## Combining Processed Features

Now we combine the scaled numerical features and the encoded categorical features into a single array or DataFrame. This combined dataset is what we would typically feed into a neural network.

```python
# Concatenate scaled numerical and encoded categorical features
X_processed = pd.concat([X_numerical_scaled_df, X_categorical_encoded_df], axis=1)

print("\nFinal Processed Features (X_processed):")
print(X_processed.head())

# Convert to NumPy array if needed for specific frameworks
X_processed_np = X_processed.to_numpy()
print("\nFinal Processed Features as NumPy array (first 5 rows):")
print(X_processed_np[:5])
print("\nShape of final processed features:", X_processed_np.shape)
```

Our final processed feature set `X_processed` now has 4 columns (2 scaled numerical, 2 one-hot encoded categorical) and is purely numerical, ready for training.

## Splitting Data for Training and Evaluation

While this section focuses on preprocessing transformations, a standard next step is to split the data into training, validation, and test sets, as discussed previously. This ensures we can train the model, tune its hyperparameters, and evaluate its final performance on unseen data. scikit-learn provides the `train_test_split` function for this. For simplicity, we perform a single train/test split here; a sketch of the fuller three-way split appears after the summary.

```python
from sklearn.model_selection import train_test_split

# Split the *processed* features (X_processed) and the original target (y).
# Typically you would split into training and a temporary set, then split the
# temporary set into validation and test; here we do a single train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X_processed_np,    # Use the final NumPy array
    y.to_numpy(),      # Convert target Series to NumPy array
    test_size=0.2,     # Reserve 20% for the test set
    random_state=42    # Ensures reproducibility of the split
)

print("\nShape of Training Features:", X_train.shape)
print("Shape of Test Features:", X_test.shape)
print("Shape of Training Target:", y_train.shape)
print("Shape of Test Target:", y_test.shape)
```

## Summary

In this hands-on practical, we took a raw sample dataset and applied the essential preprocessing steps:

1. Identified numerical and categorical features.
2. Applied `StandardScaler` to numerical features to center them around zero with unit variance.
3. Applied `OneHotEncoder` to convert categorical features into a numerical format.
4. Combined the transformed features into a single dataset.
5. Performed a train-test split on the processed data.

This processed, split data (`X_train`, `y_train`, `X_test`, `y_test`) is now in the appropriate format for training and evaluating a neural network, which we will cover in subsequent chapters. Mastering these preprocessing steps is fundamental to building effective machine learning models.
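As promised above, here is a minimal sketch of the three-way train/validation/test split mentioned in the splitting step. It assumes the same `X_processed_np` and `y` as in the walkthrough; the 60/20/20 proportions are only an illustrative choice.

```python
from sklearn.model_selection import train_test_split

# First split off the test set (20% of the data)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X_processed_np, y.to_numpy(), test_size=0.2, random_state=42
)

# Then split the remaining 80% into training and validation sets
# (0.25 of the remaining data = 20% of the original data)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42
)

print("Train:", X_train.shape, "Validation:", X_val.shape, "Test:", X_test.shape)
```

With a dataset of only 10 samples the resulting splits are tiny, so this is purely illustrative; on real data, the validation set is what you would use for tuning hyperparameters before the final evaluation on the test set.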