Before a neural network can learn effectively, the data fed into it must be in the right shape and format. Raw datasets often contain features with varying scales, categorical values, or missing entries, none of which are directly suitable for the mathematical operations within a network. Preparing your data correctly is a foundational step in the deep learning workflow, directly impacting model convergence speed and overall performance. Think of it as preparing the ingredients before cooking; without proper preparation, the final dish is unlikely to succeed.
This section covers essential techniques for transforming raw data into a format suitable for training deep neural networks using modern frameworks. We'll focus on formatting, scaling numerical features, and splitting the data for robust training and evaluation.
Deep learning frameworks like PyTorch and TensorFlow primarily operate on multi-dimensional arrays called tensors. You'll typically represent your input data (features) and target data (labels or target values) as tensors.
Most frameworks provide seamless conversion from common data structures like NumPy arrays or Python lists into their native tensor formats. For instance, in PyTorch:
import torch
import numpy as np
# Example: Convert a NumPy array to a PyTorch tensor
numpy_array = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
feature_tensor = torch.from_numpy(numpy_array).float() # Ensure float type for NN
print(feature_tensor)
# Output:
# tensor([[1., 2.],
#         [3., 4.],
#         [5., 6.]])
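If you are working with TensorFlow instead, the conversion is just as direct. The following is a minimal sketch assuming TensorFlow is installed; tf.convert_to_tensor accepts NumPy arrays (and Python lists) and lets you set the dtype explicitly:
import tensorflow as tf
import numpy as np
numpy_array = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# Convert the NumPy array to a TensorFlow tensor with float32 dtype
feature_tensor_tf = tf.convert_to_tensor(numpy_array, dtype=tf.float32)
print(feature_tensor_tf.shape, feature_tensor_tf.dtype)  # (3, 2) <dtype: 'float32'>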
Neural networks are sensitive to the scale of input features. Features with large values can dominate the learning process, leading to slower convergence or preventing the network from learning effectively. Gradient descent, the core optimization algorithm, often performs better when features are on a similar scale. Two common scaling techniques are normalization and standardization.
Normalization rescales features to a fixed range, typically [0, 1] or [-1, 1]. The formula for scaling to [0, 1] is:
$$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
where $X_{min}$ and $X_{max}$ are the minimum and maximum values of the feature across the training dataset.
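To make the formula concrete, here is a minimal NumPy sketch (with hypothetical values) that computes the minimum and maximum from a training column and rescales it to [0, 1]:
import numpy as np
# Hypothetical feature column from the training set
x_train = np.array([10.0, 12.0, 15.0, 9.0, 11.0])
x_min, x_max = x_train.min(), x_train.max()
# Apply the [0, 1] min-max formula
x_scaled = (x_train - x_min) / (x_max - x_min)
print(x_scaled)  # every value now lies in [0, 1]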
Standardization rescales features to have a mean (μ) of 0 and a standard deviation (σ) of 1. The formula is:
$$X_{scaled} = \frac{X - \mu}{\sigma}$$
Here, $\mu$ and $\sigma$ are the mean and standard deviation calculated from the training dataset.
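The same idea as a minimal sketch, again with hypothetical values; the mean and standard deviation come only from the training column:
import numpy as np
# Hypothetical feature column from the training set
x_train = np.array([10.0, 12.0, 15.0, 9.0, 11.0])
mu, sigma = x_train.mean(), x_train.std()
# Apply the standardization formula
x_standardized = (x_train - mu) / sigma
print(x_standardized.mean(), x_standardized.std())  # approximately 0 and 1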
Original data points compared with their positions after Min-Max Normalization and Standardization. Notice how the outlier (15, 30) compresses the normalized data, while standardization handles it differently.
A significant point often missed by beginners is how to apply scaling when you have separate training, validation, and test datasets: the scaler's parameters (the minimum and maximum, or the mean and standard deviation) must be computed from the training data only and then reused to transform the validation and test data. Fitting the scaler on the full dataset leaks information from the validation and test sets into training, which biases your performance estimates.
Libraries like scikit-learn provide convenient tools for this:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
import torch
# Sample data (replace with your actual data)
X = np.array([[10.0, 0.1], [12.0, 0.2], [15.0, 0.15], [9.0, 0.3], [11.0, 0.05], [18.0, 0.25]])
y = np.array([0, 0, 1, 0, 1, 1])
# 1. Split data first
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42) # 60% train
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42) # 20% val, 20% test
# 2. Initialize and fit scaler ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit and transform
# 3. Transform validation and test data using the SAME fitted scaler
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
# Convert scaled data to PyTorch tensors (example)
X_train_tensor = torch.from_numpy(X_train_scaled).float()
y_train_tensor = torch.from_numpy(y_train).long() # Use long for integer labels
print("Original Training Data Sample:\n", X_train[0])
print("Scaled Training Data Sample:\n", X_train_scaled[0])
print("\nScaled Validation Data Sample:\n", X_val_scaled[0])
# Example Output:
# Original Training Data Sample:
# [11. 0.05]
# Scaled Training Data Sample:
# [-0.3380617 -1.34164079]
# Scaled Validation Data Sample:  (note: uses scaling parameters from the training data)
# [ 0.16903085 -0.4472136 ]
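The validation and test splits convert to tensors in the same way. A brief continuation of the example, assuming the arrays from the snippet above are still in scope:
# Convert the remaining splits, reusing the fitted scaler's output from above
X_val_tensor = torch.from_numpy(X_val_scaled).float()
y_val_tensor = torch.from_numpy(y_val).long()
X_test_tensor = torch.from_numpy(X_test_scaled).float()
y_test_tensor = torch.from_numpy(y_test).long()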
Before training, you must divide your dataset into three distinct subsets: a training set, used to fit the model's weights; a validation set, used to tune hyperparameters and monitor for overfitting during development; and a test set, held back for a single, final estimate of how well the model generalizes to unseen data.
Why the strict separation? If you tune your hyperparameters based on the test set performance, you are implicitly fitting your model selection process to that specific test data. Your test set performance estimate will then be overly optimistic, and the model might not generalize as well to truly unseen data. Common splits include 70% training, 15% validation, 15% test, or 80% training, 10% validation, 10% test, but the optimal ratio can depend on the total dataset size. For very large datasets, the validation and test sets can sometimes be smaller percentages.
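As an illustration of a 70%/15%/15% split, the sketch below applies train_test_split twice on hypothetical data; the stratify argument is an optional addition shown here to keep class proportions similar across the subsets.
from sklearn.model_selection import train_test_split
import numpy as np
# Hypothetical feature matrix and binary labels
X = np.random.rand(1000, 4)
y = np.random.randint(0, 2, size=1000)
# First split: 70% train, 30% held out
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
# Second split: divide the held-out 30% evenly into validation and test (15% each)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=42, stratify=y_hold)
print(len(X_train), len(X_val), len(X_test))  # 700 150 150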
Typical workflow for splitting data into training, validation, and test sets. The test set remains untouched until final evaluation.
Proper data preparation, including formatting, scaling, and splitting, is not merely a preliminary step but an integral part of building successful deep learning models. Getting this right ensures that your network can learn efficiently and that you obtain reliable estimates of its performance.