Before a neural network can learn effectively, the data fed into it must be in the right shape and format. Raw datasets often contain features with varying scales, categorical values, or missing entries, none of which are directly suitable for the mathematical operations within a network. Preparing your data correctly is a foundational step in the deep learning workflow, directly impacting model convergence speed and overall performance. Think of it as preparing the ingredients before cooking; without proper preparation, the final dish is unlikely to succeed.
This section covers essential techniques for transforming raw data into a format suitable for training deep neural networks using modern frameworks. We'll focus on formatting, scaling numerical features, and splitting the data for robust training and evaluation.
Deep learning frameworks like PyTorch and TensorFlow primarily operate on multi-dimensional arrays called tensors. You'll typically represent your input data (features) and target data (labels or target values) as tensors.
Most frameworks provide seamless conversion from common data structures like NumPy arrays or Python lists into their native tensor formats. For instance, in PyTorch:
import torch
import numpy as np
# Example: Convert a NumPy array to a PyTorch tensor
numpy_array = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
feature_tensor = torch.from_numpy(numpy_array).float() # Ensure float type for NN
print(feature_tensor)
# Output:
# tensor([[1., 2.],
#         [3., 4.],
#         [5., 6.]])
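If you are working with TensorFlow instead, the conversion is just as direct. The following is a minimal sketch assuming TensorFlow is installed; tf.convert_to_tensor accepts NumPy arrays (and Python lists) and lets you set the dtype explicitly:
import tensorflow as tf
import numpy as np
numpy_array = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# Convert the NumPy array to a TensorFlow tensor with float32 dtype
feature_tensor_tf = tf.convert_to_tensor(numpy_array, dtype=tf.float32)
print(feature_tensor_tf.shape, feature_tensor_tf.dtype)  # (3, 2) <dtype: 'float32'>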
Neural networks are sensitive to the scale of input features. Features with large values can dominate the learning process, leading to slower convergence or preventing the network from learning effectively. Gradient descent, the core optimization algorithm, often performs better when features are on a similar scale. Two common scaling techniques are normalization and standardization.
Normalization rescales features to a fixed range, typically [0, 1] or [-1, 1]. The formula for scaling to [0, 1] is:
$$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
where $X_{min}$ and $X_{max}$ are the minimum and maximum values of the feature across the training dataset.
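To make the formula concrete, here is a minimal NumPy sketch (with hypothetical values) that computes the minimum and maximum from a training column and rescales it to [0, 1]:
import numpy as np
# Hypothetical feature column from the training set
x_train = np.array([10.0, 12.0, 15.0, 9.0, 11.0])
x_min, x_max = x_train.min(), x_train.max()
# Apply the [0, 1] min-max formula
x_scaled = (x_train - x_min) / (x_max - x_min)
print(x_scaled)  # every value now lies in [0, 1]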
Standardization rescales features to have a mean (μ) of 0 and a standard deviation (σ) of 1. The formula is:
$$X_{scaled} = \frac{X - \mu}{\sigma}$$
Here, $\mu$ and $\sigma$ are the mean and standard deviation calculated from the training dataset.
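The same idea as a minimal sketch, again with hypothetical values; the mean and standard deviation come only from the training column:
import numpy as np
# Hypothetical feature column from the training set
x_train = np.array([10.0, 12.0, 15.0, 9.0, 11.0])
mu, sigma = x_train.mean(), x_train.std()
# Apply the standardization formula
x_standardized = (x_train - mu) / sigma
print(x_standardized.mean(), x_standardized.std())  # approximately 0 and 1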
Original data points compared with their positions after Min-Max Normalization and Standardization. Notice how the outlier (15, 30) compresses the normalized data, while standardization handles it differently.
A significant point often missed by beginners is how to apply scaling when you have separate training, validation, and test datasets: the scaler's parameters (the minimum and maximum, or the mean and standard deviation) must be computed from the training data only and then reused to transform the validation and test data. Fitting the scaler on the full dataset leaks information from the validation and test sets into training, which biases your performance estimates.
Libraries like scikit-learn provide convenient tools for this:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
import torch
# Sample data (replace with your actual data)
X = np.array([[10.0, 0.1], [12.0, 0.2], [15.0, 0.15], [9.0, 0.3], [11.0, 0.05], [18.0, 0.25]])
y = np.array([0, 0, 1, 0, 1, 1])
# 1. Split data first
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42) # 60% train
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42) # 20% val, 20% test
# 2. Initialize and fit scaler ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit and transform
# 3. Transform validation and test data using the SAME fitted scaler
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
# Convert scaled data to PyTorch tensors (example)
X_train_tensor = torch.from_numpy(X_train_scaled).float()
y_train_tensor = torch.from_numpy(y_train).long() # Use long for integer labels
print("Original Training Data Sample:\n", X_train[0])
print("Scaled Training Data Sample:\n", X_train_scaled[0])
print("\nScaled Validation Data Sample:\n", X_val_scaled[0])
# Example Output:
# Original Training Data Sample:
# [11. 0.05]
# Scaled Training Data Sample:
# [-0.3380617 -1.34164079]
# Scaled Validation Data Sample:  (note: uses scaling parameters from the training data)
# [ 0.16903085 -0.4472136 ]
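The validation and test splits convert to tensors in the same way. A brief continuation of the example, assuming the arrays from the snippet above are still in scope:
# Convert the remaining splits, reusing the fitted scaler's output from above
X_val_tensor = torch.from_numpy(X_val_scaled).float()
y_val_tensor = torch.from_numpy(y_val).long()
X_test_tensor = torch.from_numpy(X_test_scaled).float()
y_test_tensor = torch.from_numpy(y_test).long()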
Before training, you must divide your dataset into three distinct subsets: a training set, used to fit the model's weights; a validation set, used to tune hyperparameters and monitor for overfitting during development; and a test set, held back for a single, final estimate of how well the model generalizes to unseen data.
Why the strict separation? If you tune your hyperparameters based on the test set performance, you are implicitly fitting your model selection process to that specific test data. Your test set performance estimate will then be overly optimistic, and the model might not generalize as well to truly unseen data. Common splits include 70% training, 15% validation, 15% test, or 80% training, 10% validation, 10% test, but the optimal ratio can depend on the total dataset size. For very large datasets, the validation and test sets can sometimes be smaller percentages.
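As an illustration of a 70%/15%/15% split, the sketch below applies train_test_split twice on hypothetical data; the stratify argument is an optional addition shown here to keep class proportions similar across the subsets.
from sklearn.model_selection import train_test_split
import numpy as np
# Hypothetical feature matrix and binary labels
X = np.random.rand(1000, 4)
y = np.random.randint(0, 2, size=1000)
# First split: 70% train, 30% held out
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
# Second split: divide the held-out 30% evenly into validation and test (15% each)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=42, stratify=y_hold)
print(len(X_train), len(X_val), len(X_test))  # 700 150 150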
Typical workflow for splitting data into training, validation, and test sets. The test set remains untouched until final evaluation.
Proper data preparation, including formatting, scaling, and splitting, is not merely a preliminary step but an integral part of building successful deep learning models. Getting this right ensures that your network can learn efficiently and that you obtain reliable estimates of its performance.