As mentioned earlier, neural networks perform mathematical operations on input data. The scale and distribution of this data can significantly impact the network's ability to learn effectively. Imagine a dataset with two features: one ranging from 0 to 1, and another ranging from 1,000 to 100,000. During training, particularly when using gradient-based optimization methods, the feature with the larger values might disproportionately influence the updates to the network's weights, potentially slowing down convergence or even preventing the network from finding a good solution. Feature scaling addresses this by transforming the data so that all features contribute more equally to the learning process.
Two common techniques for feature scaling are Normalization (often called Min-Max Scaling) and Standardization (or Z-score Normalization). Let's examine each.
Normalization rescales features to a fixed range, typically [0, 1] or sometimes [-1, 1]. The formula for scaling to [0, 1] is:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Here, x is the original feature value, min(x) is the minimum value of that feature in the dataset, and max(x) is the maximum value. Each value in the feature column is transformed into a new value x′ between 0 and 1.
Advantages:

- All features end up on the same fixed, bounded scale, typically [0, 1], which is convenient when a model or downstream step expects inputs in a known range.
- The transformation preserves the shape of the original distribution and the relative ordering of values.

Disadvantages:

- It is highly sensitive to outliers: a single extreme value sets min(x) or max(x) and compresses every other value into a narrow band, as the example below shows.
- Values seen at inference time may fall outside the training range, producing scaled values outside [0, 1].
Consider a feature with values ranging from 10 to 110. The minimum is 10 and the maximum is 110, so a value of 60 normalizes to (60 − 10)/(110 − 10) = 50/100 = 0.5.
If an outlier of 1,000 appears, the maximum becomes 1,000. Now the value 60 normalizes to (60 − 10)/(1,000 − 10) = 50/990 ≈ 0.05, compressing the original range significantly.
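To make this concrete, here is a minimal NumPy sketch of the calculation (the sample values are made up for illustration):

```python
import numpy as np

def min_max_scale(x):
    """Rescale values to [0, 1]: x' = (x - min(x)) / (max(x) - min(x))."""
    return (x - x.min()) / (x.max() - x.min())

# Feature values ranging from 10 to 110, as in the example above.
values = np.array([10.0, 35.0, 60.0, 85.0, 110.0])
print(min_max_scale(values))  # 60 maps to 0.5

# A single outlier of 1,000 stretches the denominator and
# compresses the original range: 60 now maps to roughly 0.05.
with_outlier = np.append(values, 1000.0)
print(min_max_scale(with_outlier))
```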
Standardization rescales features so they have a mean (μ) of 0 and a standard deviation (σ) of 1. The formula is:
$$x' = \frac{x - \mu}{\sigma}$$

Here, x is the original value, μ is the mean of the feature values, and σ is their standard deviation. The resulting value x′ represents the number of standard deviations the original value x is away from the mean.
Advantages:

- It is considerably less sensitive to outliers than min-max scaling, since no single extreme value dictates the scale.
- Centering features at zero with unit variance works well with gradient-based optimization and common weight-initialization schemes.

Disadvantages:

- The resulting values are not bounded to a fixed range, which can matter if a downstream component expects bounded inputs.
- The z-scores are most interpretable when the feature distribution is roughly Gaussian, although standardization is routinely applied even when it is not.
If a feature has a mean of 50 and a standard deviation of 15, a value of 65 becomes (65 − 50)/15 = 1.0 (one standard deviation above the mean), while a value of 35 becomes (35 − 50)/15 = −1.0 (one standard deviation below it).
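As a quick numerical check, here is a minimal NumPy sketch; the feature values are synthetic draws chosen to have roughly this mean and standard deviation:

```python
import numpy as np

def standardize(x):
    """Rescale values to zero mean, unit variance: x' = (x - mu) / sigma."""
    return (x - x.mean()) / x.std()

# Synthetic feature with mean ~50 and standard deviation ~15.
rng = np.random.default_rng(0)
feature = rng.normal(loc=50.0, scale=15.0, size=10_000)

scaled = standardize(feature)
print(round(scaled.mean(), 4), round(scaled.std(), 4))  # ~0.0 and ~1.0

# A raw value of 65 sits one standard deviation above the mean:
print((65.0 - 50.0) / 15.0)  # 1.0
```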
The following visualization shows the effect of Normalization and Standardization on a synthetic dataset with two features having different scales.
The original data (scatter plot) has Feature 1 ranging roughly 20-80 and Feature 2 ranging roughly 1,000-10,000; notice the vastly different scales on the axes. Normalization (top right) scales the data to [0, 1] on both axes, while Standardization (bottom left) centers the data around (0, 0) with unit variance. The relative positions change slightly due to the outlier's influence.
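The plot itself is not reproduced here, but a sketch along these lines generates the transformed datasets behind it; the synthetic feature ranges are assumptions chosen to match the figure's axes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic two-feature data on very different scales, roughly
# matching the figure: Feature 1 in ~[20, 80], Feature 2 in ~[1000, 10000].
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.uniform(20, 80, size=200),
    rng.uniform(1_000, 10_000, size=200),
])

X_norm = MinMaxScaler().fit_transform(X)   # both columns now in [0, 1]
X_std = StandardScaler().fit_transform(X)  # both columns: mean ~0, std ~1

print(X_norm.min(axis=0), X_norm.max(axis=0))
print(X_std.mean(axis=0).round(4), X_std.std(axis=0).round(4))
```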
How do you choose between the two? There's no single definitive answer, but here are some guidelines:

- Normalization is a reasonable choice when you need values in a fixed, bounded range, such as scaling pixel intensities to [0, 1], or when the feature distribution is clearly not Gaussian.
- Standardization is usually preferable when the distribution is roughly Gaussian or when the data contains outliers, since no single extreme value dictates the scale.
- When in doubt, try both and compare performance on the validation set.
In practice, standardization ($x' = (x - \mu)/\sigma$) is the more common choice for preparing data for deep neural networks.
A frequent mistake is to calculate scaling parameters (min/max or mean/std dev) across the entire dataset before splitting it into training, validation, and test sets. This introduces data leakage, where information from the validation and test sets influences the transformation applied to the training set, leading to overly optimistic performance estimates.
The correct procedure is:

1. Split the raw data into training, validation, and test sets.
2. Fit the scaler on the training set only, computing min/max or mean/standard deviation from the training data alone.
3. Apply that fitted transformation to the training, validation, and test sets.
This ensures that the model evaluation on the validation and test sets reflects how the model would perform on new, unseen data that has been processed using knowledge gained solely from the training data. Libraries like scikit-learn provide tools (`StandardScaler`, `MinMaxScaler`) that facilitate this correct workflow through their `fit` (on training data) and `transform` (on all splits) methods.
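A minimal sketch of this workflow, using a synthetic feature matrix (the data, split proportion, and random seeds are arbitrary illustration choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical raw feature matrix and labels.
rng = np.random.default_rng(7)
X = rng.normal(loc=50.0, scale=15.0, size=(1_000, 3))
y = rng.integers(0, 2, size=1_000)

# 1. Split first, so the scaler never sees the held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()

# 2. Fit on the training set only: learns the training mean and std.
X_train_scaled = scaler.fit_transform(X_train)

# 3. Transform the test set with the *training* statistics.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3))  # ~0 by construction
print(X_test_scaled.mean(axis=0).round(3))   # close to 0, but not exactly
```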