As discussed in the chapter introduction, many machine learning algorithms perform better or converge faster when features are on a similar scale. Algorithms that compute distances between data points (like K-Nearest Neighbors) or rely on gradient descent optimization (like linear regression, logistic regression, and neural networks) are particularly sensitive to the scale of input features. If one feature ranges from 0 to 1 and another ranges from 0 to 1,000,000, the algorithm might assign more importance to the feature with the larger range simply because of its scale, not its predictive value.
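To see why, consider a small, hypothetical example: with age measured in years and income in dollars, the Euclidean distance between two people is driven almost entirely by the income difference, even when the age difference is large relative to the typical range of ages.

import numpy as np

# Hypothetical points: [age in years, income in dollars]
a = np.array([25, 50000])
b = np.array([60, 52000])

# The distance is dominated by the income gap (2000); the age gap (35)
# contributes almost nothing, despite being large in relative terms.
print(np.linalg.norm(a - b))  # ~2000.31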
Standardization, often referred to as Z-score scaling, is a common and effective technique to address this. It transforms each feature so that it has a mean (μ) of 0 and a standard deviation (σ) of 1.
The transformation for each value x in a feature is calculated using the following formula:
Z = (x − μ) / σ

where μ is the mean of the feature and σ is its standard deviation.
Each transformed value represents the number of standard deviations the original value was away from the mean. Values greater than the mean will be positive, values less than the mean will be negative, and a value equal to the mean will be zero.
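As a quick worked check, you can apply the formula by hand. For the 'Age' values used in the example below, the mean is 42.5 and the population standard deviation is about 11.46, so an age of 25 maps to Z = (25 − 42.5) / 11.46 ≈ −1.53. Note that StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default:

import numpy as np

ages = np.array([25, 30, 35, 40, 45, 50, 55, 60])
mu = ages.mean()       # 42.5
sigma = ages.std()     # population std dev (ddof=0), ~11.456
z = (ages - mu) / sigma
print(z[0])            # ~ -1.5275, matching StandardScaler's output below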
Scikit-learn provides a convenient transformer class, StandardScaler, within its preprocessing module. Like other Scikit-learn transformers, it follows the fit and transform pattern.

The fit method calculates the mean (μ) and standard deviation (σ) for each feature in the training data. These calculated parameters are stored within the scaler object. It's important to fit the scaler only on the training data to prevent data leakage from the test set.

The transform method uses the learned μ and σ (from the fit step) to apply the standardization formula to the data, creating the scaled features. You will use this method on both the training data and, later, on any new data (like the validation or test set) before feeding it to your model.

Let's see it in action. Assume we have a simple dataset with 'Age' and 'Income' features:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Sample Data
data = {'Age': [25, 30, 35, 40, 45, 50, 55, 60],
        'Income': [50000, 55000, 60000, 65000, 70000, 75000, 80000, 85000]}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# 1. Initialize the Scaler
scaler = StandardScaler()
# 2. Fit the scaler on the data (calculates mean and std dev)
# In a real scenario, fit ONLY on training data
scaler.fit(df)
# 3. Transform the data (applies the scaling)
scaled_data = scaler.transform(df)
# Convert back to DataFrame for better readability
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print("\nScaled Data (Standardization):")
print(scaled_df)
# You can inspect the learned parameters
print(f"\nLearned Mean: {scaler.mean_}")
print(f"Learned Scale (Std Dev): {scaler.scale_}") # scale_ is the standard deviation
Output:
Original Data:
Age Income
0 25 50000
1 30 55000
2 35 60000
3 40 65000
4 45 70000
5 50 75000
6 55 80000
7 60 85000
Scaled Data (Standardization):
Age Income
0 -1.527525 -1.527525
1 -1.091089 -1.091089
2 -0.654654 -0.654654
3 -0.218218 -0.218218
4 0.218218 0.218218
5 0.654654 0.654654
6 1.091089 1.091089
7 1.527525 1.527525
Learned Mean: [ 42.5 67500. ]
Learned Scale (Std Dev): [ 11.45643924 11456.4392401 ]
Notice how the scaled features are now centered around zero. The exact values reflect their original position relative to the mean, measured in standard deviations.
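As a side note, you can combine the two steps with fit_transform, and StandardScaler also provides inverse_transform to map Z-scores back to the original units. A brief sketch, reusing the scaler and df from above:

# Shorthand for fit followed by transform (use only on training data)
scaled = scaler.fit_transform(df)

# Map the Z-scores back to the original units
restored = scaler.inverse_transform(scaled)
print(restored[0])  # first row: [25., 50000.]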
Standardization changes the scale of the data but preserves the shape of its distribution. If a feature was skewed before standardization, it will still be skewed afterwards, just on a different scale.
Distribution of the 'Age' feature before (left, blue) and after (right, orange) standardization. Note that the shape of the histogram is the same, but the x-axis scale has changed to reflect the Z-scores centered around 0.
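You can confirm this numerically: because standardization is a linear transformation with a positive scale factor, the skewness of a feature is unchanged. A small sketch using scipy.stats.skew on some hypothetical right-skewed data:

import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

# Hypothetical right-skewed feature
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000).reshape(-1, 1)

x_scaled = StandardScaler().fit_transform(x)

print(skew(x).round(3))         # skewness before scaling
print(skew(x_scaled).round(3))  # same skewness after scaling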
Standardization is a foundational technique for preparing numerical features. By centering the data around zero and scaling it based on its standard deviation, you make it more suitable for a wide range of machine learning algorithms, particularly those sensitive to feature scales. Remember to fit the StandardScaler
only on your training data and then use it to transform both your training and test/validation sets.
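To make that train/test discipline concrete, here is a minimal sketch of the typical workflow, assuming a feature matrix X and labels y are already defined:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumes X (features) and y (labels) already exist
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics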