As you've explored your data through visualization and analysis, you've likely noticed that different numerical features can have vastly different ranges, units, and distributions. For instance, one feature might represent age (ranging from 18 to 80), while another represents income (ranging from 20,000 to 200,000). Many machine learning algorithms, particularly those based on distance calculations (like K-Nearest Neighbors, Support Vector Machines) or gradient descent (like linear regression, neural networks), perform better or converge faster when numerical input features are on a similar scale. This section introduces two common techniques for rescaling features: Standardization and Min-Max Scaling.
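To see the scale problem concretely, here is a minimal sketch (with hypothetical age and salary values) showing how an unscaled Euclidean distance is dominated by the feature with the larger raw values:

import numpy as np

# Two hypothetical people described by (age, salary)
person_a = np.array([25.0, 50_000.0])
person_b = np.array([50.0, 52_000.0])

# Without scaling, the Euclidean distance is dominated by the salary feature
print(np.linalg.norm(person_a - person_b))  # ~2000.2; the 25-year age gap barely registers

A distance-based model fed these raw values would treat the two people as differing almost entirely in salary, even though a 25-year age gap is substantial.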
It's important to note that these transformations change the scale and location (mean) of the data but generally do not change the shape of its distribution. If a feature has a skewed distribution, scaling won't make it normally distributed. Addressing skewness often requires different transformations (like log or Box-Cox transforms), which are typically considered more advanced feature engineering steps.
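You can verify this yourself with a quick sketch, using a hypothetical right-skewed series: standardization leaves the skewness untouched, while a log transform actually changes the shape.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A hypothetical right-skewed feature
skewed = pd.Series([20, 22, 25, 30, 40, 60, 120, 400], dtype=float)
standardized = pd.Series(StandardScaler().fit_transform(skewed.to_frame()).ravel())

print(skewed.skew())          # strongly positive skew
print(standardized.skew())    # identical: scaling shifts and rescales but keeps the shape
print(np.log(skewed).skew())  # the log transform actually reduces the skew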
Standardization rescales data so that it has a mean (μ) of 0 and a standard deviation (σ) of 1. This transformation is often called Z-score normalization. The formula for standardization is:
$$X_{scaled} = \frac{X - \mu}{\sigma}$$

where $X$ is the original feature value, $\mu$ is the mean of the feature column, and $\sigma$ is the standard deviation of the feature column.
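For example, for the Age column used below (25, 30, 35, 40, 45, 50), $\mu = 37.5$ and the population standard deviation is $\sigma \approx 8.54$, so an age of 25 becomes $(25 - 37.5)/8.54 \approx -1.46$.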
Why use Standardization?

Standardization centers each feature at zero with unit variance, which helps gradient-based optimizers converge and keeps features with large raw values from dominating distance calculations. Because it does not force values into a fixed range, it also handles outliers more gracefully than Min-Max scaling.
Let's see how to apply Standardization using Scikit-learn's StandardScaler.
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Sample DataFrame
data = {'Age': [25, 30, 35, 40, 45, 50],
        'Salary': [50000, 60000, 75000, 90000, 110000, 150000]}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# Initialize the scaler
scaler = StandardScaler()
# Fit the scaler to the data and transform it
# Note: fit_transform expects a 2D array, hence df[['Age', 'Salary']]
scaled_data = scaler.fit_transform(df[['Age', 'Salary']])
# Convert back to DataFrame for better readability
df_scaled = pd.DataFrame(scaled_data, columns=['Age_scaled', 'Salary_scaled'])
print("\nStandardized Data:")
print(df_scaled)
print(f"\nMean after scaling:\n{df_scaled.mean()}")
print(f"\nStandard Deviation after scaling:\n{df_scaled.std()}")
Notice how the means are very close to 0 and the standard deviations are 1 after scaling.
Min-Max scaling rescales the data to a fixed range, typically [0, 1], although other ranges can be specified. The formula for Min-Max scaling to the [0, 1] range is:
$$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where $X$ is the original feature value, $X_{min}$ is the minimum value in the feature column, and $X_{max}$ is the maximum value in the feature column.
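For example, in the Age column used below, $X_{min} = 25$ and $X_{max} = 50$, so an age of 35 becomes $(35 - 25)/(50 - 25) = 0.4$.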
Why use Min-Max Scaling?

Min-Max scaling guarantees that every value lands in a known, bounded interval, which is convenient when an algorithm expects inputs in a fixed range and makes differently scaled features directly comparable on a common [0, 1] scale.
However, Min-Max scaling is quite sensitive to outliers. A single very large or very small value can significantly compress the rest of the data into a narrow range.
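A minimal sketch with a hypothetical feature containing one extreme value illustrates this compression:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature with one extreme outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
print(MinMaxScaler().fit_transform(values).ravel())
# [0.         0.01010101 0.02020202 0.03030303 1.        ]
# The four typical values are squeezed into about 3% of the [0, 1] range.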
Here's how to apply Min-Max scaling using Scikit-learn's MinMaxScaler.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Sample DataFrame (same as before)
data = {'Age': [25, 30, 35, 40, 45, 50],
        'Salary': [50000, 60000, 75000, 90000, 110000, 150000]}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# Initialize the scaler (default feature_range is (0, 1);
# pass e.g. feature_range=(-1, 1) to use a different range)
min_max_scaler = MinMaxScaler()
# Fit and transform
scaled_data_minmax = min_max_scaler.fit_transform(df[['Age', 'Salary']])
# Convert back to DataFrame
df_scaled_minmax = pd.DataFrame(scaled_data_minmax, columns=['Age_scaled', 'Salary_scaled'])
print("\nMin-Max Scaled Data:")
print(df_scaled_minmax)
print(f"\nMin after scaling:\n{df_scaled_minmax.min()}")
print(f"\nMax after scaling:\n{df_scaled_minmax.max()}")
As expected, the minimum values are now 0 and the maximum values are 1 for both scaled columns.
The choice often depends on the specific algorithm you plan to use and the nature of your data. For example, algorithms that assume zero-centered inputs, such as PCA, pair naturally with standardization, while bounded inputs in [0, 1] are a common convention for neural networks. Also consider the distributions you observed during univariate analysis: if a feature has extreme outliers, standardization might be a safer default choice, since Min-Max scaling would compress the remaining values into a narrow band. If the features have very different scales but no extreme outliers, Min-Max scaling can work well.
Let's visualize the effect of scaling on the 'Salary' feature from our example data. We'll use histograms.
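Here is one way to produce such a comparison, a minimal sketch assuming matplotlib is available and reusing df, df_scaled, and df_scaled_minmax from the examples above:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

axes[0].hist(df['Salary'], bins=6, edgecolor='black')
axes[0].set_title('Original Salary')

axes[1].hist(df_scaled['Salary_scaled'], bins=6, edgecolor='black')
axes[1].set_title('Standardized')

axes[2].hist(df_scaled_minmax['Salary_scaled'], bins=6, edgecolor='black')
axes[2].set_title('Min-Max Scaled')

plt.tight_layout()
plt.show()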
Comparison of the Salary distribution before and after Standardization and Min-Max scaling. Notice how the x-axis changes, reflecting the new scales (approximately -1 to 2 for Standardized, 0 to 1 for Min-Max). The overall shape of the distribution remains the same; only the scale and location are altered.
When preparing data for machine learning models, it's essential to prevent data leakage. This means that information from the test set should not influence the training process, including the parameters used for scaling (like the mean, standard deviation, min, or max).
The correct procedure is:
1. Fit the scaler (e.g., StandardScaler or MinMaxScaler) only on the training data. This learns the scaling parameters (μ, σ, min, max) from the training data.
2. Transform both the training and test data using those learned parameters.

from sklearn.preprocessing import StandardScaler

# Assume X_train and X_test are your feature sets (pandas DataFrames or NumPy arrays)
scaler = StandardScaler()  # or MinMaxScaler()

# Fit ONLY on the training data to learn the scaling parameters
scaler.fit(X_train)

# Transform both training and test data with the parameters learned from X_train
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Tip: scaler.fit_transform(X_train) combines the fit and transform steps in one call
Applying transformations like scaling is a direct outcome of the understanding gained during EDA. By examining feature ranges, distributions, and considering the requirements of subsequent modeling steps, you can make informed decisions about how to best prepare your data.