As discussed, many machine learning algorithms perform better when numerical input features are scaled to a standard range. Scikit-learn provides convenient tools called transformers to perform these preprocessing steps. These transformers follow a consistent API, making them easy to integrate into your workflow.
The core methods you'll use with transformers like scalers are:
fit(X): This method learns the parameters required for the transformation from the input data X. For example, StandardScaler calculates the mean (μ) and standard deviation (σ) of each feature in X. Importantly, you should only call fit on your training data to avoid data leakage from the test set.

transform(X): This method applies the learned transformation to the input data X. It uses the parameters computed during the fit step. You will use this method to transform both your training and testing data (and any new data).

fit_transform(X): This is a convenience method that performs both fitting and transforming in a single step on the same data X. It's often used on the training set for efficiency. However, remember to use only the transform method on the test set, using the scaler fitted on the training data.

Let's see how to apply the common scalers using this API.
StandardScaler transforms your data such that it has a mean of 0 and a standard deviation of 1. This process is often called standardization. The formula applied to each feature is:

z = (x − μ) / σ

where x is the original feature value, μ is the mean of the feature, and σ is its standard deviation.
Here's how you can use it:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Sample data (e.g., two features)
data = pd.DataFrame({
'FeatureA': [10, 20, 30, 40, 50],
'FeatureB': [100, 110, 90, 120, 105]
})
# 1. Initialize the scaler
scaler = StandardScaler()
# 2. Fit the scaler to the data and transform it
# (Using fit_transform for demonstration on the 'training' data)
scaled_data = scaler.fit_transform(data)
# The output is a NumPy array
print("Original Data:\n", data)
print("\nScaled Data (StandardScaler):\n", scaled_data)
# To see the learned parameters:
print("\nMean learned by StandardScaler:", scaler.mean_)
print("Scale (Std Dev) learned by StandardScaler:", scaler.scale_)
The scaled_data NumPy array now contains the standardized values for FeatureA and FeatureB. Notice how the output is a NumPy array, even if the input was a Pandas DataFrame. If you need to work with a DataFrame, you can convert it back:
scaled_df_standard = pd.DataFrame(scaled_data, columns=data.columns)
print("\nScaled Data as DataFrame (StandardScaler):\n", scaled_df_standard)
MinMaxScaler scales the data to a fixed range, typically [0, 1]. It calculates the transformation based on the minimum and maximum values of each feature:

x_scaled = (x − min(x)) / (max(x) − min(x))
This is useful when you need features bounded within a specific range, although it can be sensitive to outliers because the min and max values dictate the scaling.
from sklearn.preprocessing import MinMaxScaler
# Using the same sample data
data = pd.DataFrame({
'FeatureA': [10, 20, 30, 40, 50],
'FeatureB': [100, 110, 90, 120, 105]
})
# 1. Initialize the scaler (default range is [0, 1])
min_max_scaler = MinMaxScaler()
# You can specify a range, e.g., MinMaxScaler(feature_range=(-1, 1))
# 2. Fit and transform
scaled_data_minmax = min_max_scaler.fit_transform(data)
print("Original Data:\n", data)
print("\nScaled Data (MinMaxScaler):\n", scaled_data_minmax)
scaled_df_minmax = pd.DataFrame(scaled_data_minmax, columns=data.columns)
print("\nScaled Data as DataFrame (MinMaxScaler):\n", scaled_df_minmax)
Observe that all values in scaled_data_minmax now fall between 0 and 1.
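If you need a range other than [0, 1], pass feature_range when constructing the scaler; inverse_transform undoes the scaling. A short sketch reusing the data above (the variable names here are just illustrative):
# Scale to [-1, 1] instead of the default [0, 1]
min_max_scaler_neg = MinMaxScaler(feature_range=(-1, 1))
scaled_neg = min_max_scaler_neg.fit_transform(data)
print("\nScaled Data (MinMaxScaler, range [-1, 1]):\n", scaled_neg)
# Per-feature minimum and maximum learned during fit
print("\nData min learned:", min_max_scaler_neg.data_min_)
print("Data max learned:", min_max_scaler_neg.data_max_)
# Undo the scaling to recover the original values
print("\nRecovered original values:\n", min_max_scaler_neg.inverse_transform(scaled_neg))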
RobustScaler uses statistics that are robust to outliers. Instead of using mean and standard deviation (like StandardScaler) or min and max (like MinMaxScaler), it uses the median and the interquartile range (IQR). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
The transformation is:
x_scaled = (x − median(x)) / IQR
This makes RobustScaler a good choice when your dataset contains significant outliers that might unduly influence StandardScaler or MinMaxScaler.
from sklearn.preprocessing import RobustScaler
# Sample data with an outlier in FeatureB
data_outlier = pd.DataFrame({
'FeatureA': [10, 20, 30, 40, 50],
'FeatureB': [100, 110, 90, 120, 500] # Added an outlier
})
# 1. Initialize the scaler
robust_scaler = RobustScaler()
# 2. Fit and transform
scaled_data_robust = robust_scaler.fit_transform(data_outlier)
print("Original Data with Outlier:\n", data_outlier)
print("\nScaled Data (RobustScaler):\n", scaled_data_robust)
scaled_df_robust = pd.DataFrame(scaled_data_robust, columns=data_outlier.columns)
print("\nScaled Data as DataFrame (RobustScaler):\n", scaled_df_robust)
# Compare with StandardScaler on the same outlier data
scaler_std = StandardScaler()
scaled_data_std_outlier = scaler_std.fit_transform(data_outlier)
print("\nScaled Data with Outlier (StandardScaler):\n", scaled_data_std_outlier)
Notice how the scaling of FeatureB differs between RobustScaler and StandardScaler when an outlier is present. RobustScaler centers the data around the median and scales by the IQR, making the non-outlier points less affected by the extreme value (500). StandardScaler, influenced by the outlier, results in the other points being clustered more closely together after scaling.
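To make the difference concrete, you can inspect the statistics each scaler learned from data_outlier, continuing from the robust_scaler and scaler_std fitted above:
# Statistics learned by RobustScaler: per-feature median and IQR
print("\nRobustScaler center_ (median per feature):", robust_scaler.center_)
print("RobustScaler scale_ (IQR per feature):", robust_scaler.scale_)
# Statistics learned by StandardScaler: per-feature mean and standard deviation
print("\nStandardScaler mean_ (mean per feature):", scaler_std.mean_)
print("StandardScaler scale_ (std dev per feature):", scaler_std.scale_)
The outlier in FeatureB shifts the mean and standard deviation far more than it moves the median and IQR, which explains the difference you see in the scaled outputs.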
The following chart visualizes the distribution of 'FeatureB' from the data_outlier example before scaling, after StandardScaler, and after RobustScaler.
Comparison of 'FeatureB' distributions before and after applying StandardScaler and RobustScaler. Note how RobustScaler maintains a better separation between the bulk of the data and the outlier compared to StandardScaler.
Choosing the right scaler depends on the specifics of your data and the requirements of the machine learning algorithm you plan to use. StandardScaler is a common default, but MinMaxScaler is useful for algorithms requiring bounded inputs, and RobustScaler is preferable when dealing with outliers. Remember the fundamental rule: fit on training data, transform both training and test data. Scikit-learn's consistent transformer API makes switching between these techniques straightforward.
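As a closing illustration of that rule, here is a minimal sketch of the train/test workflow. The dataset, split, and variable names are purely illustrative, not part of the examples above:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Illustrative dataset
X = pd.DataFrame({
    'FeatureA': [10, 20, 30, 40, 50, 60, 70, 80],
    'FeatureB': [100, 110, 90, 120, 105, 95, 115, 125]
})
# Split before scaling so the test set stays unseen
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)
scaler = StandardScaler()
# Fit on the training data only, then transform both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Scaled training data:\n", X_train_scaled)
print("\nScaled test data (using training statistics):\n", X_test_scaled)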