As highlighted in the chapter introduction, many machine learning algorithms perform significantly better when numerical input features are on a similar scale. Think about algorithms like K-Nearest Neighbors (KNN) that calculate distances, or optimization methods like gradient descent used in linear models and neural networks. If one feature ranges from 0 to 1, while another ranges from 0 to 1,000,000, the algorithm might inadvertently give much more weight to the feature with the larger range simply because of its scale, not necessarily its importance. Feature scaling addresses this by transforming the data so that features have comparable ranges or distributions.
Scikit-learn offers several tools for feature scaling, primarily through its transformer API within the sklearn.preprocessing module. Let's examine the most common techniques:
Standardization rescales data so that it has a mean (μ) of 0 and a standard deviation (σ) of 1. This process is often called Z-score normalization. For each feature value x, the standardized value z is calculated as:
$$z = \frac{x - \mu}{\sigma}$$

Here, μ is the mean of the feature values, and σ is the standard deviation.
Standardization does not bind values to a specific range (such as [0, 1]), a common point of confusion. It centers the data around zero and scales it based on the standard deviation. This method is widely used and is often beneficial for algorithms that assume data is centered around zero or follows a Gaussian distribution (although standardization itself doesn't make the data Gaussian). It's less affected by outliers than min-max scaling, but significant outliers can still influence the calculated mean and standard deviation.
Common Use Cases: algorithms that assume zero-centered or roughly Gaussian-distributed inputs, such as support vector machines, logistic regression, and principal component analysis (PCA), as well as models trained with gradient descent.
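To make the formula concrete, here is a minimal sketch that applies the z-score calculation directly with NumPy. The small feature array is made up for illustration; Scikit-learn's StandardScaler performs an equivalent computation.

```python
import numpy as np

# Made-up feature values, including one large value for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])

mu = x.mean()          # mean of the feature
sigma = x.std()        # standard deviation of the feature
z = (x - mu) / sigma   # standardized values

print(z.mean())  # approximately 0.0
print(z.std())   # approximately 1.0
```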
Min-Max scaling, often simply called normalization, rescales each feature to a given range, typically [0, 1] or [-1, 1]. The transformation for scaling to [0, 1] is given by the formula:
$$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

Here, $X_{min}$ and $X_{max}$ are the minimum and maximum values of the feature, respectively.
This approach guarantees that all features will have the exact same scale. However, it's quite sensitive to outliers. A single very large or very small value can drastically compress the rest of the data into a very small portion of the [0, 1] range.
Common Use Cases: models that expect inputs in a bounded range, such as neural networks with sigmoid or tanh activations, and features with a known fixed range, such as image pixel intensities.
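Applying the same idea by hand shows how sensitive this scaling is to a single outlier. Again, the array is made up for illustration; MinMaxScaler performs the equivalent [0, 1] scaling.

```python
import numpy as np

# Made-up feature values with one extreme outlier (100.0).
x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])

x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)  # the outlier maps to 1.0; the other values are squeezed near 0
```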
When your dataset contains significant outliers, both Standardization and Min-Max scaling can be problematic. Outliers can heavily influence the mean/standard deviation (for Standardization) or the min/max values (for Min-Max scaling). RobustScaler uses statistics that are more resistant, or robust, to outliers.
It works by removing the median and scaling the data according to the Interquartile Range (IQR). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). The transformation subtracts the median (Q2) and divides by the IQR (Q3−Q1):
$$X_{scaled} = \frac{X - Q_2(X)}{Q_3(X) - Q_1(X)}$$

By using the median and IQR, RobustScaler centers the data and scales it without being unduly influenced by extreme values at the tails of the distribution.
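Repeating the exercise on the same made-up feature values shows the difference: computing the median and IQR with NumPy gives a scaling that the outlier barely affects, which mirrors what RobustScaler does with its default settings.

```python
import numpy as np

# The same made-up feature values with one extreme outlier (100.0).
x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])

q1, q2, q3 = np.percentile(x, [25, 50, 75])  # quartiles; q2 is the median
x_scaled = (x - q2) / (q3 - q1)              # subtract the median, divide by the IQR

print(x_scaled)  # the non-outlier values keep a sensible spread around 0
```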
Common Use Cases: datasets containing significant outliers that would skew the statistics relied on by StandardScaler or MinMaxScaler.
There's no single best scaler for all situations. StandardScaler is often a good default choice, especially for algorithms assuming zero-centered data or Gaussian distributions. MinMaxScaler is useful when you need data bounded within a specific range, but be mindful of its sensitivity to outliers. RobustScaler is the preferred option when dealing with data containing significant outliers.

The performance impact of different scaling techniques can vary depending on the algorithm used and the specific characteristics of the data. It's often beneficial to experiment with different scalers as part of your model development process.
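As a brief sketch of such an experiment, the snippet below applies all three scalers to a made-up single-feature dataset containing one outlier, using fit_transform as a shortcut; the fit and transform workflow itself is covered in the next section.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Made-up single-feature dataset with one extreme outlier.
X = np.array([[2.0], [4.0], [6.0], [8.0], [100.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    X_scaled = scaler.fit_transform(X)
    print(type(scaler).__name__, X_scaled.ravel().round(2))
```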
In the next section, we will see how to apply these scalers using Scikit-learn's fit and transform methods.