While Standardization and Min-Max scaling are common techniques for bringing numerical features onto a common scale, they share a potential vulnerability: sensitivity to outliers. Remember that Standardization uses the mean (μ) and standard deviation (σ), while Min-Max scaling uses the minimum and maximum values. Outliers, being extreme values, can significantly distort these statistics. A single very large or very small value can drastically shift the mean, inflate the standard deviation, or alter the min/max values, leading to compressed inlier data and potentially suboptimal scaling for the bulk of your observations.
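To see this distortion concretely, here is a small sketch comparing these statistics with and without a single outlier:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(values, 100.0)

# One outlier shifts the mean from 3.0 to about 19.2 and inflates
# the (population) standard deviation from about 1.41 to about 36.2.
print(values.mean(), values.std())
print(with_outlier.mean(), with_outlier.std())

# It also redefines the maximum used by Min-Max scaling.
print(values.max(), with_outlier.max())
```

After standardizing with those inflated statistics, the five original points all land within a fraction of one standard deviation of each other.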
When your dataset contains significant outliers, using a scaling technique that is less sensitive to these extreme values is often beneficial. This is where `RobustScaler` from Scikit-learn comes in. Instead of the mean and standard deviation or the min/max, `RobustScaler` uses statistics that are more robust to outliers: the median and the interquartile range (IQR).
The IQR is the range between the 1st quartile (25th percentile, Q1) and the 3rd quartile (75th percentile, Q3) of the data: IQR = Q3 − Q1. By definition, the IQR contains the central 50% of the data, making it much less affected by extreme values in the tails of the distribution than the standard deviation or the overall range (max − min).
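As a quick check, the quartiles and IQR can be computed with NumPy, whose default linear interpolation between sorted values is also what Scikit-learn relies on internally:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 100])

# Q1 and Q3 via linear interpolation over the sorted values.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(q1, q3, iqr)  # 2.25 4.75 2.5
```

Note that the extreme value 100 barely moves the quartiles, which is exactly the robustness property being exploited.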
`RobustScaler` centers the data by subtracting the median and then scales it by dividing by the IQR. The formula is:

X_scaled = (X − Median) / IQR
If a data point is exactly the median, it is scaled to 0. With the default (25, 75) quantile range, a point at the 1st quartile maps to (Q1 − Median) / (Q3 − Q1) and a point at the 3rd quartile maps to (Q3 − Median) / (Q3 − Q1); for a symmetric distribution these work out to −0.5 and 0.5, respectively. The key idea is that quantiles are resistant to outliers.
Consider a feature with values [1, 2, 3, 4, 5, 100].

If we use `StandardScaler`, the outlier (100) heavily influences the mean and standard deviation, causing most data points ([1, 2, 3, 4, 5]) to be squeezed into a small range after scaling. If we use `MinMaxScaler`, the outlier (100) defines the maximum, again squeezing the other points. Using `RobustScaler`, the median (3.5) and IQR (2.5) are much less affected by the value 100, so scaling based on these robust statistics preserves the relative spacing of the non-outlier points more effectively.
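This can be verified by hand. The sketch below reproduces `RobustScaler`'s output for these six values directly from the median and IQR:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([1, 2, 3, 4, 5, 100], dtype=float).reshape(-1, 1)

median = np.median(x)                # 3.5
q1, q3 = np.percentile(x, [25, 75])  # 2.25, 4.75
manual = (x - median) / (q3 - q1)    # (x - 3.5) / 2.5

scaled = RobustScaler().fit_transform(x)
print(np.allclose(manual, scaled))  # True
```

The outlier itself maps to (100 − 3.5) / 2.5 = 38.6, far from the inliers, but the inliers keep their spacing.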
Using `RobustScaler` is straightforward and follows the familiar Scikit-learn transformer API.
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler
import plotly.graph_objects as go
# Sample data with an outlier
data = np.array([1, 2, 3, 4, 5, 100]).reshape(-1, 1)
df = pd.DataFrame(data, columns=['Feature'])
# Initialize scalers
robust_scaler = RobustScaler()
standard_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()
# Apply scalers
df['RobustScaled'] = robust_scaler.fit_transform(df[['Feature']])
df['StandardScaled'] = standard_scaler.fit_transform(df[['Feature']])
df['MinMaxScaled'] = minmax_scaler.fit_transform(df[['Feature']])
print(df)
# Visualization (Optional - demonstrating the effect)
fig = go.Figure()
# Original Data (shifted slightly for visibility)
fig.add_trace(go.Scatter(
x=df.index - 0.1, y=df['Feature'], mode='markers', name='Original',
marker=dict(color='#adb5bd', size=8)
))
# Robust Scaled
fig.add_trace(go.Scatter(
x=df.index - 0.03, y=df['RobustScaled'], mode='markers', name='Robust Scaled',
marker=dict(color='#7950f2', size=8) # violet
))
# Standard Scaled
fig.add_trace(go.Scatter(
x=df.index + 0.03, y=df['StandardScaled'], mode='markers', name='Standard Scaled',
marker=dict(color='#1c7ed6', size=8) # blue
))
# MinMax Scaled
fig.add_trace(go.Scatter(
x=df.index + 0.1, y=df['MinMaxScaled'], mode='markers', name='MinMax Scaled',
marker=dict(color='#12b886', size=8) # teal
))
fig.update_layout(
title='Comparison of Scaling Methods with an Outlier',
xaxis_title='Data Point Index',
yaxis_title='Scaled Value',
legend_title='Scaler Type',
template='plotly_white',
width=700,
height=400
)
# To display the plot (if running interactively)
# fig.show()
# To represent the chart in markdown format:
plotly_json = fig.to_json(pretty=False)
print(f'\n```plotly\n{plotly_json}\n```')
Executing this code produces the following DataFrame:
Feature RobustScaled StandardScaled MinMaxScaled
0 1 -1.000000 -0.500674 0.000000
1 2 -0.600000 -0.473060 0.010101
2 3 -0.200000 -0.445446 0.020202
3 4 0.200000 -0.417832 0.030303
4 5 0.600000 -0.390218 0.040404
5 100 38.600000 2.227231 1.000000
Comparison of how `RobustScaler`, `StandardScaler`, and `MinMaxScaler` handle a dataset containing a significant outlier (value 100). Notice how `RobustScaler` keeps the non-outlier points (indices 0-4) more spread out compared to the other two methods.
Observe how the first five points (the "inliers") maintain a more reasonable spread with `RobustScaler` (ranging from −1.0 to 0.6) compared to `StandardScaler` (ranging from −0.50 to −0.39) or `MinMaxScaler` (ranging from 0.0 to 0.04). The outlier (100) is still present but no longer dominates the scale for the rest of the data.
Key parameters for `RobustScaler`:

- `with_centering`: Boolean (default `True`). If `True`, center the data by subtracting the median before scaling.
- `with_scaling`: Boolean (default `True`). If `True`, scale the data by dividing by the IQR.
- `quantile_range`: Tuple (float, float), default `(25.0, 75.0)`. Specifies the quantile range used to compute the scale (the IQR by default). You could adjust this, for example, to use the 10th and 90th percentiles: `(10.0, 90.0)`.
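A minimal sketch of the `quantile_range` parameter, using the same six-value example: a wider range bases the scale on more of the distribution, which makes it somewhat less robust to extremes.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([1, 2, 3, 4, 5, 100], dtype=float).reshape(-1, 1)

# Default: scale by the IQR (25th to 75th percentile range).
default_scaler = RobustScaler()
# Wider range: scale by the 10th to 90th percentile range instead.
wide_scaler = RobustScaler(quantile_range=(10.0, 90.0))

print(default_scaler.fit_transform(x).ravel())
print(wide_scaler.fit_transform(x).ravel())

# The 90th percentile here is pulled toward the outlier, so the
# wider range compresses the inliers more than the default does.
print(default_scaler.scale_, wide_scaler.scale_)  # [2.5] [51.]
```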
`RobustScaler` is a good choice when:

- Your data contains significant outliers that you do not want to remove before scaling.
- You want the bulk of your data to retain a useful spread after scaling, which `StandardScaler` or `MinMaxScaler` might not preserve in the presence of outliers.

However, if your data is roughly normally distributed and doesn't have significant outliers, `StandardScaler`
might be a more conventional choice, often aligning well with the assumptions of certain statistical models. As with most feature engineering decisions, the best approach often involves experimenting with different scaling methods and evaluating their impact on model performance through cross-validation.
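One way to run such a comparison is to wrap each scaler in a `Pipeline` and score it with cross-validation, so the scaling statistics are fit only on the training folds. A sketch, using a synthetic regression dataset assumed here purely for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

for name, scaler in [('robust', RobustScaler()),
                     ('standard', StandardScaler()),
                     ('minmax', MinMaxScaler())]:
    # Scaling inside the pipeline avoids leaking test-fold statistics.
    pipe = Pipeline([('scale', scaler), ('model', Ridge())])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f'{name}: mean R^2 = {scores.mean():.3f}')
```

On clean synthetic data like this, the three scalers typically score similarly; the differences tend to appear once genuine outliers enter the features.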
© 2025 ApX Machine Learning