Alright, let's put theory into practice. You've learned about various methods for scaling and transforming numerical features. Now, we'll apply these techniques using Python's Scikit-learn library on a sample dataset. This hands-on exercise will solidify your understanding of how StandardScaler, MinMaxScaler, RobustScaler, PowerTransformer, and QuantileTransformer work and how their effects differ.
First, let's import the necessary libraries and create a sample dataset. We'll use Pandas for data manipulation, NumPy for numerical operations, Matplotlib and Seaborn for visualization, and Scikit-learn for the scaling and transformation tools.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, PowerTransformer, QuantileTransformer
from sklearn.model_selection import train_test_split
import scipy.stats as stats
# Generate synthetic data with different scales and skewness
np.random.seed(42) # for reproducibility
data = pd.DataFrame({
'Feature_A': np.random.rand(100) * 100, # Scale 0-100
'Feature_B': np.random.randn(100) * 10 + 50, # Normal distribution, different scale
'Feature_C': np.random.exponential(scale=20, size=100) + 1, # Exponential (skewed), scale > 0
'Feature_D': np.random.rand(100) * 10 - 5 # Includes negative values
})
# Add some outliers to Feature_A
data.loc[[10, 30, 90], 'Feature_A'] = [250, -80, 300]
# Split data for demonstration (optional but good practice)
# In a real scenario, fit transformers ONLY on training data
X_train, X_test = train_test_split(data, test_size=0.3, random_state=42)
print("Original Data Description (Training Set):")
print(X_train.describe())
print("\nOriginal Data Head (Training Set):")
print(X_train.head())
Before applying any transformations, let's visualize the distributions of our training features. This helps identify differing scales and skewness, highlighting why scaling and transformation might be necessary.
# Visualize the original training distributions
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
sns.histplot(X_train['Feature_A'], kde=True, ax=axes[0], color='#4dabf7')
axes[0].set_title('Feature_A Distribution')
sns.histplot(X_train['Feature_B'], kde=True, ax=axes[1], color='#748ffc')
axes[1].set_title('Feature_B Distribution')
sns.histplot(X_train['Feature_C'], kde=True, ax=axes[2], color='#f06595')
axes[2].set_title('Feature_C Distribution (Skewed)')
sns.histplot(X_train['Feature_D'], kde=True, ax=axes[3], color='#94d82d')
axes[3].set_title('Feature_D Distribution')
plt.tight_layout()
# plt.show()
Let's look at the distribution of Feature_C, which shows noticeable skewness.
Distribution plot for Feature_C, showing its right-skewed nature before any transformation.
We can observe varying ranges (e.g., Feature_A potentially from -80 to 300, Feature_D from -5 to 5) and different distribution shapes (Feature_C is clearly skewed to the right).
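To back up the visual impression with a number, we can check the sample skewness of each training feature (a quick sketch; the exact values depend on the random seed and split):
# Sample skewness per feature: values near 0 indicate symmetry,
# large positive values indicate a long right tail (expected for Feature_C)
print("\nSkewness of training features:")
print(X_train.skew().round(2))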
Scaling adjusts the range of features without significantly changing the shape of their distribution. Remember to fit the scaler on the training data and then transform both the training and test data.
StandardScaler removes the mean and scales features to unit variance. The formula is z = (x − μ) / σ.
# Initialize and fit StandardScaler
scaler_standard = StandardScaler()
scaler_standard.fit(X_train) # Fit ONLY on training data
# Transform training and test data
X_train_std = scaler_standard.transform(X_train)
X_test_std = scaler_standard.transform(X_test)
# Convert back to DataFrame for easier inspection
X_train_std_df = pd.DataFrame(X_train_std, columns=X_train.columns, index=X_train.index)
print("\nStandardized Data Description (Training Set):")
print(X_train_std_df.describe().round(2)) # Mean should be ~0, std dev ~1
Notice how the mean is approximately 0 and the std (standard deviation) is approximately 1 for all features after standardization.
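As a quick sanity check (a minimal sketch), you can verify this numerically; note that StandardScaler divides by the population standard deviation (ddof=0):
# Means should be ~0 and population standard deviations ~1 after standardization
print(np.allclose(X_train_std_df.mean(), 0))      # expected: True
print(X_train_std_df.std(ddof=0).round(2))        # expected: ~1.0 for every feature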
MinMaxScaler scales features to a fixed range, typically [0, 1]. The formula is X_scaled = (x − min(x)) / (max(x) − min(x)).
# Initialize and fit MinMaxScaler
scaler_minmax = MinMaxScaler()
scaler_minmax.fit(X_train)
# Transform training and test data
X_train_minmax = scaler_minmax.transform(X_train)
X_test_minmax = scaler_minmax.transform(X_test)
# Convert back to DataFrame
X_train_minmax_df = pd.DataFrame(X_train_minmax, columns=X_train.columns, index=X_train.index)
print("\nMin-Max Scaled Data Description (Training Set):")
print(X_train_minmax_df.describe().round(2)) # Min should be 0, max should be 1
Here, the min is 0 and the max is 1 for all features, as expected.
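Because MinMaxScaler uses the observed minimum and maximum, any outliers that landed in the training split define Feature_A's entire range. A small sketch inspecting the fitted attributes makes this visible:
# Per-feature min and max learned from the training data;
# Feature_A's range is stretched by the injected outlier values
print("data_min_:", scaler_minmax.data_min_.round(2))
print("data_max_:", scaler_minmax.data_max_.round(2))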
RobustScaler uses statistics that are robust to outliers. It removes the median and scales the data according to a quantile range (by default the IQR: Q3 − Q1).
# Initialize and fit RobustScaler
scaler_robust = RobustScaler()
scaler_robust.fit(X_train)
# Transform training and test data
X_train_robust = scaler_robust.transform(X_train)
X_test_robust = scaler_robust.transform(X_test)
# Convert back to DataFrame
X_train_robust_df = pd.DataFrame(X_train_robust, columns=X_train.columns, index=X_train.index)
print("\nRobust Scaled Data Description (Training Set):")
print(X_train_robust_df.describe().round(2)) # Median should be ~0
RobustScaler centers the data around the median (which becomes approximately 0) and scales by the IQR. This approach is less influenced by the large outlier values we introduced in Feature_A.
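You can inspect the fitted statistics directly: RobustScaler stores the per-feature medians in center_ and the quantile ranges (IQRs by default) in scale_. A brief sketch:
# Medians used for centering and IQRs used for scaling, one value per feature
print("center_ (medians):", scaler_robust.center_.round(2))
print("scale_ (IQRs):", scaler_robust.scale_.round(2))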
Let's visualize the effect of these scalers on Feature_A, which has outliers.
{"layout": {"title": "Effect of Scalers on Feature_A (with Outliers)", "xaxis": {"title": "Scaled Value"}, "yaxis": {"title": "Density"}, "autosize": true, "legend": {"title": {"text": "Scaler Type"}}}, "data": [{"type": "histogram", "x": X_train['Feature_A'].values.tolist(), "name": "Original", "marker": {"color": "#adb5bd"}, "histnorm": "probability density", "opacity": 0.7}, {"type": "histogram", "x": X_train_std_df['Feature_A'].values.tolist(), "name": "StandardScaler", "marker": {"color": "#4dabf7"}, "histnorm": "probability density", "opacity": 0.7}, {"type": "histogram", "x": X_train_minmax_df['Feature_A'].values.tolist(), "name": "MinMaxScaler", "marker": {"color": "#748ffc"}, "histnorm": "probability density", "opacity": 0.7}, {"type": "histogram", "x": X_train_robust_df['Feature_A'].values.tolist(), "name": "RobustScaler", "marker": {"color": "#f06595"}, "histnorm": "probability density", "opacity": 0.7}]}
Comparison of
Feature_A
distributions after applying different scalers. Note howRobustScaler
concentrates the bulk of the data compared toStandardScaler
andMinMaxScaler
, which are more affected by the outliers stretching their ranges.
Transformations aim to change the shape of the distribution, often to make it more Gaussian-like or uniform. This can be beneficial for models that assume normality.
PowerTransformer applies the Box-Cox (requires strictly positive data) or Yeo-Johnson (handles positive, zero, and negative data) transformation to stabilize variance and minimize skewness.
Let's apply Yeo-Johnson to Feature_C (skewed, positive) and Feature_D (includes negative values).
# Initialize and fit PowerTransformer (Yeo-Johnson)
pt_yj = PowerTransformer(method='yeo-johnson', standardize=True) # standardize=True applies Z-scaling after transformation
# Fit on the training data for selected columns
pt_yj.fit(X_train[['Feature_C', 'Feature_D']])
# Transform training data
X_train_yj = pt_yj.transform(X_train[['Feature_C', 'Feature_D']])
X_train_yj_df = pd.DataFrame(X_train_yj, columns=['Feature_C_yj', 'Feature_D_yj'], index=X_train.index)
# Apply Box-Cox only to Feature_C (must be strictly positive)
pt_bc = PowerTransformer(method='box-cox', standardize=True)
pt_bc.fit(X_train[['Feature_C']]) # Fit only on Feature_C
# Transform Feature_C in training data
X_train_bc = pt_bc.transform(X_train[['Feature_C']])
X_train_bc_df = pd.DataFrame(X_train_bc, columns=['Feature_C_bc'], index=X_train.index)
# Combine transformed features for visualization
X_train_transformed = pd.concat([X_train_yj_df, X_train_bc_df], axis=1)
print("\nTransformed Data Head (Training Set):")
print(X_train_transformed.head())
Let's visualize the original Feature_C versus its transformed versions.
{"layout": {"title": "Power Transformations on Feature_C", "xaxis": {"title": "Value"}, "yaxis": {"title": "Density"}, "autosize": true, "legend": {"title": {"text": "Transformation"}}}, "data": [{"type": "histogram", "x": X_train['Feature_C'].values.tolist(), "name": "Original", "marker": {"color": "#adb5bd"}, "histnorm": "probability density", "opacity": 0.7, "nbinsx": 20}, {"type": "histogram", "x": X_train_transformed['Feature_C_yj'].values.tolist(), "name": "Yeo-Johnson", "marker": {"color": "#ae3ec9"}, "histnorm": "probability density", "opacity": 0.7, "nbinsx": 20}, {"type": "histogram", "x": X_train_transformed['Feature_C_bc'].values.tolist(), "name": "Box-Cox", "marker": {"color": "#be4bdb"}, "histnorm": "probability density", "opacity": 0.7, "nbinsx": 20}]}
Comparison of
Feature_C
distributions: original (skewed), after Yeo-Johnson transformation, and after Box-Cox transformation. Both transformations significantly reduce skewness, making the distribution more symmetric.
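To quantify the improvement, we can compare skewness before and after and inspect the fitted power parameters, which PowerTransformer exposes as lambdas_. This is a sketch; the exact numbers depend on the generated data:
# Skewness before vs. after the power transformations
print("Original Feature_C skew:", round(X_train['Feature_C'].skew(), 2))
print("Yeo-Johnson skew:", round(X_train_transformed['Feature_C_yj'].skew(), 2))
print("Box-Cox skew:", round(X_train_transformed['Feature_C_bc'].skew(), 2))
# Lambda parameters estimated by maximum likelihood during fit
print("Yeo-Johnson lambdas_:", pt_yj.lambdas_.round(3))
print("Box-Cox lambdas_:", pt_bc.lambdas_.round(3))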
We can also use probability plots (Q-Q plots) to visually assess how close the transformed distribution is to a normal distribution. Points falling approximately on the diagonal line suggest normality.
# Visualize normality with Q-Q plots (conceptual using Matplotlib/SciPy)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
stats.probplot(X_train['Feature_C'], dist="norm", plot=axes[0])
axes[0].set_title('Original Feature_C Q-Q Plot')
stats.probplot(X_train_transformed['Feature_C_yj'], dist="norm", plot=axes[1])
axes[1].set_title('Yeo-Johnson Feature_C Q-Q Plot')
stats.probplot(X_train_transformed['Feature_C_bc'], dist="norm", plot=axes[2])
axes[2].set_title('Box-Cox Feature_C Q-Q Plot')
plt.tight_layout()
# plt.show()  # after transformation, the points should fall much closer to the diagonal line
The Q-Q plots would visually confirm that the points for the transformed features align much better with the diagonal line compared to the original skewed feature, indicating a closer approximation to a normal distribution.
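If you want a numeric complement to the Q-Q plots, a normality test such as Shapiro-Wilk (available via the scipy.stats module we imported as stats) can be applied; this is a sketch, and with only around 70 training points the test has limited power:
# Shapiro-Wilk test: larger p-values mean weaker evidence against normality
for name, values in [('Original', X_train['Feature_C']),
                     ('Yeo-Johnson', X_train_transformed['Feature_C_yj']),
                     ('Box-Cox', X_train_transformed['Feature_C_bc'])]:
    stat, p = stats.shapiro(values)
    print(f"{name}: W={stat:.3f}, p={p:.3f}")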
QuantileTransformer maps the data distribution to a uniform or normal distribution based on quantiles. It can make dissimilar distributions more alike.
# Initialize and fit QuantileTransformer (to Uniform)
qt_uniform = QuantileTransformer(output_distribution='uniform', n_quantiles=min(len(X_train), 100), random_state=42)
qt_uniform.fit(X_train)
# Transform training data
X_train_qt_uniform = qt_uniform.transform(X_train)
X_train_qt_uniform_df = pd.DataFrame(X_train_qt_uniform, columns=X_train.columns, index=X_train.index)
# Initialize and fit QuantileTransformer (to Normal)
qt_normal = QuantileTransformer(output_distribution='normal', n_quantiles=min(len(X_train), 100), random_state=42)
qt_normal.fit(X_train)
# Transform training data
X_train_qt_normal = qt_normal.transform(X_train)
X_train_qt_normal_df = pd.DataFrame(X_train_qt_normal, columns=X_train.columns, index=X_train.index)
print("\nQuantile Transformed Data (Uniform) Description:")
print(X_train_qt_uniform_df.describe().round(2)) # Should be approx uniform [0, 1]
print("\nQuantile Transformed Data (Normal) Description:")
print(X_train_qt_normal_df.describe().round(2)) # Should be approx normal (mean~0, std~1)
Let's visualize the effect of quantile transformation on Feature_C.
{"layout": {"title": "Quantile Transformations on Feature_C", "xaxis": {"title": "Value"}, "yaxis": {"title": "Density"}, "autosize": true, "legend": {"title": {"text": "Transformation"}}}, "data": [{"type": "histogram", "x": X_train['Feature_C'].values.tolist(), "name": "Original", "marker": {"color": "#adb5bd"}, "histnorm": "probability density", "opacity": 0.7, "nbinsx": 20}, {"type": "histogram", "x": X_train_qt_uniform_df['Feature_C'].values.tolist(), "name": "Uniform Output", "marker": {"color": "#1098ad"}, "histnorm": "probability density", "opacity": 0.7, "nbinsx": 20}, {"type": "histogram", "x": X_train_qt_normal_df['Feature_C'].values.tolist(), "name": "Normal Output", "marker": {"color": "#0ca678"}, "histnorm": "probability density", "opacity": 0.7, "nbinsx": 20}]}
Comparison of
Feature_C
distributions: original (skewed), after Quantile Transformation to a uniform distribution, and after Quantile Transformation to a normal distribution. The transformer reshapes the data effectively based on rank.
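Like the other transformers, QuantileTransformer provides inverse_transform, which maps values back to the original feature space (handy when you need interpretable units again). A brief round-trip sketch:
# Map the normal-output values back to the original scale and measure the error
X_train_recovered = qt_normal.inverse_transform(X_train_qt_normal)
print("Max round-trip error:", np.abs(X_train_recovered - X_train.values).max())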
In practice, these transformers are often used as steps within a Scikit-learn Pipeline. This ensures that scaling and transformation are applied correctly during cross-validation and when making predictions on new data.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression # Example model
# Create a pipeline: Robust Scale -> Power Transform (Yeo-Johnson) -> Linear Regression
# Apply transformations only to specific columns if needed (using ColumnTransformer - more advanced)
# For simplicity here, assume we apply to all features passed in
# Example for Feature_C and Feature_D which benefit from Yeo-Johnson
pipeline = Pipeline([
('scaler', RobustScaler()), # Handle potential outliers first
('transformer', PowerTransformer(method='yeo-johnson', standardize=True)),
('model', LinearRegression()) # Example model step
])
# You would then fit this pipeline on your training data (X_train, y_train)
# pipeline.fit(X_train[['Feature_C', 'Feature_D']], y_train) # Assuming y_train exists
# The pipeline automatically applies fit_transform on scaler/transformer during fit
# and transform on scaler/transformer during predict/score
print("\nPipeline created (example structure):")
print(pipeline)
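When only some columns should be transformed, or different columns need different preprocessing, ColumnTransformer can wrap the preprocessing step. The sketch below uses the feature names from this dataset and is just one reasonable arrangement:
from sklearn.compose import ColumnTransformer
# Yeo-Johnson for the skewed/negative features, robust scaling for the rest
preprocess = ColumnTransformer([
    ('power', PowerTransformer(method='yeo-johnson', standardize=True),
     ['Feature_C', 'Feature_D']),
    ('robust', RobustScaler(), ['Feature_A', 'Feature_B'])
])
pipeline_ct = Pipeline([
    ('preprocess', preprocess),
    ('model', LinearRegression())
])
print(pipeline_ct)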
This practical session demonstrated how to apply various scaling and transformation techniques using Scikit-learn:
Scalers (StandardScaler, MinMaxScaler, RobustScaler) adjust the range or scale of features. This matters for distance-based algorithms and gradient descent; RobustScaler is the least sensitive to outliers.
Transformers (PowerTransformer, QuantileTransformer) change the shape of the distribution, often to reduce skewness or to approximate normality or uniformity. This can benefit models that assume specific distributions.
You saw how to fit these transformers on training data and apply them to both training and test sets. Visualizing the distributions before and after applying these methods is a valuable step in understanding their impact. Remember that the choice of technique depends on the characteristics of your data and the requirements of the machine learning model you intend to use. Experimentation and evaluation are often needed to find the best approach for your specific problem.