After addressing fundamental data quality issues like incorrect entries and missing values, the next significant step in preparing your data involves transforming features. Raw features often come in varying scales and distributions, which can negatively impact the performance of many machine learning algorithms, particularly those relying on distance calculations (like K-Nearest Neighbors or Support Vector Machines) or gradient descent optimization. This section introduces common techniques for scaling, normalizing, and transforming your data to make it more suitable for modeling.
Consider a dataset with features like 'age' (ranging from 20 to 70) and 'income' (ranging from 30,000 to 250,000). If you use this data directly in an algorithm that calculates distances, the 'income' feature, simply due to its larger scale, will dominate the calculation, potentially overshadowing the influence of 'age', even if 'age' is equally or more informative. Transforming features aims to bring them onto comparable scales, and into distributions that models handle well, so that no single feature dominates simply because of its units.
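As a quick illustration, the short sketch below (using a few made-up age and income values) compares Euclidean distances computed on the raw features with distances computed after standardization; on the raw data, the income column drives the result almost entirely.
import numpy as np
from sklearn.preprocessing import StandardScaler
# Hypothetical rows of [age, income], invented for illustration
people = np.array([
    [25, 40000.0],
    [60, 42000.0],    # very different age, similar income
    [26, 120000.0],   # similar age, very different income
])
# Euclidean distances from the first person on the raw features
raw_dist = np.linalg.norm(people - people[0], axis=1)
print("Raw distances:   ", raw_dist)    # driven almost entirely by income
# After standardization, both features contribute on a comparable scale
scaled = StandardScaler().fit_transform(people)
scaled_dist = np.linalg.norm(scaled - scaled[0], axis=1)
print("Scaled distances:", scaled_dist)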
Let's examine some widely used techniques.
Scaling adjusts the range of your numerical features without changing the shape of their distribution. Two popular methods are Min-Max Scaling and Standardization.
Min-Max scaling, often referred to as normalization, rescales features to a fixed range, typically [0, 1]. The formula for a feature x is:
$$x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
Here, min(x) and max(x) are the minimum and maximum values of the feature in the training dataset.
This method is useful when you need your data bounded within a specific range. However, it's quite sensitive to outliers. A single very large or very small value can significantly compress the rest of the data into a narrow part of the [0, 1] range.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import plotly.graph_objects as go
# Sample skewed data
np.random.seed(42)
data = pd.DataFrame({
'FeatureA': np.random.gamma(2, 2, 100) * 10, # Skewed
'FeatureB': np.random.normal(50, 10, 100) # Normal-ish
})
# Initialize and fit the scaler
min_max_scaler = MinMaxScaler()
# IMPORTANT: Fit only on training data in a real scenario
scaled_data_mm = min_max_scaler.fit_transform(data)
scaled_df_mm = pd.DataFrame(scaled_data_mm, columns=['FeatureA_scaled', 'FeatureB_scaled'])
# Visualization (optional comparison)
fig = go.Figure()
fig.add_trace(go.Histogram(x=data['FeatureA'], name='Original Feature A', marker_color='#1c7ed6', nbinsx=15))
fig.add_trace(go.Histogram(x=scaled_df_mm['FeatureA_scaled'], name='MinMax Scaled Feature A', marker_color='#ff922b', nbinsx=15, xaxis='x2', yaxis='y2'))
fig.update_layout(
title_text='Min-Max Scaling Effect (Shape Unchanged)',
xaxis_title='Original Value', yaxis_title='Count',
xaxis2=dict(title='Scaled Value [0,1]', overlaying='x', side='top'), yaxis2=dict(overlaying='y', side='right'),
bargap=0.1, height=350, legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99), margin=dict(l=20, r=20, t=50, b=20)
)
# fig.show() # Display plot in an interactive environment
Comparison of a feature's distribution before and after Min-Max scaling. The range is compressed to [0, 1], but the overall shape (skewness) remains the same.
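To see the outlier sensitivity described above in action, here is a minimal sketch using a small, hypothetical array that contains one extreme value.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Hypothetical feature with one extreme outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])
scaled = MinMaxScaler().fit_transform(values)
print(scaled.ravel())
# Roughly [0.    0.001 0.002 0.003 0.004 1.   ]: the ordinary values end up
# squeezed into a tiny slice of the [0, 1] range by the single outlier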
Standardization rescales features so they have a mean (μ) of 0 and a standard deviation (σ) of 1. The formula is:
$$x_{\text{standardized}} = \frac{x - \mu}{\sigma}$$
Again, μ and σ are calculated from the training data. The resulting distribution will have a mean of 0 and unit variance. Standardization is less affected by outliers than Min-Max scaling and is often preferred for algorithms that assume normally distributed data centered around zero, or those sensitive to feature variance.
from sklearn.preprocessing import StandardScaler
import plotly.graph_objects as go
# Assume 'data' DataFrame from previous example exists
# Initialize and fit the scaler
standard_scaler = StandardScaler()
# IMPORTANT: Fit only on training data in a real scenario
scaled_data_std = standard_scaler.fit_transform(data)
scaled_df_std = pd.DataFrame(scaled_data_std, columns=['FeatureA_scaled', 'FeatureB_scaled'])
# Visualization
fig_std = go.Figure()
fig_std.add_trace(go.Histogram(x=data['FeatureA'], name='Original Feature A', marker_color='#1c7ed6', nbinsx=15))
fig_std.add_trace(go.Histogram(x=scaled_df_std['FeatureA_scaled'], name='Standardized Feature A', marker_color='#7048e8', nbinsx=15, xaxis='x2', yaxis='y2'))
fig_std.update_layout(
title_text='Standardization Effect (Shape Unchanged)',
xaxis_title='Original Value', yaxis_title='Count',
xaxis2=dict(title='Standardized Value (Mean=0, SD=1)', overlaying='x', side='top'), yaxis2=dict(overlaying='y', side='right'),
bargap=0.1, height=350, legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99), margin=dict(l=20, r=20, t=50, b=20)
)
# fig_std.show() # Display plot in an interactive environment
Comparison of a feature's distribution before and after Standardization. The feature is centered around 0 with a standard deviation of 1, but the skewness remains.
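As a quick sanity check, you can reproduce StandardScaler's output by applying the formula directly; the sketch below assumes the data and scaled_df_std objects from the examples above are still in scope.
# Manual standardization of FeatureA for comparison with StandardScaler
mu = data['FeatureA'].mean()
sigma = data['FeatureA'].std(ddof=0)  # StandardScaler uses the population std (ddof=0)
manual = (data['FeatureA'] - mu) / sigma
print(np.allclose(manual.values, scaled_df_std['FeatureA_scaled'].values))  # True
print(round(manual.mean(), 6), round(manual.std(ddof=0), 6))  # approximately 0.0 and 1.0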
Sometimes, simply scaling isn't enough. If a feature's distribution is heavily skewed, applying a non-linear transformation can help make it more symmetric, potentially improving model performance.
The logarithm function compresses the range of large values and expands the range of small values. This makes it effective at reducing right-skewness (where the tail extends to the right).
The logarithm is only defined for positive values, so use log(1 + x) (numpy.log1p) if values include zero.
# Assume 'data' DataFrame with skewed 'FeatureA' exists
data['FeatureA_log'] = np.log1p(data['FeatureA']) # Use log1p for safety with potential 0s
# Visualization
fig_log = go.Figure()
fig_log.add_trace(go.Histogram(x=data['FeatureA'], name='Original Feature A', marker_color='#1c7ed6', nbinsx=15))
fig_log.add_trace(go.Histogram(x=data['FeatureA_log'], name='Log Transformed Feature A', marker_color='#20c997', nbinsx=15, xaxis='x2', yaxis='y2'))
fig_log.update_layout(
title_text='Log Transformation Effect on Skewed Data',
xaxis_title='Original Value', yaxis_title='Count',
xaxis2=dict(title='Log(1+Value)', overlaying='x', side='top'), yaxis2=dict(overlaying='y', side='right'),
bargap=0.1, height=350, legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99), margin=dict(l=20, r=20, t=50, b=20)
)
# fig_log.show() # Display plot in an interactive environment
{"layout": {"title": {"text": "Log Transformation Effect on Skewed Data"}, "xaxis": {"title": "Original Value"}, "yaxis": {"title": "Count"}, "xaxis2": {"title": "Log(1+Value)", "overlaying": "x", "side": "top"}, "yaxis2": {"overlaying": "y", "side": "right"}, "bargap": 0.1, "height": 350, "legend": {"yanchor": "top", "y": 0.99, "xanchor": "right", "x": 0.99}, "margin": {"l": 20, "r": 20, "t": 50, "b": 20}}, "data": [{"type": "histogram", "x": [25.189111, 3.550108, 13.087884, 60.201311, 35.830938, 22.876369, 15.370904, 17.657475, 33.039507, 20.492071, 3.622366, 44.891902, 35.227802, 21.835556, 45.764233, 22.943996, 10.379933, 15.097931, 7.930903, 27.314806, 28.734366, 22.309165, 34.010965, 10.204829, 41.465735, 33.460476, 6.882727, 11.324313, 31.709783, 16.937075, 45.030573, 25.735518, 16.71754, 18.228756, 37.260895, 35.43618, 33.105487, 31.058024, 51.519706, 25.298306, 32.112421, 16.49835, 33.11234, 11.704552, 22.719204, 36.69339, 51.134577, 28.478218, 18.046695, 20.995845, 52.373627, 10.147184, 10.89916, 12.892583, 27.480429, 51.468944, 14.16084, 22.661649, 16.720034, 18.016989, 39.647051, 24.381999, 22.626494, 46.84119, 12.75825, 18.778724, 40.04377, 24.440361, 23.431058, 24.317533, 22.228711, 48.739278, 18.097285, 14.518982, 16.736277, 13.304785, 21.634973, 22.509017, 12.496137, 11.195871, 26.078889, 5.29289, 50.528114, 13.811333, 24.036307, 31.941443, 30.974886, 30.820344, 18.972149, 12.452203, 33.264347, 13.178073, 44.76769, 26.927067, 14.440191, 32.446621, 23.025803, 52.738866, 13.697486], "name": "Original Feature A", "marker": {"color": "#1c7ed6"}, "nbinsx": 15}, {"type": "histogram", "x": [3.265333, 1.515154, 2.645315, 4.11418, 3.606417, 3.172842, 2.795526, 2.92626, 3.527566, 3.067731, 1.530887, 3.826315, 3.58977, 3.128308, 3.845028, 3.175741, 2.431888, 2.778687, 2.189544, 3.34342, 3.392329, 3.148874, 3.555659, 2.416339, 3.748747, 3.539932, 2.064677, 2.511563, 3.487689, 2.886869, 3.829365, 3.286007, 2.874619, 2.956335, 3.644451, 3.6 A_log': 18, 3.529485, 3.467531, 3.961186, 3.269533, 3.500008, 2.862105, 3.529699, 2.54196, 3.166321, 3.629529, 3.953841, 3.383625, 2.946911, 3.090852, 3.977326, 2.41128, 2.476469, 2.63143, 3.349285, 3.960242, 2.718709, 3.163813, 2.87475, 2.945374, 3.704929, 3.234025, 3.162332, 3.868026, 2.621616, 2.984571, 3.714604, 3.236339, 3.195922, 3.231413, 3.14543, 3.906767, 2.949668, 2.742077, 2.875603, 2.6606, 3.11948, 3.157398, 2.594953, 2.49287, 3.298826, 1.839429, 3.942156, 2.695383, 3.2197, 3.49471, 3.46492, 3.460108, 2.99432, 2.591697, 3.534112, 2.651752, 3.823548, 3.329622, 2.736973, 3.510038, 3.179146, 3.984085, 2.687688], "name": "Log Transformed Feature A", "marker": {"color": "#20c997"}, "nbinsx": 15, "xaxis": "x2", "yaxis": "y2"}]}
The log transformation makes the distribution of 'Feature A' appear much more symmetric (closer to a bell shape) compared to the original right-skewed distribution.
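Rather than judging symmetry by eye, you can quantify it with sample skewness; the short sketch below uses scipy.stats.skew and assumes the data DataFrame (with the FeatureA_log column) from the example above.
from scipy.stats import skew
# Sample skewness near 0 suggests a roughly symmetric distribution
print(f"Original skewness:        {skew(data['FeatureA']):.2f}")
print(f"Log-transformed skewness: {skew(data['FeatureA_log']):.2f}")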
The Box-Cox transformation is a more general power transformation that can find a near-optimal transformation for your data to make it more normal-like. It's defined as:
$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(x) & \text{if } \lambda = 0 \end{cases}$$
The transformation finds the best value of λ (lambda) to stabilize variance and improve normality. A significant limitation is that Box-Cox requires all data to be positive.
from scipy.stats import boxcox
import plotly.graph_objects as go
# Assume 'data' DataFrame with positive, skewed 'FeatureA' exists
# Ensure FeatureA is positive before applying Box-Cox
if (data['FeatureA'] > 0).all():
    # Apply Box-Cox: returns transformed data and optimal lambda
    featureA_boxcox, best_lambda = boxcox(data['FeatureA'])
    print(f"Optimal Lambda found by Box-Cox: {best_lambda:.4f}")
    # Visualization
    fig_boxcox = go.Figure()
    fig_boxcox.add_trace(go.Histogram(x=data['FeatureA'], name='Original Feature A', marker_color='#1c7ed6', nbinsx=15))
    fig_boxcox.add_trace(go.Histogram(x=featureA_boxcox, name='Box-Cox Transformed Feature A', marker_color='#be4bdb', nbinsx=15, xaxis='x2', yaxis='y2'))
    fig_boxcox.update_layout(
        title_text='Box-Cox Transformation Effect',
        xaxis_title='Original Value', yaxis_title='Count',
        xaxis2=dict(title='Box-Cox Transformed Value', overlaying='x', side='top'), yaxis2=dict(overlaying='y', side='right'),
        bargap=0.1, height=350, legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99), margin=dict(l=20, r=20, t=50, b=20)
    )
    # fig_boxcox.show() # Display plot
else:
    print("FeatureA contains non-positive values. Box-Cox cannot be applied directly.")
The Box-Cox transformation also makes the skewed data more symmetric, similar to the log transform in this case, by automatically finding an appropriate power transformation.
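If you want to confirm what boxcox returned, you can reapply the formula manually with the fitted lambda and invert the result with scipy.special.inv_boxcox; the sketch below reuses featureA_boxcox and best_lambda from the block above (they exist here because FeatureA is strictly positive).
from scipy.special import inv_boxcox
# Reapply the Box-Cox formula by hand with the fitted lambda
# (valid as long as best_lambda is not exactly 0)
manual_bc = (data['FeatureA'] ** best_lambda - 1) / best_lambda
print(np.allclose(manual_bc, featureA_boxcox))  # True
# inv_boxcox maps the transformed values back to the original scale
recovered = inv_boxcox(featureA_boxcox, best_lambda)
print(np.allclose(recovered, data['FeatureA']))  # True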
The Yeo-Johnson transformation is similar in spirit to Box-Cox but has the advantage of working with non-positive data. If your data contains zeros or negative numbers, Yeo-Johnson is a suitable alternative for achieving a more normal-like distribution. It's available in scikit-learn as PowerTransformer(method='yeo-johnson').
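As a minimal sketch, here is how Yeo-Johnson might be applied to a small, hypothetical array containing zero and negative values, where Box-Cox would fail.
import numpy as np
from sklearn.preprocessing import PowerTransformer
# Hypothetical skewed values including zero and a negative number
values = np.array([[-5.0], [0.0], [1.0], [2.0], [3.0], [50.0], [200.0]])
pt = PowerTransformer(method='yeo-johnson')  # standardizes the output by default
transformed = pt.fit_transform(values)
print("Fitted lambda:", pt.lambdas_)
print(transformed.ravel())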
A critical point when applying any transformation (scaling or distribution adjustment) is to fit the transformer only on the training data. You learn the parameters (like min/max, mean/std, or lambda) from the training set and then use these learned parameters to transform the training set, validation set, and test set.
Why? Fitting on the entire dataset before splitting would cause data leakage. Information from the test set (e.g., its minimum or maximum value) would leak into the training process, leading to an overly optimistic estimate of your model's performance on unseen data.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression # Example model
from sklearn.pipeline import Pipeline
# Assume 'X' is your feature matrix and 'y' is your target vector
# X = data[['FeatureA', 'FeatureB']] # Example features
# y = ... # Your target variable
# Split data first!
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit scaler ONLY on training data
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# Apply the SAME fitted scaler to test data
# X_test_scaled = scaler.transform(X_test) # Use transform(), NOT fit_transform()
# -- Using Pipelines simplifies this --
# Define steps: 1. Scale, 2. Model
# pipe = Pipeline([
# ('scaler', StandardScaler()),
# ('classifier', LogisticRegression())
# ])
# Fit the entire pipeline on the training data
# The pipeline handles fitting the scaler and then training the model
# pipe.fit(X_train, y_train)
# Predict on the test data
# The pipeline automatically transforms the test data using the fitted scaler
# predictions = pipe.predict(X_test)
# score = pipe.score(X_test, y_test)
# print(f"Model score on test data: {score:.4f}")
Using scikit-learn Pipelines, as shown above, is the recommended way to chain preprocessing steps and modeling. The pipeline ensures that fitting happens only on the training fold during cross-validation and that the correct transformations are applied sequentially.
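For example, the self-contained sketch below (using synthetic features and a synthetic target, since the snippet above leaves X and y as placeholders) evaluates such a pipeline with cross-validation; each fold refits the scaler on that fold's training portion only.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Synthetic features on very different scales and a synthetic binary target,
# invented purely so this snippet runs end to end
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(20, 70, 200),           # 'age'-like feature
                          rng.uniform(30000, 250000, 200)])   # 'income'-like feature
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 5000 > 80).astype(int)
pipe_demo = Pipeline([('scaler', StandardScaler()),
                      ('classifier', LogisticRegression())])
# cross_val_score clones and refits the whole pipeline on each training fold,
# so the scaler never sees the corresponding validation fold
scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")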
Data transformation and normalization are essential steps in preparing data for machine learning. By bringing features to a comparable scale using techniques like Min-Max Scaling or Standardization, you prevent certain features from unduly influencing model outcomes. Furthermore, transforming skewed distributions using methods like Log or Box-Cox transformations can help models that perform better with more symmetric, normal-like data. Remember the essential rule: always fit your transformers on the training data only, and then use them to transform both the training and test datasets to prevent data leakage. These techniques lay the groundwork for effective feature engineering and robust model building.