Many machine learning algorithms, particularly linear models and those assuming normally distributed errors, can perform suboptimally when dealing with highly skewed numerical features. Skewness refers to the asymmetry in the distribution of data. A common type is right-skewness (or positive skewness), where the tail on the right side of the distribution is longer or fatter than the left side. This often occurs with features representing counts, frequencies, or monetary values, where most observations are clustered at lower values, but a few extremely high values stretch the distribution.
Consider a feature like 'income'. Most people might have incomes within a certain range, but a small number of individuals might have significantly higher incomes, creating a long right tail. These extreme values can disproportionately influence model parameters or distance calculations.
One of the most common and effective techniques to handle right-skewed data is the logarithmic transformation. Applying the natural logarithm (base e, often denoted as ln) or logarithm base 10 (log10) compresses the range of the data, especially at the higher end.
The transformation is defined as y = log(x), where x is the original feature value and y is the transformed value.
The effect of the logarithm is that large values are compressed far more than smaller values. For example, using the natural logarithm, log(10) ≈ 2.3, log(100) ≈ 4.6, and log(1000) ≈ 6.9. While the absolute difference between 100 and 1000 is 900, the difference between their logarithms is only about 2.3. This compression helps make the distribution more symmetric, often closer to a normal distribution.
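A quick check in NumPy makes this compression concrete; it is just a minimal sketch reproducing the numbers above:

import numpy as np

values = np.array([10, 100, 1000])
logs = np.log(values)        # natural logarithm

print(logs)                  # approximately [2.30, 4.61, 6.91]
print(np.diff(values))       # absolute gaps between values: [ 90, 900]
print(np.diff(logs))         # gaps after the transform: both about 2.30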
A significant limitation of the standard log transformation is that it is only defined for positive values (x > 0). The logarithm of zero is undefined, and the logarithm of a negative number is not a real number, so neither can be used directly in standard machine learning models.
If your data contains zeros but no negative values, a common workaround is to add a small constant, typically 1, to all values before applying the logarithm. This is known as the log(1+x) transformation or log1p:
y = log(1 + x)
This transformation has the convenient property that log(1+0) = log(1) = 0, preserving the meaning of zero while allowing the transformation for non-negative data. Many numerical libraries, including NumPy, provide a dedicated function, np.log1p(), which calculates log(1+x) accurately even for very small values of x.
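To see why the dedicated function matters, compare it against the naive two-step computation on a tiny input; this is a minimal sketch using only standard NumPy calls:

import numpy as np

x = 1e-15
# Adding 1 first loses most of x's precision before the log is taken
print(np.log(1 + x))   # roughly 1.11e-15, about 11% off the true value
print(np.log1p(x))     # roughly 1.00e-15, essentially exact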
If your data contains negative values, the log transformation (even log1p) cannot be applied directly. In such cases, you might need to consider other transformations, such as the Yeo-Johnson transformation (covered later in this chapter), or potentially split the feature based on its sign if that makes sense in your domain context.
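As a preview, scikit-learn's PowerTransformer supports the Yeo-Johnson method, which is defined for negative, zero, and positive values alike. The sketch below uses a purely illustrative 'balance' column; the details of Yeo-Johnson are covered later in the chapter:

import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical feature containing negative, zero, and positive values
df_mixed = pd.DataFrame({'balance': [-250.0, 0.0, 120.0, 4800.0, 95000.0]})

# Yeo-Johnson extends power transforms to the whole real line
pt = PowerTransformer(method='yeo-johnson')
df_mixed['balance_transformed'] = pt.fit_transform(df_mixed[['balance']])
print(df_mixed)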
Applying the log transformation in Python is straightforward using NumPy. Assuming you have your data in a Pandas DataFrame df and want to transform a column named feature_skewed:
import numpy as np
import pandas as pd

# Sample skewed data (e.g., simulating income)
np.random.seed(42)
data_skewed = np.random.exponential(scale=10000, size=1000)

# Introduce some zeros
data_skewed[::10] = 0

df = pd.DataFrame({'feature_skewed': data_skewed})

# Check for negative values before applying log
if (df['feature_skewed'] < 0).any():
    print("Warning: Feature contains negative values. Log transform is not suitable.")
else:
    # Apply log1p transformation to handle potential zeros
    df['feature_log_transformed'] = np.log1p(df['feature_skewed'])

    # Display first few rows of original and transformed data
    print(df[['feature_skewed', 'feature_log_transformed']].head())

    # Display basic statistics
    print("\nOriginal Statistics:")
    print(df['feature_skewed'].describe())
    print("\nTransformed Statistics:")
    print(df['feature_log_transformed'].describe())
The impact of the log transformation is often best understood visually. Let's compare the distribution of the original skewed feature with its log-transformed version.
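To reproduce the comparison yourself, two histograms side by side are enough. The sketch below assumes the df built in the previous snippet and uses Matplotlib, though any plotting library works:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Original, right-skewed feature
ax1.hist(df['feature_skewed'], bins=50, color='#339af0', density=True)
ax1.set_title('Original')
ax1.set_xlabel('feature_skewed')
ax1.set_ylabel('Density')

# log1p-transformed feature
ax2.hist(df['feature_log_transformed'], bins=50, color='#20c997', density=True)
ax2.set_title('Log1p Transformed')
ax2.set_xlabel('log(1 + feature_skewed)')

plt.tight_layout()
plt.show()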
{"layout": {"title": "Original vs. Log-Transformed Distribution", "xaxis": {"title": "Original Value (feature_skewed)"}, "yaxis": {"title": "Density"}, "xaxis2": {"title": "Log(1+x) Transformed Value", "overlaying": "x", "side": "bottom", "anchor": "y", "position": 0.15}, "yaxis2": {"overlaying": "y", "side": "right", "title": "Density (Transformed)", "anchor": "x"}, "barmode": "overlay", "legend": {"x": 0.6, "y": 0.95}}, "data": [{"type": "histogram", "x": data_skewed, "name": "Original", "marker": {"color": "#339af0"}, "opacity": 0.75, "histnorm": "probability density"}, {"type": "histogram", "x": np.log1p(data_skewed), "name": "Log1p Transformed", "marker": {"color": "#20c997"}, "opacity": 0.75, "histnorm": "probability density", "xaxis": "x2", "yaxis": "y2"}]}
Comparison of the probability density histograms for the original positively skewed data (blue) and the data after applying the
log1p
transformation (green). The transformed data shows a much more symmetric, bell-like shape.
As seen in the plot, the original distribution is heavily concentrated near zero with a long tail extending towards higher values. After the log1p transformation, the distribution becomes much more symmetric and spread out, resembling a normal distribution more closely. This transformed feature is often more suitable for algorithms sensitive to scale and distribution shape.
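You can also quantify the change rather than judging it from the plot alone. Pandas' skew() method reports sample skewness, with values near zero indicating a roughly symmetric distribution; the exact numbers depend on the random sample generated above:

# Skewness near 0 indicates approximate symmetry
print("Skewness before:", df['feature_skewed'].skew())
print("Skewness after: ", df['feature_log_transformed'].skew())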
Log transformation is a simple yet effective tool in your feature engineering toolkit, particularly useful for taming features with exponential growth patterns or multiplicative effects often seen in real datasets.