As mentioned in the chapter introduction, raw numerical features, while seemingly straightforward, often don't reveal their full potential to machine learning algorithms directly. Models like linear regression assume linear relationships, while others might struggle with skewed distributions or large ranges. Generating new features from existing numerical ones can help models better capture underlying patterns, handle non-linearities, and improve overall performance. Let's look at some common and effective techniques.
Sometimes, the exact numerical value of a feature isn't as important as the range or bin it falls into. Binning, or discretization, converts continuous numerical features into discrete categorical ones by grouping values into predefined bins. This can be useful for several reasons: it lets simple models capture non-linear effects, it dampens the influence of outliers and small measurement noise, and it can make a feature easier to interpret and report on.
There are two main approaches to binning: fixed-width binning, where you choose the bin edges yourself (or split the range into equal-width intervals), and quantile-based binning, where the edges are placed so that each bin contains roughly the same number of observations.
Let's see how to implement both using Pandas. Assume we have a DataFrame df with an Age column.
import pandas as pd
import numpy as np
# Sample data
data = {'Age': [22, 25, 31, 45, 58, 62, 75, 81, 19, 38]}
df = pd.DataFrame(data)
# 1. Fixed-Width Binning (e.g., 4 bins)
# Define bin edges explicitly
bins = [18, 30, 45, 60, 100] # Ages 18-30, 31-45, 46-60, 61-100
labels = ['18-30', '31-45', '46-60', '61+']
df['Age_Bin_Fixed'] = pd.cut(df['Age'], bins=bins, labels=labels, right=True, include_lowest=True)
# Or let pandas determine equal-width bins
# df['Age_Bin_Fixed_Auto'] = pd.cut(df['Age'], bins=4) # Creates 4 equal-width bins
# 2. Quantile-Based Binning (e.g., 4 quantiles/quartiles)
df['Age_Bin_Quantile'] = pd.qcut(df['Age'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(df)
Choosing the Number of Bins: The number of bins is a hyperparameter. Too few bins might oversimplify the data, losing valuable information. Too many bins might make the feature too granular, approaching the original continuous variable and potentially leading to overfitting. Cross-validation can help determine an appropriate number of bins.
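As a rough sketch of that tuning loop (synthetic data; assumes scikit-learn is available, with KBinsDiscretizer standing in for the manual pd.cut/pd.qcut calls above), you can wrap the bin count in a pipeline and compare settings with cross-validation:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Synthetic non-linear data: an Age-like feature and a numeric target
rng = np.random.default_rng(0)
X = rng.uniform(18, 90, size=(200, 1))
y = np.sin(X[:, 0] / 15) + rng.normal(0, 0.1, size=200)
# Compare different bin counts with 5-fold cross-validation
for n_bins in [3, 5, 10, 20]:
    pipe = Pipeline([
        ('bins', KBinsDiscretizer(n_bins=n_bins, encode='onehot-dense', strategy='quantile')),
        ('model', LinearRegression()),
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
    print(f"{n_bins:2d} bins: mean R^2 = {scores.mean():.3f}")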
Linear models, by definition, capture only linear relationships ($y = w_1 x_1 + w_2 x_2 + \dots + b$). However, real-world relationships are often non-linear. Polynomial features create new features by raising existing features to a power (e.g., $x^2$, $x^3$) and by creating interaction terms between features (e.g., $x_1 \cdot x_2$).
Consider a simple feature $x$. Adding $x^2$ allows a linear model to fit a quadratic relationship: $y = w_1 x + w_2 x^2 + b$. This is still a linear model because it is linear with respect to the coefficients ($w_1, w_2, b$), even though the relationship between $y$ and the original $x$ is non-linear.
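As a small illustration (synthetic data; assumes scikit-learn), adding a hand-made squared column is enough for ordinary LinearRegression to recover a quadratic relationship:
import numpy as np
from sklearn.linear_model import LinearRegression
# Synthetic quadratic data: y = 2x + 0.5x^2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 2 * x + 0.5 * x**2 + rng.normal(0, 0.2, size=100)
# Stack the original feature and its square as two columns
X_quad = np.column_stack([x, x**2])
model = LinearRegression().fit(X_quad, y)
print(model.coef_, model.intercept_)  # coefficients close to [2.0, 0.5], intercept near 0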
Scikit-learn's PolynomialFeatures transformer is a convenient way to generate these.
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd
# Sample data with two features
data = {'FeatureA': [1, 2, 3, 4, 5],
'FeatureB': [2, 3, 5, 5, 7]}
df_poly = pd.DataFrame(data)
# Create polynomial features up to degree 2
# include_bias=False avoids adding a column of ones (intercept)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df_poly)
# Get feature names for clarity
feature_names = poly.get_feature_names_out(df_poly.columns)
# Create a new DataFrame with these features
df_poly_transformed = pd.DataFrame(poly_features, columns=feature_names)
print("Original Features:")
print(df_poly)
print("\nPolynomial Features (degree=2):")
print(df_poly_transformed)
# Example for degree=3, only FeatureA
poly_deg3 = PolynomialFeatures(degree=3, include_bias=False)
poly_features_a = poly_deg3.fit_transform(df_poly[['FeatureA']])
feature_names_a = poly_deg3.get_feature_names_out(['FeatureA'])
df_poly_a_transformed = pd.DataFrame(poly_features_a, columns=feature_names_a)
print("\nPolynomial Features (degree=3, FeatureA only):")
print(df_poly_a_transformed)
Considerations: The number of generated features grows rapidly with both the polynomial degree and the number of input features, which inflates training time and the risk of overfitting; the higher-order terms also span much larger ranges, so scaling the result (and regularizing the model) is usually advisable. Degree 2 or 3 is a common starting point.
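A quick sketch of that feature explosion (assuming scikit-learn): fitting PolynomialFeatures on a dummy 10-feature input shows how fast the output width grows with the degree:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
dummy = np.zeros((1, 10))  # pretend dataset with 10 input features
for degree in [2, 3, 4]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    poly.fit(dummy)
    print(f"degree={degree}: {poly.n_output_features_} output features from 10 inputs")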
Applying mathematical functions like logarithms, square roots, or reciprocals can help stabilize variance, make distributions more symmetric (closer to normal), and handle skewed data. Many machine learning algorithms perform better when features follow a distribution closer to a Gaussian (normal) distribution.
Let's apply log, square root, and Box-Cox transformations using NumPy and SciPy.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# Sample skewed data (e.g., income)
np.random.seed(42)
income_data = np.random.gamma(2, scale=2000, size=1000)
income_data = np.round(income_data, 2)
df_trans = pd.DataFrame({'Income': income_data})
# Add a small constant for log transform if zeros are possible (not needed here)
# df_trans['Income_Log'] = np.log(df_trans['Income'] + 1)
# Apply Log Transformation
df_trans['Income_Log'] = np.log(df_trans['Income'])
# Apply Square Root Transformation
df_trans['Income_Sqrt'] = np.sqrt(df_trans['Income'])
# Apply Box-Cox Transformation
# scipy.stats.boxcox returns the transformed data and the optimal lambda
df_trans['Income_BoxCox'], best_lambda = stats.boxcox(df_trans['Income'])
print(f"Optimal lambda for Box-Cox: {best_lambda:.4f}")
print(df_trans[['Income', 'Income_Log', 'Income_Sqrt', 'Income_BoxCox']].head())
# --- Visualization: compare the distributions before and after transformation ---
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].hist(df_trans['Income'], bins=30, color='#4dabf7')
axes[0].set_title('Original Income')
axes[1].hist(df_trans['Income_Log'], bins=30, color='#9775fa')
axes[1].set_title('Log Transformed')
axes[2].hist(df_trans['Income_Sqrt'], bins=30, color='#69db7c')
axes[2].set_title('Square Root Transformed')
axes[3].hist(df_trans['Income_BoxCox'], bins=30, color='#ff922b')
axes[3].set_title('Box-Cox Transformed')
plt.tight_layout()
plt.show()  # display the histograms (or use plt.savefig to write them to a file)
The histograms illustrate how transformations can reduce skewness. The original income data is heavily right-skewed. Log and Box-Cox transformations produce distributions much closer to symmetrical, while the square root transformation offers a milder effect.
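To quantify this rather than only eyeballing the histograms, you can compare the skewness of each column (a short check using scipy.stats.skew, reusing the df_trans DataFrame from above; values near 0 indicate rough symmetry):
# Compare skewness before and after each transformation
from scipy.stats import skew
for col in ['Income', 'Income_Log', 'Income_Sqrt', 'Income_BoxCox']:
    print(f"{col:15s} skewness = {skew(df_trans[col]):.3f}")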
When to Apply Transformations: These transformations help most when a feature is strongly skewed or spans several orders of magnitude, and when the downstream model is sensitive to feature scale or distribution (linear models, k-nearest neighbors, and neural networks benefit more than tree-based models). Keep the domain requirements in mind: np.log and Box-Cox require strictly positive values, so use np.log1p (log of x + 1) or the Yeo-Johnson variant when zeros or negatives occur. Finally, fit any data-dependent transformation, such as the Box-Cox lambda, on the training data only and reuse it on validation and test data to avoid leakage.
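One convenient way to keep that train/test discipline is scikit-learn's PowerTransformer, which estimates the Box-Cox (or Yeo-Johnson) parameter during fit and reuses it at transform time; a minimal sketch, reusing the income data above:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer
X = df_trans[['Income']].values
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
# Estimate the Box-Cox lambda on the training split only
# (note: PowerTransformer also standardizes the output by default)
pt = PowerTransformer(method='box-cox')  # use 'yeo-johnson' if zeros/negatives occur
pt.fit(X_train)
X_test_transformed = pt.transform(X_test)  # same lambda applied to unseen data
print("Learned lambda:", pt.lambdas_[0])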
Generating features from numerical data is often an iterative process. You might try binning, polynomial features, or transformations, train a model, evaluate its performance, and then refine your feature engineering strategy based on the results. The techniques discussed here provide a solid foundation for creating more informative numerical features for your machine learning models.