As mentioned in the chapter introduction, machine learning models generally require numerical input data that is clean and well-structured. Raw data often contains features that, while informative, may not be in the most effective format for an algorithm to learn patterns. Feature engineering is the process of using domain knowledge and data manipulation techniques to create new input features (predictors) from your existing raw data. The goal is to improve model performance by providing more relevant, informative, or appropriately scaled representations of the underlying information.
Think of it as crafting better ingredients for your machine learning recipe. Instead of just feeding the raw vegetables (original features) into the pot, you might chop, combine, or even derive new elements (engineered features) that make the final dish (model prediction) much better. Well-designed features can significantly simplify the modeling task, sometimes allowing simpler, more interpretable models to achieve high accuracy. Conversely, even sophisticated models can struggle if the input features do not adequately capture the relevant patterns in the data.
The quality and format of input features directly influence a model's ability to learn, which is why feature engineering is such a significant step in the workflow.
Feature engineering is often a creative process guided by data exploration and domain expertise. However, several common techniques are widely applicable:
Interaction features capture the combined effect of two or more features. If you suspect that the influence of one feature depends on the value of another, creating an interaction term can be beneficial.
For example, number_of_bedrooms might interact with square_footage; you could create bedrooms_x_sqft = number_of_bedrooms * square_footage. Similarly, combining two categorical features such as region and product_type might create a more specific feature like region_product, and crossing a numerical feature with a categorical one can yield features such as income_in_region_A and income_in_region_B.
import pandas as pd
# Sample DataFrame
data = {'price': [10, 15, 12], 'quantity': [5, 8, 6]}
df = pd.DataFrame(data)
# Create an interaction feature: total_revenue
df['total_revenue'] = df['price'] * df['quantity']
print(df)
# price quantity total_revenue
# 0 10 5 50
# 1 15 8 120
# 2 12 6 72
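The same idea applies to categorical features. The following is a minimal sketch, using made-up region and product_type values like those mentioned above, that concatenates two categories into a combined one:
import pandas as pd
# Hypothetical categorical data (values assumed for illustration)
df_cat = pd.DataFrame({'region': ['north', 'south', 'north'],
                       'product_type': ['basic', 'premium', 'premium']})
# Combine two categorical features into a more specific one
df_cat['region_product'] = df_cat['region'] + '_' + df_cat['product_type']
print(df_cat)
#   region product_type  region_product
# 0  north        basic     north_basic
# 1  south      premium   south_premium
# 2  north      premium   north_premium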
If you suspect a non-linear relationship between a feature and the target variable, you can create polynomial features by adding powers of the original feature (e.g., x2, x3) or interaction terms between features (e.g., x1x2). This is particularly useful for linear models that assume linear relationships.
Scikit-learn provides a convenient transformer for this: sklearn.preprocessing.PolynomialFeatures.
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# Sample data (1 feature)
X = np.array([[2], [3], [4]]) # Needs to be 2D for PolynomialFeatures
# Create features up to degree 2 (includes interaction if multiple input features)
# include_bias=False avoids adding a column of ones (intercept)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(f"Original Features:\n{X}")
print(f"\nPolynomial Features (degree 2):\n{X_poly}")
# Original Features:
# [[2]
# [3]
# [4]]
#
# Polynomial Features (degree 2):
# [[ 2. 4.] -> x, x^2 for input 2
# [ 3. 9.] -> x, x^2 for input 3
# [ 4. 16.]] -> x, x^2 for input 4
Be cautious with high degrees, as this can lead to a large number of features and increase the risk of overfitting.
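With more than one input feature, the same transformer also generates the cross terms mentioned above. A short sketch with two assumed feature columns:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# Sample data (2 features per row)
X2 = np.array([[2, 3], [1, 4]])
poly2 = PolynomialFeatures(degree=2, include_bias=False)
X2_poly = poly2.fit_transform(X2)
print(X2_poly)
# [[ 2.  3.  4.  6.  9.]   -> x1, x2, x1^2, x1*x2, x2^2
#  [ 1.  4.  1.  4. 16.]]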
Binning involves converting continuous numerical features into discrete categorical features by grouping values into intervals or "bins".
Pandas' cut function is useful for creating bins.
import pandas as pd
# Sample data
ages = pd.Series([22, 35, 58, 19, 41, 73, 28])
# Define bin edges and labels
bins = [0, 18, 35, 60, 100]
labels = ['<18', '18-35', '36-60', '>60']
# Create binned feature
ages_binned = pd.cut(ages, bins=bins, labels=labels, right=False) # right=False means [0, 18), [18, 35), etc.
print(ages_binned)
# 0 18-35
# 1 36-60 <- Note: 35 falls into [35, 60) because right=False
# 2 36-60
# 3 18-35
# 4 36-60
# 5 >60
# 6 18-35
# dtype: category
# Categories (4, object): ['<18' < '18-35' < '36-60' < '>60']
# If using right=True (default):
# ages_binned_right = pd.cut(ages, bins=bins, labels=labels, right=True)
# print(ages_binned_right)
# 0 18-35 -> (18, 35]
# 1 18-35 -> (18, 35]
# ...
Applying mathematical transformations like logarithm, square root, or reciprocal can help stabilize variance, make distributions more normal, or linearize relationships. For instance, taking the logarithm (log(x)) of a highly skewed feature (like income or population) often results in a more symmetric distribution, which can be beneficial for some models.
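As a quick sketch, assuming a small made-up set of skewed income values, NumPy's log1p (which computes log(1 + x) and so also handles zeros) can be applied directly to a column:
import numpy as np
import pandas as pd
# Hypothetical right-skewed feature (values assumed for illustration)
incomes = pd.Series([20_000, 35_000, 40_000, 55_000, 1_200_000])
# log1p computes log(1 + x), avoiding problems with zero values
log_incomes = np.log1p(incomes)
print(log_incomes.round(2))
# 0     9.90
# 1    10.46
# 2    10.60
# 3    10.92
# 4    14.00
# dtype: float64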
Date and time features often contain valuable information that isn't directly usable in its raw format (e.g., '2023-10-27 10:30:00'). You can extract numerous features from them, such as the year, month, day of the week, or whether the date falls on a weekend.
Pandas provides powerful datetime capabilities via the .dt accessor on Series containing datetime objects.
import pandas as pd
# Sample datetime data
dates = pd.Series(pd.to_datetime(['2023-01-15', '2023-05-20', '2024-12-25']))
# Extract features
df_dates = pd.DataFrame({
'year': dates.dt.year,
'month': dates.dt.month,
'dayofweek': dates.dt.dayofweek, # Monday=0, Sunday=6
'is_weekend': dates.dt.dayofweek >= 5
})
print(df_dates)
# year month dayofweek is_weekend
# 0 2023 1 6 True
# 1 2023 5 5 True
# 2 2024 12 2 False
Perhaps the most impactful, yet least automated, aspect of feature engineering is using your understanding of the problem domain. If you're predicting customer churn, you might know that the ratio of support_calls to contract_duration is a meaningful indicator. If analyzing sensor data, the rate of change (value_t - value_{t-1}) might be more informative than the raw values. This often involves combining existing features in ways suggested by expert knowledge rather than purely statistical patterns.
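The following is a minimal sketch of both ideas, using hypothetical support_calls, contract_duration, and sensor value columns assumed purely for illustration:
import pandas as pd
# Hypothetical churn data (column names and values assumed for illustration)
churn = pd.DataFrame({'support_calls': [3, 12, 1],
                      'contract_duration': [24, 6, 12]})
# Ratio feature suggested by domain knowledge
churn['calls_per_month'] = churn['support_calls'] / churn['contract_duration']
print(churn['calls_per_month'].tolist())
# [0.125, 2.0, 0.08333333333333333]
# Hypothetical sensor readings, ordered in time
sensor = pd.Series([10.0, 10.5, 12.0, 11.0])
# Rate of change: value_t - value_{t-1}; the first element has no predecessor, so it is NaN
print(sensor.diff().tolist())
# [nan, 0.5, 1.5, -1.0]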
Feature engineering isn't performed in isolation. It's an integral part of the data preparation pipeline and is often revisited iteratively.
This iterative process, combining data analysis, domain expertise, and experimentation, is fundamental to building effective machine learning models. The techniques discussed here provide a foundation for transforming raw data into powerful predictive features.