Okay, theory is essential, but real understanding comes from applying these techniques. Let's get our hands dirty and implement the feature creation methods we've discussed using Python, Pandas, and Scikit-learn. We'll use a small, representative dataset to illustrate interaction features, polynomial features, date/time extraction, and binning.
First, let's set up our environment and create a sample DataFrame. Imagine we have data about online orders.
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer
# Sample data simulating online orders
data = {
    'OrderID': range(1, 11),
    'OrderDate': pd.to_datetime(['2023-01-15 08:30', '2023-01-16 14:00', '2023-02-10 09:15',
                                 '2023-02-25 18:45', '2023-03-05 11:00', '2023-03-12 22:30',
                                 '2023-04-01 07:00', '2023-04-22 16:20', '2023-05-18 10:00',
                                 '2023-05-30 19:55']),
    'ProductCategory': ['Electronics', 'Clothing', 'Groceries', 'Electronics', 'Books',
                        'Clothing', 'Groceries', 'Books', 'Electronics', 'Clothing'],
    'Quantity': [1, 2, 5, 1, 3, 1, 10, 2, 1, 4],
    'UnitPrice': [1200, 50, 5, 800, 20, 75, 3, 15, 1500, 60],
    'CustomerID': [101, 102, 103, 101, 104, 105, 103, 104, 101, 102]
}
df = pd.DataFrame(data)
# A common first step: create a 'TotalPrice' feature
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
print("Original DataFrame with TotalPrice:")
print(df.head())
Our initial DataFrame looks like this (showing the first 5 rows):
   OrderID           OrderDate ProductCategory  Quantity  UnitPrice  CustomerID  TotalPrice
0        1 2023-01-15 08:30:00     Electronics         1       1200         101        1200
1        2 2023-01-16 14:00:00        Clothing         2         50         102         100
2        3 2023-02-10 09:15:00       Groceries         5          5         103          25
3        4 2023-02-25 18:45:00     Electronics         1        800         101         800
4        5 2023-03-05 11:00:00           Books         3         20         104          60
Now, let's engineer some new features.
Interaction features capture the combined effect of two or more features. For instance, does the combination of Quantity and UnitPrice (which we already calculated as TotalPrice) have a different impact than their individual effects? While simple multiplication (like our TotalPrice calculation) is a basic interaction, Scikit-learn's PolynomialFeatures provides a systematic way to generate these, especially when you want interactions between multiple features.
Let's generate interaction terms for Quantity and UnitPrice. We'll set interaction_only=True to get only the product term (Quantity × UnitPrice) along with the original features, and include_bias=False to omit the constant term (a column of ones).
# Select features for interaction
interaction_cols = ['Quantity', 'UnitPrice']
poly_interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
# Fit and transform the data
interaction_features = poly_interactions.fit_transform(df[interaction_cols])
# Get feature names for clarity
interaction_feature_names = poly_interactions.get_feature_names_out(interaction_cols)
# Create a new DataFrame with these features
df_interactions = pd.DataFrame(interaction_features, columns=interaction_feature_names, index=df.index)
print("\nInteraction Features (Quantity, UnitPrice):")
print(df_interactions.head())
Output:
Interaction Features (Quantity, UnitPrice):
   Quantity  UnitPrice  Quantity UnitPrice
0       1.0     1200.0              1200.0
1       2.0       50.0               100.0
2       5.0        5.0                25.0
3       1.0      800.0               800.0
4       3.0       20.0                60.0
As you can see, PolynomialFeatures generated the original features (Quantity, UnitPrice) and their interaction term (Quantity UnitPrice). Notice that the interaction term is exactly our manually calculated TotalPrice. This tool becomes more powerful when you have many features and want to explore pairwise (or higher-order) interactions systematically.
Sometimes, the relationship between a feature and the target isn't linear. Polynomial features allow models to capture curves. Let's generate degree-2 polynomial features for Quantity and UnitPrice. This will include the original features, their interaction, and their squared terms (Quantity^2 and UnitPrice^2).
# Select features for polynomial expansion
poly_cols = ['Quantity', 'UnitPrice']
poly_expander = PolynomialFeatures(degree=2, include_bias=False)
# Fit and transform
polynomial_features = poly_expander.fit_transform(df[poly_cols])
# Get feature names
polynomial_feature_names = poly_expander.get_feature_names_out(poly_cols)
# Create DataFrame
df_polynomial = pd.DataFrame(polynomial_features, columns=polynomial_feature_names, index=df.index)
print("\nPolynomial Features (Degree 2 for Quantity, UnitPrice):")
print(df_polynomial.head())
Output:
Polynomial Features (Degree 2 for Quantity, UnitPrice):
   Quantity  UnitPrice  Quantity^2  Quantity UnitPrice  UnitPrice^2
0       1.0     1200.0         1.0              1200.0    1440000.0
1       2.0       50.0         4.0               100.0       2500.0
2       5.0        5.0        25.0                25.0         25.0
3       1.0      800.0         1.0               800.0     640000.0
4       3.0       20.0         9.0                60.0        400.0
Now we have Quantity, UnitPrice, Quantity^2, Quantity × UnitPrice, and UnitPrice^2. These new features can help linear models fit non-linear relationships. Be mindful, however, that higher degrees can lead to a large number of features and potential overfitting.
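To get a feel for the blow-up, here is a standalone sketch (using a hypothetical 4-feature matrix, not our orders data): excluding the bias column, PolynomialFeatures generates C(n + d, d) - 1 columns for n inputs at degree d.

```python
# Feature-count growth with polynomial degree: for n inputs and degree d
# (no bias column), PolynomialFeatures emits C(n + d, d) - 1 columns.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((5, 4))  # any matrix with 4 input features
for degree in (2, 3, 5):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X)
    print(degree, poly.n_output_features_)
# 2 14
# 3 34
# 5 125
```

Four features already become 125 at degree 5, which is why degree-2 expansions are the common starting point.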
Date and time information often contains valuable patterns related to trends, seasonality, or specific events. Pandas provides the convenient .dt accessor for datetime columns. Let's extract various components from our OrderDate column.
# Ensure OrderDate is datetime type (already done in setup)
# df['OrderDate'] = pd.to_datetime(df['OrderDate'])
# Extract components
df['OrderYear'] = df['OrderDate'].dt.year
df['OrderMonth'] = df['OrderDate'].dt.month
df['OrderDay'] = df['OrderDate'].dt.day
df['OrderDayOfWeek'] = df['OrderDate'].dt.dayofweek # Monday=0, Sunday=6
df['OrderHour'] = df['OrderDate'].dt.hour
df['OrderIsWeekend'] = df['OrderDayOfWeek'].isin([5, 6]).astype(int) # Saturday or Sunday
print("\nDataFrame with Extracted Date/Time Features:")
# Display relevant columns
print(df[['OrderDate', 'OrderYear', 'OrderMonth', 'OrderDay', 'OrderDayOfWeek', 'OrderHour', 'OrderIsWeekend']].head())
Output:
DataFrame with Extracted Date/Time Features:
            OrderDate  OrderYear  OrderMonth  OrderDay  OrderDayOfWeek  OrderHour  OrderIsWeekend
0 2023-01-15 08:30:00       2023           1        15               6          8               1
1 2023-01-16 14:00:00       2023           1        16               0         14               0
2 2023-02-10 09:15:00       2023           2        10               4          9               0
3 2023-02-25 18:45:00       2023           2        25               5         18               1
4 2023-03-05 11:00:00       2023           3         5               6         11               1
These new features (Year, Month, Day, DayOfWeek, Hour, IsWeekend) are now numerical and can reveal patterns like "more sales on weekends" or "higher electronics purchases in Q4".
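Once such flags exist, a quick aggregation can test a hypothesis like "more sales on weekends". Here's a self-contained sketch using a four-row miniature stand-in for our DataFrame (the values are a subset of the sample data above):

```python
# Sanity-checking a weekend effect with the engineered IsWeekend flag.
import pandas as pd

mini = pd.DataFrame({
    'OrderDate': pd.to_datetime(['2023-01-15 08:30', '2023-01-16 14:00',
                                 '2023-02-25 18:45', '2023-03-05 11:00']),
    'TotalPrice': [1200, 100, 800, 60],
})
mini['OrderIsWeekend'] = mini['OrderDate'].dt.dayofweek.isin([5, 6]).astype(int)

# Average order value: weekdays (0) vs weekends (1)
print(mini.groupby('OrderIsWeekend')['TotalPrice'].mean())
```

On real data, a grouped summary like this is a cheap way to decide whether a candidate feature is worth keeping.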
Binning (or discretization) converts continuous numerical features into categorical ones by grouping values into intervals (bins). This can sometimes help models by capturing non-linear effects or simplifying relationships.
Let's bin the UnitPrice feature into categories like 'Low', 'Medium', and 'High' price ranges using pd.cut. Here we supply explicit bin edges; passing an integer number of bins instead would make pd.cut create equal-width bins across the range of values.
# Binning UnitPrice using pd.cut (equal width bins)
price_bins = [0, 100, 1000, df['UnitPrice'].max()] # Define bin edges
price_labels = ['Low', 'Medium', 'High'] # Define labels for the bins
df['PriceCategory_Cut'] = pd.cut(df['UnitPrice'], bins=price_bins, labels=price_labels, right=True, include_lowest=True)
print("\nDataFrame with Binned UnitPrice (pd.cut):")
print(df[['UnitPrice', 'PriceCategory_Cut']].head())
Output:
DataFrame with Binned UnitPrice (pd.cut):
   UnitPrice PriceCategory_Cut
0       1200              High
1         50               Low
2          5               Low
3        800            Medium
4         20               Low
Alternatively, we can use pd.qcut to create bins based on quantiles (equal frequency), so that roughly the same number of observations falls into each bin. Let's bin Quantity into quantile bins.
# Binning Quantity using pd.qcut (quantile-based bins)
# Note: many orders have Quantity == 1, so the lower tercile edge equals the
# minimum. With duplicates='drop', pd.qcut merges the duplicate edges, leaving
# two bins, and the label list must match the bins that survive.
quantity_labels = ['Low Qty', 'High Qty']
df['QuantityCategory_QCut'] = pd.qcut(df['Quantity'], q=3, labels=quantity_labels, duplicates='drop')
print("\nDataFrame with Binned Quantity (pd.qcut):")
print(df[['Quantity', 'QuantityCategory_QCut']].head())
Output:
DataFrame with Binned Quantity (pd.qcut):
   Quantity QuantityCategory_QCut
0         1               Low Qty
1         2               Low Qty
2         5              High Qty
3         1               Low Qty
4         3               Low Qty
Remember that these binned features are now categorical. You would typically need to encode them (e.g., using One-Hot Encoding or Ordinal Encoding, covered in Chapter 3) before feeding them into most machine learning models.
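As a quick preview of that step, here is a hedged standalone sketch of one common option, pd.get_dummies, applied to a cut column (toy prices rather than the full DataFrame):

```python
# One-hot encoding a binned (categorical) feature with pd.get_dummies.
import pandas as pd

prices = pd.Series([1200, 50, 5, 800, 20])
price_cat = pd.cut(prices, bins=[0, 100, 1000, 1500],
                   labels=['Low', 'Medium', 'High'], include_lowest=True)

dummies = pd.get_dummies(price_cat, prefix='Price')
print(dummies.columns.tolist())
# ['Price_Low', 'Price_Medium', 'Price_High']
```

Each label becomes its own indicator column, which most Scikit-learn estimators can consume directly.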
For integration into Scikit-learn pipelines, KBinsDiscretizer offers functionality similar to pd.cut and pd.qcut, but within the Scikit-learn transformer API.
# Example using KBinsDiscretizer (alternative for pipelines)
# from sklearn.preprocessing import KBinsDiscretizer
# kbins = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform') # 'uniform' ~ cut, 'quantile' ~ qcut
# df['PriceCategory_KBins'] = kbins.fit_transform(df[['UnitPrice']])
# print("\nDataFrame with Binned UnitPrice (KBinsDiscretizer):")
# print(df[['UnitPrice', 'PriceCategory_KBins']].head())
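For reference, here is a runnable miniature of the same idea (toy prices standing in for the UnitPrice column):

```python
# KBinsDiscretizer: strategy='uniform' splits the observed range into
# equal-width bins (roughly pd.cut with an integer bin count);
# strategy='quantile' mimics pd.qcut.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

prices = np.array([[5.0], [20.0], [50.0], [800.0], [1500.0]])
kbins = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
codes = kbins.fit_transform(prices)

print(kbins.bin_edges_[0])  # three equal-width edges over [5, 1500]
print(codes.ravel())        # [0. 0. 0. 1. 2.]
```

Because it is a transformer, the fitted bin edges are reused verbatim on new data at prediction time, which a raw pd.cut call in a script does not guarantee.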
After creating these features, you'll typically want to combine them into a single DataFrame for model training. You can achieve this using pd.concat or by assigning the new columns directly, as we did with the date/time features and binned features. For features generated by Scikit-learn transformers like PolynomialFeatures, you often concatenate the resulting NumPy arrays or DataFrames with your original data, making sure the indices align.
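A minimal sketch of that concatenation step (toy frames standing in for df and the transformer output):

```python
# Column-wise concatenation; a shared index keeps the rows aligned.
import pandas as pd

base = pd.DataFrame({'Quantity': [1, 2, 5], 'UnitPrice': [1200, 50, 5]})
extra = pd.DataFrame({'Quantity^2': [1.0, 4.0, 25.0]}, index=base.index)

combined = pd.concat([base, extra], axis=1)
print(combined.columns.tolist())
# ['Quantity', 'UnitPrice', 'Quantity^2']
```

Passing index=base.index when wrapping a NumPy result is the step that prevents misaligned rows after any earlier filtering or shuffling.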
Here's our DataFrame now, incorporating the date/time and binned features directly:
print("\nFinal DataFrame with Selected Engineered Features:")
print(df.head())
Output:
Final DataFrame with Selected Engineered Features:
   OrderID           OrderDate ProductCategory  Quantity  UnitPrice  CustomerID  TotalPrice  OrderYear  OrderMonth  OrderDay  OrderDayOfWeek  OrderHour  OrderIsWeekend PriceCategory_Cut QuantityCategory_QCut
0        1 2023-01-15 08:30:00     Electronics         1       1200         101        1200       2023           1        15               6          8               1              High               Low Qty
1        2 2023-01-16 14:00:00        Clothing         2         50         102         100       2023           1        16               0         14               0               Low               Low Qty
2        3 2023-02-10 09:15:00       Groceries         5          5         103          25       2023           2        10               4          9               0               Low              High Qty
3        4 2023-02-25 18:45:00     Electronics         1        800         101         800       2023           2        25               5         18               1            Medium               Low Qty
4        5 2023-03-05 11:00:00           Books         3         20         104          60       2023           3         5               6         11               1               Low               Low Qty
This practical exercise demonstrated how to translate the concepts of interaction features, polynomial features, date/time extraction, and binning into concrete code using Pandas and Scikit-learn. Remember that feature creation is often iterative. You might try creating several features, evaluating their impact on your model (using techniques discussed in feature selection), and refining your feature set based on the results and your understanding of the problem domain.
© 2025 ApX Machine Learning