A real understanding of feature engineering only comes from applying it. In this hands-on section, we implement feature creation methods using Python, Pandas, and Scikit-learn, working through a small, representative dataset that illustrates interaction features, polynomial features, date/time extraction, and binning.

First, let's set up our environment and create a sample DataFrame. Imagine we have data about online orders.

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer

# Sample data simulating online orders
data = {
    'OrderID': range(1, 11),
    'OrderDate': pd.to_datetime(['2023-01-15 08:30', '2023-01-16 14:00', '2023-02-10 09:15',
                                 '2023-02-25 18:45', '2023-03-05 11:00', '2023-03-12 22:30',
                                 '2023-04-01 07:00', '2023-04-22 16:20', '2023-05-18 10:00',
                                 '2023-05-30 19:55']),
    'ProductCategory': ['Electronics', 'Clothing', 'Groceries', 'Electronics', 'Books',
                        'Clothing', 'Groceries', 'Books', 'Electronics', 'Clothing'],
    'Quantity': [1, 2, 5, 1, 3, 1, 10, 2, 1, 4],
    'UnitPrice': [1200, 50, 5, 800, 20, 75, 3, 15, 1500, 60],
    'CustomerID': [101, 102, 103, 101, 104, 105, 103, 104, 101, 102]
}
df = pd.DataFrame(data)

# A common first step: create a 'TotalPrice' feature
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

print("Original DataFrame with TotalPrice:")
print(df.head())
```

Our initial DataFrame looks like this (showing the first 5 rows):

```
   OrderID           OrderDate ProductCategory  Quantity  UnitPrice  CustomerID  TotalPrice
0        1 2023-01-15 08:30:00     Electronics         1       1200         101        1200
1        2 2023-01-16 14:00:00        Clothing         2         50         102         100
2        3 2023-02-10 09:15:00       Groceries         5          5         103          25
3        4 2023-02-25 18:45:00     Electronics         1        800         101         800
4        5 2023-03-05 11:00:00           Books         3         20         104          60
```

Now, let's engineer some new features.

## Creating Interaction Features

Interaction features capture the combined effect of two or more features. For instance, does the combination of Quantity and UnitPrice (which we already calculated as TotalPrice) have a different impact than their individual effects? While simple multiplication (like our TotalPrice calculation) is a basic interaction, Scikit-learn's PolynomialFeatures provides a systematic way to generate these, especially when you want interactions between multiple features.

Let's generate interaction terms for Quantity and UnitPrice. We'll set interaction_only=True to get only the product term ($Quantity \times UnitPrice$) along with the original features, and include_bias=False to omit the constant term (a column of ones).

```python
# Select features for interaction
interaction_cols = ['Quantity', 'UnitPrice']
poly_interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

# Fit and transform the data
interaction_features = poly_interactions.fit_transform(df[interaction_cols])

# Get feature names for clarity
interaction_feature_names = poly_interactions.get_feature_names_out(interaction_cols)

# Create a new DataFrame with these features
df_interactions = pd.DataFrame(interaction_features, columns=interaction_feature_names, index=df.index)

print("\nInteraction Features (Quantity, UnitPrice):")
print(df_interactions.head())
```

Output:

```
Interaction Features (Quantity, UnitPrice):
   Quantity  UnitPrice  Quantity UnitPrice
0       1.0     1200.0              1200.0
1       2.0       50.0               100.0
2       5.0        5.0                25.0
3       1.0      800.0               800.0
4       3.0       20.0                60.0
```

As you can see, PolynomialFeatures generated the original features (Quantity, UnitPrice) and their interaction term (Quantity UnitPrice). Notice the interaction term is exactly our manually calculated TotalPrice. This tool becomes more powerful when you have many features and want to explore pairwise (or higher-order) interactions systematically.
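To see how this scales beyond two columns, here is a minimal sketch on a made-up DataFrame with three numeric features (the names a, b, and c are purely illustrative and not part of our orders data). With interaction_only=True, PolynomialFeatures emits every pairwise product alongside the original columns, without any squared terms.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical three-feature data, for illustration only
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30], 'c': [0.5, 1.5, 2.5]})

pairwise = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
toy_interactions = pairwise.fit_transform(toy)

# Resulting feature names: 'a', 'b', 'c', 'a b', 'a c', 'b c'
# (the originals plus each pairwise product, no squares)
print(pairwise.get_feature_names_out(['a', 'b', 'c']))
```

Doing this by hand is easy with three columns, but the transformer generates all pairs consistently for any number of features and fits directly into a Scikit-learn pipeline.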
## Generating Polynomial Features

Sometimes, the relationship between a feature and the target isn't linear. Polynomial features allow models to capture curves. Let's generate degree-2 polynomial features for Quantity and UnitPrice. This will include the original features, their interaction, and their squared terms ($Quantity^2$, $UnitPrice^2$).

```python
# Select features for polynomial expansion
poly_cols = ['Quantity', 'UnitPrice']
poly_expander = PolynomialFeatures(degree=2, include_bias=False)

# Fit and transform
polynomial_features = poly_expander.fit_transform(df[poly_cols])

# Get feature names
polynomial_feature_names = poly_expander.get_feature_names_out(poly_cols)

# Create DataFrame
df_polynomial = pd.DataFrame(polynomial_features, columns=polynomial_feature_names, index=df.index)

print("\nPolynomial Features (Degree 2 for Quantity, UnitPrice):")
print(df_polynomial.head())
```

Output:

```
Polynomial Features (Degree 2 for Quantity, UnitPrice):
   Quantity  UnitPrice  Quantity^2  Quantity UnitPrice  UnitPrice^2
0       1.0     1200.0         1.0              1200.0    1440000.0
1       2.0       50.0         4.0               100.0       2500.0
2       5.0        5.0        25.0                25.0         25.0
3       1.0      800.0         1.0               800.0     640000.0
4       3.0       20.0         9.0                60.0        400.0
```

Now we have Quantity, UnitPrice, $Quantity^2$, $Quantity \times UnitPrice$, and $UnitPrice^2$. These new features can help linear models fit non-linear relationships. Be mindful, however, that higher degrees can lead to a large number of features and potential overfitting.

## Extracting Features from Date/Time Data

Date and time information often contains valuable patterns related to trends, seasonality, or specific events. Pandas provides the convenient .dt accessor for datetime columns. Let's extract various components from our OrderDate column.

```python
# Ensure OrderDate is datetime type (already done in setup)
# df['OrderDate'] = pd.to_datetime(df['OrderDate'])

# Extract components
df['OrderYear'] = df['OrderDate'].dt.year
df['OrderMonth'] = df['OrderDate'].dt.month
df['OrderDay'] = df['OrderDate'].dt.day
df['OrderDayOfWeek'] = df['OrderDate'].dt.dayofweek  # Monday=0, Sunday=6
df['OrderHour'] = df['OrderDate'].dt.hour
df['OrderIsWeekend'] = df['OrderDayOfWeek'].isin([5, 6]).astype(int)  # Saturday or Sunday

print("\nDataFrame with Extracted Date/Time Features:")
# Display relevant columns
print(df[['OrderDate', 'OrderYear', 'OrderMonth', 'OrderDay',
          'OrderDayOfWeek', 'OrderHour', 'OrderIsWeekend']].head())
```

Output:

```
DataFrame with Extracted Date/Time Features:
            OrderDate  OrderYear  OrderMonth  OrderDay  OrderDayOfWeek  OrderHour  OrderIsWeekend
0 2023-01-15 08:30:00       2023           1        15               6          8               1
1 2023-01-16 14:00:00       2023           1        16               0         14               0
2 2023-02-10 09:15:00       2023           2        10               4          9               0
3 2023-02-25 18:45:00       2023           2        25               5         18               1
4 2023-03-05 11:00:00       2023           3         5               6         11               1
```

These new features (Year, Month, Day, DayOfWeek, Hour, IsWeekend) are now numerical and can reveal patterns like "more sales on weekends" or "higher electronics purchases in Q4".
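As a quick illustration of how such a pattern could be inspected, the sketch below aggregates order value and quantity over the new OrderIsWeekend flag. With only ten rows the numbers are not meaningful, but the same groupby works unchanged on a full dataset.

```python
# Average order value and quantity for weekday (0) vs. weekend (1) orders
# (purely illustrative on this tiny sample)
print(df.groupby('OrderIsWeekend')[['TotalPrice', 'Quantity']].mean())
```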
## Binning Numerical Features

Binning (or discretization) converts continuous numerical features into categorical ones by grouping values into intervals (bins). This can sometimes help models by capturing non-linear effects or simplifying relationships.

Let's bin the UnitPrice feature into 'Low', 'Medium', and 'High' price ranges using pd.cut. When given an integer number of bins, pd.cut creates equal-width intervals across the range of values; here we instead supply explicit bin edges so the categories correspond to meaningful price points.

```python
# Binning UnitPrice using pd.cut with explicit bin edges
price_bins = [0, 100, 1000, df['UnitPrice'].max()]  # Define bin edges
price_labels = ['Low', 'Medium', 'High']            # Define labels for the bins
df['PriceCategory_Cut'] = pd.cut(df['UnitPrice'], bins=price_bins, labels=price_labels,
                                 right=True, include_lowest=True)

print("\nDataFrame with Binned UnitPrice (pd.cut):")
print(df[['UnitPrice', 'PriceCategory_Cut']].head())
```

Output:

```
DataFrame with Binned UnitPrice (pd.cut):
   UnitPrice PriceCategory_Cut
0       1200              High
1         50               Low
2          5               Low
3        800            Medium
4         20               Low
```

Alternatively, we can use pd.qcut to create bins based on quantiles (equal frequency). This ensures roughly the same number of observations fall into each bin. Let's bin Quantity into 3 quantile bins. One caveat: Quantity contains many tied values (four orders have a quantity of 1), so the raw quantile edges coincide and pd.qcut cannot form three distinct bins from the values directly. Ranking the values first breaks the ties while preserving their order and yields three roughly equal-sized bins.

```python
# Binning Quantity using pd.qcut (quantile-based bins)
# rank(method='first') breaks ties among repeated quantities so that
# three distinct quantile edges exist
quantity_labels = ['Low Qty', 'Medium Qty', 'High Qty']
df['QuantityCategory_QCut'] = pd.qcut(df['Quantity'].rank(method='first'), q=3,
                                      labels=quantity_labels)

print("\nDataFrame with Binned Quantity (pd.qcut):")
print(df[['Quantity', 'QuantityCategory_QCut']].head())
```

Output:

```
DataFrame with Binned Quantity (pd.qcut):
   Quantity QuantityCategory_QCut
0         1               Low Qty
1         2            Medium Qty
2         5              High Qty
3         1               Low Qty
4         3            Medium Qty
```

Remember that these binned features are now categorical. You would typically need to encode them (e.g., using One-Hot Encoding or Ordinal Encoding, covered in Chapter 3) before feeding them into most machine learning models.

For integration into Scikit-learn pipelines, KBinsDiscretizer offers similar functionality to pd.cut and pd.qcut but within the Scikit-learn transformer API.

```python
# Example using KBinsDiscretizer (alternative for pipelines)
# from sklearn.preprocessing import KBinsDiscretizer
# kbins = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')  # 'uniform' ~ cut, 'quantile' ~ qcut
# df['PriceCategory_KBins'] = kbins.fit_transform(df[['UnitPrice']])
# print("\nDataFrame with Binned UnitPrice (KBinsDiscretizer):")
# print(df[['UnitPrice', 'PriceCategory_KBins']].head())
```

## Consolidating New Features

After creating these features, you'll typically want to combine them into a single DataFrame for model training. You can achieve this using pd.concat or by assigning the new columns directly, as we did with the date/time features and binned features. For features generated by Scikit-learn transformers like PolynomialFeatures, you often concatenate the resulting NumPy arrays or DataFrames with your original data (making sure indices align).

Here's our DataFrame now, incorporating the date/time and binned features directly:

```python
print("\nFinal DataFrame with Selected Engineered Features:")
print(df.head())
```

Output:

```
Final DataFrame with Selected Engineered Features:
   OrderID           OrderDate ProductCategory  Quantity  UnitPrice  CustomerID  TotalPrice  OrderYear  OrderMonth  OrderDay  OrderDayOfWeek  OrderHour  OrderIsWeekend PriceCategory_Cut QuantityCategory_QCut
0        1 2023-01-15 08:30:00     Electronics         1       1200         101        1200       2023           1        15               6          8               1              High               Low Qty
1        2 2023-01-16 14:00:00        Clothing         2         50         102         100       2023           1        16               0         14               0               Low            Medium Qty
2        3 2023-02-10 09:15:00       Groceries         5          5         103          25       2023           2        10               4          9               0               Low              High Qty
3        4 2023-02-25 18:45:00     Electronics         1        800         101         800       2023           2        25               5         18               1            Medium               Low Qty
4        5 2023-03-05 11:00:00           Books         3         20         104          60       2023           3         5               6         11               1               Low            Medium Qty
```
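If you also wanted to carry the polynomial terms generated earlier into this frame, pd.concat aligns them by index. The sketch below is a minimal illustration: df_model is a throwaway name for the combined result, and the squared-term column names are those produced by get_feature_names_out earlier.

```python
# Attach only the genuinely new polynomial columns; Quantity, UnitPrice,
# and their product (TotalPrice) are already present in df
df_model = pd.concat([df, df_polynomial[['Quantity^2', 'UnitPrice^2']]], axis=1)

print(df_model[['Quantity', 'UnitPrice', 'Quantity^2', 'UnitPrice^2']].head())
```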
This practical exercise demonstrated how to translate the concepts of interaction features, polynomial features, date/time extraction, and binning into concrete code using Pandas and Scikit-learn. Remember that feature creation is often iterative. You might try creating several features, evaluating their impact on your model (using techniques discussed in feature selection), and refining your feature set based on the results and your understanding of the problem domain.