In this hands-on exercise, we apply techniques for transforming raw data into more informative features and for selecting the most impactful ones, using Python libraries such as Pandas and Scikit-learn. The primary objective is not merely to run code, but to understand why specific transformations and selection methods are applied.

We'll work with a dataset representing customer information and their likelihood to purchase a specific product. Imagine we've already performed the initial data loading and basic cleaning steps covered in Chapter 1.

## Setup and Data Preparation

First, let's import the necessary libraries and create a sample DataFrame. In a real project, you would load your data using `pd.read_csv` or similar functions.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
    KBinsDiscretizer, PolynomialFeatures, StandardScaler,
    OneHotEncoder, OrdinalEncoder
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier  # For feature importance
import matplotlib.pyplot as plt
import seaborn as sns

# Sample Data (replace with your actual data loading)
data = {
    'CustomerID': range(1, 101),
    'Age': np.random.randint(18, 70, 100),
    'Income': np.random.normal(50000, 15000, 100).clip(10000),
    'AccountBalance': np.random.normal(10000, 5000, 100).clip(0),
    'NumTransactions': np.random.randint(0, 50, 100),
    'EducationLevel': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 100, p=[0.3, 0.4, 0.2, 0.1]),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'LastFeedback': np.random.choice(['Positive experience', 'Neutral', 'Issue resolved', 'Complaint filed', 'No feedback'], 100, p=[0.3, 0.2, 0.2, 0.1, 0.2]),
    'Purchased': np.random.randint(0, 2, 100)  # Target variable
}
df = pd.DataFrame(data)

# Separate features (X) and target (y)
X = df.drop(['CustomerID', 'Purchased'], axis=1)
y = df['Purchased']

# Split data for realistic evaluation (important for target encoding, etc.)
# We'll apply transformations to the training set and then apply the *same*
# fitted transformations to the test set later. For simplicity in this
# exercise, we might apply some transformations to X directly, but remember
# the train/test split principle for actual model building.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print("Original Training Data Shape:", X_train.shape)
print(X_train.head())
```

## Generating Features from Numerical Data

Let's apply some of the techniques we learned to the numerical columns: `Age`, `Income`, `AccountBalance`, and `NumTransactions`.

### 1. Binning Numerical Data

Binning can help capture non-linear effects or group continuous variables into meaningful categories. Let's bin `Age` into categories.

```python
# Bin Age into 4 quantiles
binner = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile', subsample=None)  # Use subsample=None for exact quantiles on smaller data

# Fit on training data and transform
X_train['Age_Binned'] = binner.fit_transform(X_train[['Age']])

# Transform test data using the *fitted* binner
X_test['Age_Binned'] = binner.transform(X_test[['Age']])

print("\nAge Binned (Training Data):")
print(X_train[['Age', 'Age_Binned']].head())
```

We used quantile-based binning here, creating bins with roughly equal numbers of samples. The `encode='ordinal'` setting assigns numerical labels (0, 1, 2, 3) to the bins.
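If you want to see where the quantile cut points landed, the fitted discretizer exposes them. This is an optional check, a minimal sketch reusing the `binner` object fitted above (the exact edges depend on the randomly generated sample):

```python
# Inspect the learned bin edges for Age (one array of edges per binned column).
# Values are illustrative only; they depend on the random sample above.
age_edges = binner.bin_edges_[0]
print("Age bin edges:", np.round(age_edges, 1))

# Map each ordinal label to a readable range, e.g. {0: '18-31', 1: '31-44', ...}
bin_labels = {i: f"{age_edges[i]:.0f}-{age_edges[i + 1]:.0f}" for i in range(len(age_edges) - 1)}
print(bin_labels)
```

Keeping such a mapping around makes the binned feature easier to interpret later, for example when reading feature importance plots.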
### 2. Polynomial Features

Generating polynomial features can help models capture interaction effects and non-linear relationships. Let's create interaction terms between `Income` and `NumTransactions`.

```python
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)

# Select columns for polynomial features
X_train_poly_subset = X_train[['Income', 'NumTransactions']]
X_test_poly_subset = X_test[['Income', 'NumTransactions']]

# Fit on training data and transform
poly_features_train = poly.fit_transform(X_train_poly_subset)
# Transform test data
poly_features_test = poly.transform(X_test_poly_subset)

# Get feature names
poly_feature_names = poly.get_feature_names_out(['Income', 'NumTransactions'])

# The degree-1 terms ('Income', 'NumTransactions') are returned as well, but they
# already exist in X_train, so keep only the new degree-2 columns to avoid
# duplicate column names.
new_poly_cols = [name for name in poly_feature_names if name not in ('Income', 'NumTransactions')]

# Create DataFrames for the new features
poly_df_train = pd.DataFrame(poly_features_train, columns=poly_feature_names, index=X_train.index)[new_poly_cols]
poly_df_test = pd.DataFrame(poly_features_test, columns=poly_feature_names, index=X_test.index)[new_poly_cols]

# Add these features back to our main DataFrames (dropping the original columns if desired, but keep them for now)
X_train = pd.concat([X_train, poly_df_train], axis=1)
X_test = pd.concat([X_test, poly_df_test], axis=1)

print("\nPolynomial Features Added (Training Data snippet):")
print(X_train[new_poly_cols].head())
```

This created features like `Income^2`, `NumTransactions^2`, and the interaction term `Income * NumTransactions`. Be mindful that polynomial expansion can significantly increase the number of features, especially with higher degrees.

### 3. Scaling

Many algorithms perform better when numerical features are on a similar scale. Let's apply `StandardScaler`.

```python
numerical_cols = ['Income', 'AccountBalance', 'NumTransactions']  # Exclude Age as we binned it

scaler = StandardScaler()

# Fit on training data only
scaler.fit(X_train[numerical_cols])

# Transform both train and test sets
X_train[numerical_cols] = scaler.transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

print("\nScaled Numerical Features (Training Data snippet):")
print(X_train[numerical_cols].head())
print("\nMean after scaling (should be close to 0):")
print(X_train[numerical_cols].mean())
print("\nStandard Deviation after scaling (should be close to 1):")
print(X_train[numerical_cols].std())
```

Remember: fit the scaler only on the training data to prevent information leakage from the test set into the scaling parameters (mean and standard deviation).
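In a production workflow you would typically bundle these fit-on-train / transform-on-test steps into a single object so the leakage rule is enforced automatically. The snippet below is a minimal sketch of that idea using scikit-learn's `ColumnTransformer` and `Pipeline`; it is not part of the walkthrough above and is shown here purely for illustration, reusing the column names from our sample data:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# A compact alternative to the manual fit/transform calls above:
# each transformer is fitted only when the pipeline is fitted on the
# training data, so test-set leakage is avoided by construction.
preprocess = ColumnTransformer(transformers=[
    ('bin_age', KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile', subsample=None), ['Age']),
    ('scale', StandardScaler(), ['Income', 'AccountBalance', 'NumTransactions']),
], remainder='drop')  # drop all other columns in this sketch

pipe = Pipeline(steps=[('preprocess', preprocess)])
pipe.fit(X_train)                             # uses training data only
X_test_transformed = pipe.transform(X_test)   # same fitted parameters applied to the test set
print(X_test_transformed[:3])
```

Chaining the steps this way also makes it straightforward to cross-validate the preprocessing together with a model later on.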
## Encoding Categorical Variables

Now let's handle the categorical columns: `EducationLevel`, `Region`, and `LastFeedback`.

### 1. One-Hot Encoding

`Region` is a nominal variable (no inherent order), so one-hot encoding is suitable.

```python
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # handle_unknown='ignore' is safer for unseen categories in the test set

# Select column
X_train_region = X_train[['Region']]
X_test_region = X_test[['Region']]

# Fit on training data and transform
ohe_features_train = ohe.fit_transform(X_train_region)
# Transform test data
ohe_features_test = ohe.transform(X_test_region)

# Get feature names
ohe_feature_names = ohe.get_feature_names_out(['Region'])

# Create DataFrames
ohe_df_train = pd.DataFrame(ohe_features_train, columns=ohe_feature_names, index=X_train.index)
ohe_df_test = pd.DataFrame(ohe_features_test, columns=ohe_feature_names, index=X_test.index)

# Add back and drop original 'Region'
X_train = pd.concat([X_train.drop('Region', axis=1), ohe_df_train], axis=1)
X_test = pd.concat([X_test.drop('Region', axis=1), ohe_df_test], axis=1)

print("\nOne-Hot Encoded Region (Training Data snippet):")
print(X_train[ohe_feature_names].head())
```

### 2. Ordinal Encoding

`EducationLevel` has a clear order, so we can use ordinal encoding.

```python
# Define the order explicitly
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
ordinal_encoder = OrdinalEncoder(categories=[education_order])  # Pass the order

# Fit and transform training data
X_train['EducationLevel_Encoded'] = ordinal_encoder.fit_transform(X_train[['EducationLevel']])
# Transform test data
X_test['EducationLevel_Encoded'] = ordinal_encoder.transform(X_test[['EducationLevel']])

# Drop original column
X_train = X_train.drop('EducationLevel', axis=1)
X_test = X_test.drop('EducationLevel', axis=1)

print("\nOrdinal Encoded Education Level (Training Data):")
print(X_train[['EducationLevel_Encoded']].head())
```

## Creating Features from Text Data

The `LastFeedback` column contains simple text. Let's use TF-IDF to convert it into numerical features.

```python
tfidf_vectorizer = TfidfVectorizer(max_features=5)  # Limit features for simplicity

# Fit on training data and transform
tfidf_features_train = tfidf_vectorizer.fit_transform(X_train['LastFeedback'])
# Transform test data
tfidf_features_test = tfidf_vectorizer.transform(X_test['LastFeedback'])

# Get feature names
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_feature_names = [f"feedback_{name}" for name in tfidf_feature_names]  # Add prefix

# Create DataFrames (TF-IDF returns a sparse matrix by default, so convert to dense)
tfidf_df_train = pd.DataFrame(tfidf_features_train.toarray(), columns=tfidf_feature_names, index=X_train.index)
tfidf_df_test = pd.DataFrame(tfidf_features_test.toarray(), columns=tfidf_feature_names, index=X_test.index)

# Add back and drop original 'LastFeedback'
X_train = pd.concat([X_train.drop('LastFeedback', axis=1), tfidf_df_train], axis=1)
X_test = pd.concat([X_test.drop('LastFeedback', axis=1), tfidf_df_test], axis=1)

print("\nTF-IDF Features from LastFeedback (Training Data snippet):")
print(X_train[tfidf_feature_names].head())
```

We limited `max_features` to 5 for this example. In practice, you might allow more features or use techniques like n-grams.
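For instance, word pairs such as "issue resolved" or "complaint filed" carry more meaning together than as separate tokens. The following is only an illustrative sketch of how a bigram-aware vectorizer could be configured; it is not reused later in this exercise:

```python
# Illustrative only: include unigrams and bigrams (word pairs) in the vocabulary.
# We fit on the raw feedback column of the full sample df just to show the
# resulting terms; in a real workflow you would fit on the training split only,
# exactly as done above.
tfidf_bigrams = TfidfVectorizer(ngram_range=(1, 2), max_features=15)
tfidf_bigrams.fit(df['LastFeedback'])
print(tfidf_bigrams.get_feature_names_out())
# Expect terms like 'issue resolved' and 'complaint filed' alongside single words.
```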
## Feature Selection

We now have a larger set of features. Let's select the most relevant ones.

### 1. Filter Method: SelectKBest

We can use statistical tests to score features and select the top k. Since our target `Purchased` is binary, `f_classif` (ANOVA F-value) is appropriate for numerical inputs.

```python
# Ensure all data is numeric and handle potential NaNs if any were introduced
X_train_numeric = X_train.select_dtypes(include=np.number).fillna(0)  # Simple imputation for demo
X_test_numeric = X_test.select_dtypes(include=np.number).fillna(0)

# Ensure columns match after potential drops/adds
common_cols = list(set(X_train_numeric.columns) & set(X_test_numeric.columns))
X_train_numeric = X_train_numeric[common_cols]
X_test_numeric = X_test_numeric[common_cols]

k_best = 10  # Select top 10 features
selector_kbest = SelectKBest(score_func=f_classif, k=k_best)

# Fit on training data
selector_kbest.fit(X_train_numeric, y_train)

# Get selected feature names
selected_features_mask = selector_kbest.get_support()
selected_features_kbest = X_train_numeric.columns[selected_features_mask]

print(f"\nTop {k_best} features selected by SelectKBest:")
print(selected_features_kbest.tolist())

# You could then filter your DataFrame:
# X_train_kbest = X_train_numeric[selected_features_kbest]
# X_test_kbest = X_test_numeric[selected_features_kbest]
```

### 2. Embedded Method: Feature Importance (Random Forest)

Tree-based models calculate feature importances during training.

```python
# Use a simple RandomForest to get importances
rf = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
rf.fit(X_train_numeric, y_train)

importances = pd.Series(rf.feature_importances_, index=X_train_numeric.columns)
importances_sorted = importances.sort_values(ascending=False)

print("\nFeature Importances from RandomForest:")
print(importances_sorted.head(10))  # Display top 10

# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x=importances_sorted.head(10), y=importances_sorted.head(10).index, palette="viridis")
plt.title('Top 10 Feature Importances (Random Forest)')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.tight_layout()
plt.show()
```

*Bar chart showing the relative importance of the top 10 features as determined by a Random Forest classifier trained on the engineered features.*

Feature importance often provides a good indication of which engineered features are contributing most to the model's predictions.
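The imports at the top also bring in `RFE`, a wrapper method that repeatedly fits a model and prunes the weakest features. We don't use it in the rest of this exercise, but here is a minimal sketch of how it could be applied to the same numeric training data, reusing `X_train_numeric`, `y_train`, and the random forest settings from above:

```python
# Wrapper method: recursively eliminate features based on the estimator's importances.
# Not used further in this exercise; shown for comparison with the filter and
# embedded methods above.
rfe_estimator = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
rfe = RFE(estimator=rfe_estimator, n_features_to_select=10, step=1)
rfe.fit(X_train_numeric, y_train)

selected_features_rfe = X_train_numeric.columns[rfe.support_]
print("Features selected by RFE:")
print(selected_features_rfe.tolist())
```

Because RFE refits the estimator many times, it is noticeably slower than the filter and embedded approaches, which is worth keeping in mind on larger feature sets.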
## Dimensionality Reduction with PCA

Let's apply PCA to the scaled numerical features together with the polynomial features, to see if we can reduce dimensions while retaining variance. Because the polynomial terms were built from unscaled values and PCA is sensitive to feature scale, we standardize the PCA inputs first.

```python
# Select features for PCA (scaled numerical + polynomial)
pca_cols = numerical_cols + new_poly_cols
X_train_pca_subset = X_train[pca_cols].fillna(0)  # Ensure no NaNs
X_test_pca_subset = X_test[pca_cols].fillna(0)

# Standardize the PCA inputs (the polynomial terms are still on their original scale)
pca_scaler = StandardScaler()
X_train_pca_scaled = pca_scaler.fit_transform(X_train_pca_subset)
X_test_pca_scaled = pca_scaler.transform(X_test_pca_subset)

pca = PCA(n_components=0.95)  # Retain 95% of the variance

# Fit on training data only
pca.fit(X_train_pca_scaled)

# Transform both sets
X_train_pca = pca.transform(X_train_pca_scaled)
X_test_pca = pca.transform(X_test_pca_scaled)

print(f"\nOriginal number of features for PCA: {X_train_pca_subset.shape[1]}")
print(f"Number of PCA components retaining 95% variance: {pca.n_components_}")

# Optional: Create a DataFrame for the PCA components
pca_comp_names = [f"PCA_{i+1}" for i in range(pca.n_components_)]
pca_df_train = pd.DataFrame(X_train_pca, columns=pca_comp_names, index=X_train.index)
# You could add these back to X_train, possibly replacing the original columns used in PCA

# Plot the explained variance ratio
plt.figure(figsize=(8, 5))
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o', linestyle='--')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Explained Variance')
plt.grid(True)
plt.axhline(y=0.95, color='r', linestyle='-', label='95% Variance Threshold')
plt.legend()
plt.show()
```

*Line plot showing the cumulative explained variance as the number of principal components increases. A horizontal line indicates the 95% variance threshold.*

PCA transformed our selected features into a smaller set of orthogonal components. This can be useful for visualization, noise reduction, or as input to models sensitive to high dimensionality.

## Final Feature Set

After this process, `X_train` and `X_test` contain a mix of original (scaled), engineered (binned, polynomial, encoded, TF-IDF), and potentially PCA-derived features. The specific features you keep would depend on the results of your feature selection process (e.g., keeping the top k from SelectKBest or those above an importance threshold from the Random Forest).

```python
# Example: Combining selected features (replace with your actual selection)
# Assuming we chose features based on RandomForest importance > 0.01
important_features = importances_sorted[importances_sorted > 0.01].index.tolist()

# Keep only the selected important features + potentially PCA components if used
# This requires careful merging based on indices
X_train_final = X_train_numeric[important_features]  # Example using RF selection
X_test_final = X_test_numeric[important_features]

# Or, if using PCA components instead of the originals:
# X_train_combined = pd.concat([X_train.drop(pca_cols, axis=1), pca_df_train], axis=1)

print("\nFinal Training Data Shape (Example Selection):", X_train_final.shape)
print(X_train_final.head())
```

This refined feature set (`X_train_final`, `X_test_final`) is now ready to be fed into the machine learning models we'll explore in the next chapter.

## Conclusion

This hands-on exercise demonstrated the practical application of the feature engineering and selection techniques covered in this chapter. You've seen how to:

- Generate new numerical features using binning and polynomials.
- Encode categorical data appropriately using one-hot and ordinal methods.
- Create basic text features using TF-IDF.
- Select relevant features using statistical tests and model importances.
- Reduce dimensionality using PCA.

Remember that feature engineering is often an iterative process. You might try different techniques, evaluate their impact on model performance (using methods discussed in Chapter 3), and refine your feature set accordingly. The transformations and selections performed here significantly alter the data representation, aiming to provide a clearer signal for predictive models.