Alright, let's put the theory from this chapter into practice. We've discussed various ways to transform raw data into more informative features and how to select the most impactful ones. This hands-on section will guide you through applying these techniques using Python libraries like Pandas and Scikit-learn. The goal isn't just to run code, but to understand why we're applying specific transformations and selection methods.
We'll work with a dataset representing customer information and their likelihood to purchase a specific product. Imagine we've already performed the initial data loading and basic cleaning steps covered in Chapter 1.
First, let's import the necessary libraries and create a sample DataFrame. In a real project, you would load your data using pd.read_csv or a similar function.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
KBinsDiscretizer, PolynomialFeatures, StandardScaler, OneHotEncoder, OrdinalEncoder
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier # For feature importance
import matplotlib.pyplot as plt
import seaborn as sns
# Sample Data (replace with your actual data loading)
data = {
'CustomerID': range(1, 101),
'Age': np.random.randint(18, 70, 100),
'Income': np.random.normal(50000, 15000, 100).clip(10000),
'AccountBalance': np.random.normal(10000, 5000, 100).clip(0),
'NumTransactions': np.random.randint(0, 50, 100),
'EducationLevel': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 100, p=[0.3, 0.4, 0.2, 0.1]),
'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
'LastFeedback': np.random.choice(['Positive experience', 'Neutral', 'Issue resolved', 'Complaint filed', 'No feedback'], 100, p=[0.3, 0.2, 0.2, 0.1, 0.2]),
'Purchased': np.random.randint(0, 2, 100) # Target variable
}
df = pd.DataFrame(data)
# Separate features (X) and target (y)
X = df.drop(['CustomerID', 'Purchased'], axis=1)
y = df['Purchased']
# Split data for realistic evaluation (important to avoid leakage when fitting
# scalers, encoders, and other transformers).
# We fit each transformation on the training set only and then apply the *same*
# fitted transformer to the test set, mirroring how unseen data would be
# handled when building a real model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print("Original Training Data Shape:", X_train.shape)
print(X_train.head())
Let's apply some of the techniques we learned to the numerical columns: Age, Income, AccountBalance, and NumTransactions.
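Before transforming anything, it can help to glance at the raw distributions on the training split; a quick check using the columns we already have:
# Summary statistics for the raw numerical columns (training split only)
print(X_train[['Age', 'Income', 'AccountBalance', 'NumTransactions']].describe().round(2))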
1. Binning Numerical Data
Binning can help capture non-linear effects or group continuous variables into meaningful categories. Let's bin Age into four categories.
# Bin Age into 4 quantiles
binner = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile', subsample=None) # Use subsample=None for exact quantiles on smaller data
# Fit on training data and transform
X_train['Age_Binned'] = binner.fit_transform(X_train[['Age']])
# Transform test data using the *fitted* binner
X_test['Age_Binned'] = binner.transform(X_test[['Age']])
print("\nAge Binned (Training Data):")
print(X_train[['Age', 'Age_Binned']].head())
We used quantile-based binning here, creating bins with roughly equal numbers of samples. The encode='ordinal' setting assigns numerical labels (0, 1, 2, 3) to the bins.
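If you want to see exactly where the quantile boundaries fall, the fitted binner stores them:
# The learned quantile cut points for Age (one array per binned feature)
print("Age bin edges:", binner.bin_edges_[0])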
2. Polynomial Features
Generating polynomial features can help models capture interaction effects and non-linear relationships. Let's create squared and interaction terms from Income and NumTransactions.
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
# Select columns for polynomial features
X_train_poly_subset = X_train[['Income', 'NumTransactions']]
X_test_poly_subset = X_test[['Income', 'NumTransactions']]
# Fit on training data and transform
poly_features_train = poly.fit_transform(X_train_poly_subset)
# Transform test data
poly_features_test = poly.transform(X_test_poly_subset)
# Get feature names
poly_feature_names = poly.get_feature_names_out(['Income', 'NumTransactions'])
# Create DataFrames for the new features
poly_df_train = pd.DataFrame(poly_features_train, columns=poly_feature_names, index=X_train.index)
poly_df_test = pd.DataFrame(poly_features_test, columns=poly_feature_names, index=X_test.index)
# The degree-1 outputs ('Income', 'NumTransactions') duplicate columns we
# already have, so drop them before adding the new terms back
poly_df_train = poly_df_train.drop(columns=['Income', 'NumTransactions'])
poly_df_test = poly_df_test.drop(columns=['Income', 'NumTransactions'])
# Add the new squared and interaction terms to our main DataFrames
X_train = pd.concat([X_train, poly_df_train], axis=1)
X_test = pd.concat([X_test, poly_df_test], axis=1)
print("\nPolynomial Features Added (Training Data snippet):")
print(X_train[poly_feature_names].head())
This created features like Income^2, NumTransactions^2, and the interaction term Income * NumTransactions (the degree-1 outputs were dropped because they duplicate the original columns). Be mindful that polynomial expansion can significantly increase the number of features, especially with higher degrees, as the sketch below illustrates.
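This sketch is illustrative only; it counts how many features PolynomialFeatures would produce from a hypothetical set of five input columns at a few degrees:
# Illustrative only: feature counts for 5 hypothetical input columns
for degree in (2, 3, 4):
    p = PolynomialFeatures(degree=degree, include_bias=False)
    p.fit(np.zeros((1, 5)))  # only the number of columns matters here
    print(f"degree={degree}: {p.n_output_features_} output features from 5 inputs")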
3. Scaling
Many algorithms perform better when numerical features are on a similar scale. Let's apply StandardScaler.
numerical_cols = ['Income', 'AccountBalance', 'NumTransactions'] # Exclude Age as we binned it
scaler = StandardScaler()
# Fit on training data only
scaler.fit(X_train[numerical_cols])
# Transform both train and test sets
X_train[numerical_cols] = scaler.transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
print("\nScaled Numerical Features (Training Data snippet):")
print(X_train[numerical_cols].head())
print("\nMean after scaling (should be close to 0):")
print(X_train[numerical_cols].mean())
print("\nStandard Deviation after scaling (should be close to 1):")
print(X_train[numerical_cols].std())
Remember: Fit the scaler only on the training data to prevent information leakage from the test set into the scaling parameters (mean and standard deviation).
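One way to make the fit-on-train-only rule harder to get wrong is to wrap the scaler and a model in a Pipeline, so every step is fitted only on whatever data is passed to fit(). A minimal sketch, using a logistic regression purely as a placeholder estimator (not run here):
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Each pipeline step is fitted on the training data only when .fit() is called,
# which keeps cross-validation and train/test evaluation leakage-free
leak_free_model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),  # placeholder estimator
])
# leak_free_model.fit(X_train[numerical_cols], y_train)
# print(leak_free_model.score(X_test[numerical_cols], y_test))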
Now let's handle the categorical columns: EducationLevel, Region, and LastFeedback.
1. One-Hot Encoding
Region is a nominal variable (no inherent order), so one-hot encoding is suitable.
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore') # handle_unknown='ignore' is safer for unseen categories in test set
# Select column
X_train_region = X_train[['Region']]
X_test_region = X_test[['Region']]
# Fit on training data and transform
ohe_features_train = ohe.fit_transform(X_train_region)
# Transform test data
ohe_features_test = ohe.transform(X_test_region)
# Get feature names
ohe_feature_names = ohe.get_feature_names_out(['Region'])
# Create DataFrames
ohe_df_train = pd.DataFrame(ohe_features_train, columns=ohe_feature_names, index=X_train.index)
ohe_df_test = pd.DataFrame(ohe_features_test, columns=ohe_feature_names, index=X_test.index)
# Add back and drop original 'Region'
X_train = pd.concat([X_train.drop('Region', axis=1), ohe_df_train], axis=1)
X_test = pd.concat([X_test.drop('Region', axis=1), ohe_df_test], axis=1)
print("\nOne-Hot Encoded Region (Training Data snippet):")
print(X_train[ohe_feature_names].head())
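If these dummy columns will feed a linear model, you might also drop one category per feature to avoid perfectly collinear columns; OneHotEncoder supports this via its drop parameter. A variant of the encoder above, shown for reference only and not applied here:
# Variant (not applied here): drop the first category of each feature to
# avoid the redundant dummy column that linear models dislike
ohe_drop_first = OneHotEncoder(drop='first', sparse_output=False)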
2. Ordinal Encoding
EducationLevel has a clear order, so we can use ordinal encoding.
# Define the order explicitly
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
ordinal_encoder = OrdinalEncoder(categories=[education_order]) # Pass the order
# Fit and transform training data
X_train['EducationLevel_Encoded'] = ordinal_encoder.fit_transform(X_train[['EducationLevel']])
# Transform test data
X_test['EducationLevel_Encoded'] = ordinal_encoder.transform(X_test[['EducationLevel']])
# Drop original column
X_train = X_train.drop('EducationLevel', axis=1)
X_test = X_test.drop('EducationLevel', axis=1)
print("\nOrdinal Encoded Education Level (Training Data):")
print(X_train[['EducationLevel_Encoded']].head())
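It's worth confirming that the fitted encoder used the order we specified; the mapping is stored in categories_, where a category's position is its encoded value:
# Position in this list = the integer assigned by the encoder (0 .. 3)
print("EducationLevel mapping:", list(ordinal_encoder.categories_[0]))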
The LastFeedback column contains short snippets of text. Let's use TF-IDF to convert it into numerical features.
tfidf_vectorizer = TfidfVectorizer(max_features=5) # Limit features for simplicity
# Fit on training data and transform
tfidf_features_train = tfidf_vectorizer.fit_transform(X_train['LastFeedback'])
# Transform test data
tfidf_features_test = tfidf_vectorizer.transform(X_test['LastFeedback'])
# Get feature names
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_feature_names = [f"feedback_{name}" for name in tfidf_feature_names] # Add prefix
# Create DataFrames (TF-IDF returns sparse matrix by default, convert to dense)
tfidf_df_train = pd.DataFrame(tfidf_features_train.toarray(), columns=tfidf_feature_names, index=X_train.index)
tfidf_df_test = pd.DataFrame(tfidf_features_test.toarray(), columns=tfidf_feature_names, index=X_test.index)
# Add back and drop original 'LastFeedback'
X_train = pd.concat([X_train.drop('LastFeedback', axis=1), tfidf_df_train], axis=1)
X_test = pd.concat([X_test.drop('LastFeedback', axis=1), tfidf_df_test], axis=1)
print("\nTF-IDF Features from LastFeedback (Training Data snippet):")
print(X_train[tfidf_feature_names].head())
We limited max_features to 5 for this example. In practice, you might allow more features or use techniques like n-grams, as sketched below.
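For instance, a variant of the vectorizer could include bigrams, remove common English stop words, and keep a larger vocabulary. This is only a sketch and is not applied to our DataFrames:
# Sketch of a richer text representation (unigrams + bigrams, stop words removed)
tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', max_features=50)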
We now have a larger set of features. Let's select the most relevant ones.
1. Filter Method: SelectKBest
We can use statistical tests to score features and select the top k. Since our target Purchased is binary and our inputs are numerical, f_classif (the ANOVA F-value) is appropriate.
# Ensure all data is numeric and handle potential NaNs if any were introduced
X_train_numeric = X_train.select_dtypes(include=np.number).fillna(0) # Simple imputation for demo
X_test_numeric = X_test.select_dtypes(include=np.number).fillna(0)
# Ensure columns match after potential drops/adds (preserving training-column order)
common_cols = [col for col in X_train_numeric.columns if col in X_test_numeric.columns]
X_train_numeric = X_train_numeric[common_cols]
X_test_numeric = X_test_numeric[common_cols]
k_best = 10 # Select top 10 features
selector_kbest = SelectKBest(score_func=f_classif, k=k_best)
# Fit on training data
selector_kbest.fit(X_train_numeric, y_train)
# Get selected feature names
selected_features_mask = selector_kbest.get_support()
selected_features_kbest = X_train_numeric.columns[selected_features_mask]
print(f"\nTop {k_best} features selected by SelectKBest:")
print(selected_features_kbest.tolist())
# You could then filter your DataFrame:
# X_train_kbest = X_train_numeric[selected_features_kbest]
# X_test_kbest = X_test_numeric[selected_features_kbest]
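Beyond the selection mask, the fitted selector also exposes the raw F-scores, which help you judge how sharply relevance drops off after the top features:
# Inspect the ANOVA F-scores behind the selection, highest first
f_scores = pd.Series(selector_kbest.scores_, index=X_train_numeric.columns)
print(f_scores.sort_values(ascending=False).head(k_best))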
2. Embedded Method: Feature Importance (Random Forest)
Tree-based models calculate feature importances during training.
# Use a simple RandomForest to get importances
rf = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
rf.fit(X_train_numeric, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train_numeric.columns)
importances_sorted = importances.sort_values(ascending=False)
print("\nFeature Importances from RandomForest:")
print(importances_sorted.head(10)) # Display top 10
# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x=importances_sorted.head(10), y=importances_sorted.head(10).index, palette="viridis")
plt.title('Top 10 Feature Importances (Random Forest)')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.tight_layout()
plt.show()
Bar chart showing the relative importance of the top 10 features as determined by a Random Forest classifier trained on the engineered features.
Feature importance often provides a good indication of which engineered features are contributing most to the model's predictions.
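Keep in mind that impurity-based importances from tree ensembles can favour high-cardinality or continuous features. A common cross-check is permutation importance on held-out data; a minimal sketch using the model and numeric frames from above:
from sklearn.inspection import permutation_importance

# How much does shuffling each column hurt test-set performance?
perm = permutation_importance(rf, X_test_numeric, y_test,
                              n_repeats=10, random_state=42, n_jobs=-1)
perm_importances = pd.Series(perm.importances_mean, index=X_test_numeric.columns)
print(perm_importances.sort_values(ascending=False).head(10))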
Let's apply PCA to the numerical and polynomial features to see if we can reduce dimensions while retaining variance. Because PCA is sensitive to feature scale and the polynomial terms are still on their raw scale, we standardize this subset first.
# Select features for PCA (scaled numerical + polynomial terms), avoiding duplicate names
pca_cols = list(dict.fromkeys(list(numerical_cols) + list(poly_feature_names)))
X_train_pca_subset = X_train[pca_cols].fillna(0) # Ensure no NaNs
X_test_pca_subset = X_test[pca_cols].fillna(0)
# Standardize the subset (the polynomial terms were never scaled), fitting on training data only
pca_scaler = StandardScaler()
X_train_pca_subset = pca_scaler.fit_transform(X_train_pca_subset)
X_test_pca_subset = pca_scaler.transform(X_test_pca_subset)
pca = PCA(n_components=0.95) # Retain 95% of variance
# Fit on training data only
pca.fit(X_train_pca_subset)
# Transform both sets
X_train_pca = pca.transform(X_train_pca_subset)
X_test_pca = pca.transform(X_test_pca_subset)
print(f"\nOriginal number of features for PCA: {X_train_pca_subset.shape[1]}")
print(f"Number of PCA components retaining 95% variance: {pca.n_components_}")
# Optional: Create a DataFrame for PCA components
pca_comp_names = [f"PCA_{i+1}" for i in range(pca.n_components_)]
pca_df_train = pd.DataFrame(X_train_pca, columns=pca_comp_names, index=X_train.index)
# You could add these back to X_train, possibly replacing the original columns used in PCA
# Plot explained variance ratio
plt.figure(figsize=(8, 5))
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o', linestyle='--')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Explained Variance')
plt.grid(True)
plt.axhline(y=0.95, color='r', linestyle='-', label='95% Variance Threshold')
plt.legend()
plt.show()
Line plot showing the cumulative explained variance as the number of principal components increases. A horizontal line indicates the 95% variance threshold.
PCA transformed our selected features into a smaller set of orthogonal components. This can be useful for visualization, noise reduction, or as input to models sensitive to high dimensionality.
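If you want to interpret what each component represents, the loadings (the weight of each original column in each component) are available on the fitted PCA object:
# Loadings: contribution of each original column to each principal component
loadings = pd.DataFrame(pca.components_, columns=pca_cols, index=pca_comp_names)
print(loadings.round(2))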
After this process, X_train and X_test contain a mix of original (scaled), engineered (binned, polynomial, encoded, TF-IDF), and potentially PCA-derived features. The specific features you keep will depend on the results of your feature selection process (e.g., keeping the top k from SelectKBest, or those above an importance threshold from the Random Forest).
# Example: Combining selected features (replace with your actual selection)
# Assuming we chose features based on RandomForest importance > 0.01
important_features = importances_sorted[importances_sorted > 0.01].index.tolist()
# Keep only selected important features + potentially PCA components if used
# This requires careful merging based on indices
X_train_final = X_train_numeric[important_features] # Example using RF selection
X_test_final = X_test_numeric[important_features]
# Or, if using PCA components instead of originals:
# X_train_combined = pd.concat([X_train.drop(pca_cols, axis=1), pca_df_train], axis=1)
print("\nFinal Training Data Shape (Example Selection):", X_train_final.shape)
print(X_train_final.head())
This refined feature set (X_train_final, X_test_final) is now ready to be fed into the machine learning models we'll explore in the next chapter.
This hands-on exercise demonstrated the practical application of the feature engineering and selection techniques covered in this chapter. You've seen how to bin and scale numerical features, generate polynomial and interaction terms, encode nominal and ordinal categorical variables, extract TF-IDF features from text, rank features with filter and embedded methods, and reduce dimensionality with PCA.
Remember that feature engineering is often an iterative process. You might try different techniques, evaluate their impact on model performance (using methods discussed in Chapter 3), and refine your feature set accordingly. The transformations and selections performed here significantly alter the data representation, aiming to provide a clearer signal for predictive models.
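As one simple way to gauge whether the engineered and selected features help, you could compare cross-validated scores on the training set before and after selection. A minimal sketch reusing the Random Forest from above (the scoring choice is just an example):
from sklearn.model_selection import cross_val_score

# Compare 5-fold accuracy of the full numeric feature set vs. the selected subset
scores_all = cross_val_score(rf, X_train_numeric, y_train, cv=5, scoring='accuracy')
scores_selected = cross_val_score(rf, X_train_final, y_train, cv=5, scoring='accuracy')
print(f"All features:      {scores_all.mean():.3f} +/- {scores_all.std():.3f}")
print(f"Selected features: {scores_selected.mean():.3f} +/- {scores_selected.std():.3f}")
Differences of even a few points in these scores are typically what guide the next iteration of feature work.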