Our exploration doesn't stop at understanding existing variables. A significant part of preparing data for modeling, informed directly by EDA, involves creating new features from the ones already present in your dataset. This process, often called feature engineering, can significantly improve the performance of machine learning models by highlighting underlying patterns or relationships that weren't explicit in the original data.
Think of it as crafting better tools for your analysis or modeling task. Insights gained from viewing histograms, scatter plots, and frequency counts often suggest ways to combine, transform, or extract information more effectively. Let's look at some common techniques.
Sometimes, the combined effect of two features is more informative than their individual effects. For example, in a housing dataset, the area of a room (length * width) might be a better predictor of price than length or width alone. Similarly, in a marketing context, the interaction between a user's age and their time_spent_on_site might reveal engagement patterns specific to certain age groups.
Creating these interaction terms in Pandas is often straightforward using basic arithmetic operations.
import pandas as pd

# Sample DataFrame
data = {'length': [10, 12, 8, 15],
        'width': [5, 6, 4, 7],
        'price': [500, 700, 300, 1000]}
df = pd.DataFrame(data)

# Create an 'area' interaction feature
df['area'] = df['length'] * df['width']

print(df)
#    length  width  price  area
# 0      10      5    500    50
# 1      12      6    700    72
# 2       8      4    300    32
# 3      15      7   1000   105
You might consider creating interaction terms when bivariate analysis (like scatter plots between numerical variables or grouped analysis for numerical vs. categorical) suggests that the relationship between one variable and a target depends on the level of another variable.
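One quick, informal check is to compute the feature-target correlation within each level of a second variable; if the strength or sign shifts between levels, an interaction term is a natural candidate. The frame and column names below (segment, time_spent, conversion) are hypothetical, chosen only to illustrate the pattern:
# Hypothetical data: does the time_spent/conversion relationship differ by segment?
df_mkt = pd.DataFrame({
    'segment': ['young', 'young', 'young', 'older', 'older', 'older'],
    'time_spent': [5, 10, 15, 5, 10, 15],
    'conversion': [0.1, 0.3, 0.6, 0.5, 0.4, 0.2]})

# Correlation between time_spent and conversion within each segment
print(df_mkt.groupby('segment')[['time_spent', 'conversion']]
      .apply(lambda g: g['time_spent'].corr(g['conversion'])))
# segment
# older   -0.981981
# young    0.993399
# dtype: float64
Opposite signs like these would suggest multiplying the two variables (or fitting per-segment slopes) rather than using time_spent alone.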
Scatter plots might reveal non-linear relationships between variables. A simple linear model might not capture such curves effectively. Creating polynomial features involves adding powers (squared, cubed, etc.) of an existing numerical feature. This allows linear models to fit non-linear patterns.
For instance, if age relates to income in a curved manner (perhaps increasing initially, then leveling off or decreasing), adding age^2 as a feature can help capture this.
# Sample DataFrame (adding age)
data = {'age': [25, 35, 45, 55, 65],
        'income': [50, 80, 95, 100, 90]}
df_poly = pd.DataFrame(data)

# Create a squared age feature
df_poly['age_squared'] = df_poly['age'] ** 2

print(df_poly)
#    age  income  age_squared
# 0   25      50          625
# 1   35      80         1225
# 2   45      95         2025
# 3   55     100         3025
# 4   65      90         4225
Adding polynomial features should be guided by visualizations. If a scatter plot of y vs. x shows a clear parabolic shape, adding x^2 might be beneficial. Be cautious, however, as adding too many high-order polynomial features can lead to overfitting. Libraries like Scikit-learn provide dedicated tools (PolynomialFeatures) for generating these systematically.
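As a brief illustration of that route (assuming scikit-learn is available), the following sketch generates the squared term for the df_poly frame above; with more input columns, degree=2 would also produce their pairwise products:
from sklearn.preprocessing import PolynomialFeatures

# Generate age and age^2 in one step (include_bias=False drops the constant column)
poly = PolynomialFeatures(degree=2, include_bias=False)
age_features = poly.fit_transform(df_poly[['age']])

print(poly.get_feature_names_out())  # ['age' 'age^2']
print(age_features[:2])
# [[  25.  625.]
#  [  35. 1225.]]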
Date and time columns often contain rich, latent information. A single timestamp like 2023-10-27 15:30:00 can be broken down into components, such as the year, month, day of the week, hour, or a weekend flag, any of which might individually correlate with your target variable. Pandas provides convenient accessor methods via the .dt attribute for Series with a datetime dtype.
# Sample DataFrame with a datetime column
dates = pd.to_datetime(['2023-01-15 10:00:00', '2023-01-16 18:30:00', '2023-02-20 09:15:00'])
df_dates = pd.DataFrame({'timestamp': dates, 'value': [10, 15, 12]})

# Ensure the column is datetime type
df_dates['timestamp'] = pd.to_datetime(df_dates['timestamp'])

# Extract components
df_dates['year'] = df_dates['timestamp'].dt.year
df_dates['month'] = df_dates['timestamp'].dt.month
df_dates['dayofweek'] = df_dates['timestamp'].dt.dayofweek  # Monday=0, Sunday=6
df_dates['hour'] = df_dates['timestamp'].dt.hour
df_dates['is_weekend'] = df_dates['dayofweek'].isin([5, 6]).astype(int)  # 1 if weekend, 0 otherwise

print(df_dates)
#             timestamp  value  year  month  dayofweek  hour  is_weekend
# 0 2023-01-15 10:00:00     10  2023      1          6    10           1
# 1 2023-01-16 18:30:00     15  2023      1          0    18           0
# 2 2023-02-20 09:15:00     12  2023      2          0     9           0
Analyzing time series plots or comparing distributions across different time components (e.g., box plots of sales per day of the week) during EDA can guide which components are worth extracting.
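For instance, a quick grouped summary over an extracted component can hint at whether it carries signal (with only three rows here, the numbers are purely illustrative):
# Average value per day of week (toy-sized, for illustration only)
print(df_dates.groupby('dayofweek')['value'].mean())
# dayofweek
# 0    13.5
# 6    10.0
# Name: value, dtype: float64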
Sometimes it's useful to convert a continuous numerical variable into discrete categories or bins. This can help capture non-linear effects or simplify the model. For example, instead of using exact age, you might group individuals into age_group categories like '18-25', '26-40', '41-60', and '61+'.
Histograms generated during univariate analysis are often the primary motivation for binning. If the histogram shows distinct groups, or if you believe the relationship with the target variable changes significantly across certain thresholds, binning can be effective. Pandas' cut function is ideal for this.
# Using the previous df_poly example
age_bins = [18, 25, 40, 60, 100]                 # Define bin edges
age_labels = ['18-25', '26-40', '41-60', '61+']  # Define labels for bins
df_poly['age_group'] = pd.cut(df_poly['age'], bins=age_bins, labels=age_labels, right=True)

print(df_poly)
#    age  income  age_squared age_group
# 0   25      50          625     18-25
# 1   35      80         1225     26-40
# 2   45      95         2025     41-60
# 3   55     100         3025     41-60
# 4   65      90         4225       61+
You can choose between equal-width bins, equal-frequency bins (using qcut), or custom bins based on domain knowledge or EDA observations.
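As a short sketch, qcut derives the bin edges from the data's quantiles so that each bin holds roughly the same number of observations; the age_tercile name and three-way split are arbitrary choices for illustration:
# Equal-frequency binning: edges come from quantiles of 'age'
df_poly['age_tercile'] = pd.qcut(df_poly['age'], q=3, labels=['low', 'mid', 'high'])
print(df_poly[['age', 'age_tercile']])
#    age age_tercile
# 0   25         low
# 1   35         low
# 2   45         mid
# 3   55        high
# 4   65        high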
Categorical features sometimes have many categories, some of which occur very infrequently. These rare categories might add noise rather than signal. During EDA, examining frequency counts (using .value_counts()) reveals such rare categories. You might decide to group them into a single 'Other' category.
# Sample DataFrame with a categorical feature
data_cat = {'category': ['A', 'B', 'A', 'C', 'B', 'A', 'D', 'E', 'B', 'D'],
            'value': [10, 15, 12, 20, 18, 11, 25, 30, 16, 22]}
df_cat = pd.DataFrame(data_cat)

# Identify infrequent categories (e.g., occurring only once)
counts = df_cat['category'].value_counts()
rare_categories = counts[counts < 2].index.tolist()  # ['C', 'E']

# Replace rare categories with 'Other'
df_cat['category_grouped'] = df_cat['category'].replace(rare_categories, 'Other')

print(df_cat)
#   category  value category_grouped
# 0        A     10                A
# 1        B     15                B
# 2        A     12                A
# 3        C     20            Other
# 4        B     18                B
# 5        A     11                A
# 6        D     25                D
# 7        E     30            Other
# 8        B     16                B
# 9        D     22                D

print("\nNew Counts:")
print(df_cat['category_grouped'].value_counts())
# A        3
# B        3
# D        2
# Other    2
# Name: category_grouped, dtype: int64
This simplifies the feature and can sometimes improve model stability. The threshold for defining "rare" depends on the dataset size and the specific problem.
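A common way to make the rule scale with the data is a relative-frequency threshold rather than a fixed count; the 1% cutoff below is an arbitrary starting point, not a recommendation:
# Relative threshold: categories covering under 1% of rows count as rare
freq = df_cat['category'].value_counts(normalize=True)
rare = freq[freq < 0.01].index.tolist()
print(rare)  # [] on this 10-row toy frame; on real data, pass the list to .replace(..., 'Other')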
Creating new features is an iterative process. You might create a feature, visualize its relationship with other variables or the target, and then refine it or try a different approach. The techniques discussed here provide a solid foundation for translating your EDA insights into more powerful inputs for subsequent analysis or modeling stages. Remember that domain knowledge often plays a significant role in suggesting meaningful feature transformations.