Our exploration doesn't stop at understanding existing variables. A significant part of preparing data for modeling, informed directly by EDA, involves creating new features from the ones already present in your dataset. This process, often called feature engineering, can significantly improve the performance of machine learning models by highlighting underlying patterns or relationships that weren't explicit in the original data.
Think of it as crafting better tools for your analysis or modeling task. Insights gained from viewing histograms, scatter plots, and frequency counts often suggest ways to combine, transform, or extract information more effectively. Let's look at some common techniques.
Sometimes, the combined effect of two features is more informative than their individual effects. For example, in a housing dataset, the area of a room (length * width) might be a better predictor of price than length or width alone. Similarly, in a marketing context, the interaction between a user's age and their time_spent_on_site might reveal engagement patterns specific to certain age groups.
Creating these interaction terms in Pandas is often straightforward using basic arithmetic operations.
import pandas as pd

# Sample DataFrame
data = {'length': [10, 12, 8, 15],
        'width': [5, 6, 4, 7],
        'price': [500, 700, 300, 1000]}
df = pd.DataFrame(data)

# Create an 'area' interaction feature
df['area'] = df['length'] * df['width']

print(df)
#    length  width  price  area
# 0      10      5    500    50
# 1      12      6    700    72
# 2       8      4    300    32
# 3      15      7   1000   105
You might consider creating interaction terms when bivariate analysis (like scatter plots between numerical variables or grouped analysis for numerical vs. categorical) suggests that the relationship between one variable and a target depends on the level of another variable.
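One quick, informal check is to compute the feature-target correlation within each level of a second variable; if the strength or sign shifts between levels, an interaction term is a natural candidate. The frame and column names below (segment, time_spent, conversion) are hypothetical, chosen only to illustrate the pattern:
# Hypothetical data: does the time_spent/conversion relationship differ by segment?
df_mkt = pd.DataFrame({
    'segment': ['young', 'young', 'young', 'older', 'older', 'older'],
    'time_spent': [5, 10, 15, 5, 10, 15],
    'conversion': [0.1, 0.3, 0.6, 0.5, 0.4, 0.2]})

# Correlation between time_spent and conversion within each segment
print(df_mkt.groupby('segment')[['time_spent', 'conversion']]
      .apply(lambda g: g['time_spent'].corr(g['conversion'])))
# segment
# older   -0.981981
# young    0.993399
# dtype: float64
Opposite signs like these would suggest multiplying the two variables (or fitting per-segment slopes) rather than using time_spent alone.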
Scatter plots might reveal non-linear relationships between variables. A simple linear model might not capture such curves effectively. Creating polynomial features involves adding powers (squared, cubed, etc.) of an existing numerical feature. This allows linear models to fit non-linear patterns.
For instance, if age relates to income in a curved manner (perhaps increasing initially, then leveling off or decreasing), adding age^2 as a feature can help capture this.
# Sample DataFrame (adding age)
data = {'age': [25, 35, 45, 55, 65],
        'income': [50, 80, 95, 100, 90]}
df_poly = pd.DataFrame(data)

# Create a squared age feature
df_poly['age_squared'] = df_poly['age'] ** 2

print(df_poly)
#    age  income  age_squared
# 0   25      50          625
# 1   35      80         1225
# 2   45      95         2025
# 3   55     100         3025
# 4   65      90         4225
Adding polynomial features should be guided by visualizations. If a scatter plot of y vs. x shows a clear parabolic shape, adding x^2 might be beneficial. Be cautious, however, as adding too many high-order polynomial features can lead to overfitting. Libraries like Scikit-learn provide dedicated tools (PolynomialFeatures) for generating these systematically.
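As a brief illustration of that route (assuming scikit-learn is available), the following sketch generates the squared term for the df_poly frame above; with more input columns, degree=2 would also produce their pairwise products:
from sklearn.preprocessing import PolynomialFeatures

# Generate age and age^2 in one step (include_bias=False drops the constant column)
poly = PolynomialFeatures(degree=2, include_bias=False)
age_features = poly.fit_transform(df_poly[['age']])

print(poly.get_feature_names_out())  # ['age' 'age^2']
print(age_features[:2])
# [[  25.  625.]
#  [  35. 1225.]]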
Date and time columns often contain rich, latent information. A single timestamp like 2023-10-27 15:30:00 can be broken down into components, such as the year, month, day of the week, hour, or a weekend flag, any of which might individually correlate with your target variable. Pandas provides convenient accessor methods via the .dt attribute for Series with a datetime dtype.
# Sample DataFrame with a datetime column
dates = pd.to_datetime(['2023-01-15 10:00:00', '2023-01-16 18:30:00', '2023-02-20 09:15:00'])
df_dates = pd.DataFrame({'timestamp': dates, 'value': [10, 15, 12]})

# Ensure the column is datetime type
df_dates['timestamp'] = pd.to_datetime(df_dates['timestamp'])

# Extract components
df_dates['year'] = df_dates['timestamp'].dt.year
df_dates['month'] = df_dates['timestamp'].dt.month
df_dates['dayofweek'] = df_dates['timestamp'].dt.dayofweek  # Monday=0, Sunday=6
df_dates['hour'] = df_dates['timestamp'].dt.hour
df_dates['is_weekend'] = df_dates['dayofweek'].isin([5, 6]).astype(int)  # 1 if weekend, 0 otherwise

print(df_dates)
#             timestamp  value  year  month  dayofweek  hour  is_weekend
# 0 2023-01-15 10:00:00     10  2023      1          6    10           1
# 1 2023-01-16 18:30:00     15  2023      1          0    18           0
# 2 2023-02-20 09:15:00     12  2023      2          0     9           0
Analyzing time series plots or comparing distributions across different time components (e.g., box plots of sales per day of the week) during EDA can guide which components are worth extracting.
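For instance, a quick grouped summary over an extracted component can hint at whether it carries signal (with only three rows here, the numbers are purely illustrative):
# Average value per day of week (toy-sized, for illustration only)
print(df_dates.groupby('dayofweek')['value'].mean())
# dayofweek
# 0    13.5
# 6    10.0
# Name: value, dtype: float64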
Sometimes it's useful to convert a continuous numerical variable into discrete categories or bins. This can help capture non-linear effects or simplify the model. For example, instead of using exact age, you might group individuals into age_group categories like '18-25', '26-40', '41-60', and '61+'.
Histograms generated during univariate analysis are often the primary motivation for binning. If the histogram shows distinct groups, or if you believe the relationship with the target variable changes significantly across certain thresholds, binning can be effective. Pandas' cut function is ideal for this.
# Using the previous df_poly example
age_bins = [18, 25, 40, 60, 100]                 # Define bin edges
age_labels = ['18-25', '26-40', '41-60', '61+']  # Define labels for bins
df_poly['age_group'] = pd.cut(df_poly['age'], bins=age_bins, labels=age_labels, right=True)

print(df_poly)
#    age  income  age_squared age_group
# 0   25      50          625     18-25
# 1   35      80         1225     26-40
# 2   45      95         2025     41-60
# 3   55     100         3025     41-60
# 4   65      90         4225       61+
You can choose between equal-width bins, equal-frequency bins (using qcut), or custom bins based on domain knowledge or EDA observations.
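As a short sketch, qcut derives the bin edges from the data's quantiles so that each bin holds roughly the same number of observations; the age_tercile name and three-way split are arbitrary choices for illustration:
# Equal-frequency binning: edges come from quantiles of 'age'
df_poly['age_tercile'] = pd.qcut(df_poly['age'], q=3, labels=['low', 'mid', 'high'])
print(df_poly[['age', 'age_tercile']])
#    age age_tercile
# 0   25         low
# 1   35         low
# 2   45         mid
# 3   55        high
# 4   65        high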
Categorical features sometimes have many categories, some of which occur very infrequently. These rare categories might add noise rather than signal. During EDA, examining frequency counts (using .value_counts()) reveals such rare categories. You might decide to group them into a single 'Other' category.
# Sample DataFrame with a categorical feature
data_cat = {'category': ['A', 'B', 'A', 'C', 'B', 'A', 'D', 'E', 'B', 'D'],
            'value': [10, 15, 12, 20, 18, 11, 25, 30, 16, 22]}
df_cat = pd.DataFrame(data_cat)

# Identify infrequent categories (e.g., occurring only once)
counts = df_cat['category'].value_counts()
rare_categories = counts[counts < 2].index.tolist()  # ['C', 'E']

# Replace rare categories with 'Other'
df_cat['category_grouped'] = df_cat['category'].replace(rare_categories, 'Other')

print(df_cat)
#   category  value category_grouped
# 0        A     10                A
# 1        B     15                B
# 2        A     12                A
# 3        C     20            Other
# 4        B     18                B
# 5        A     11                A
# 6        D     25                D
# 7        E     30            Other
# 8        B     16                B
# 9        D     22                D

print("\nNew Counts:")
print(df_cat['category_grouped'].value_counts())
# A        3
# B        3
# D        2
# Other    2
# Name: category_grouped, dtype: int64
This simplifies the feature and can sometimes improve model stability. The threshold for defining "rare" depends on the dataset size and the specific problem.
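A common way to make the rule scale with the data is a relative-frequency threshold rather than a fixed count; the 1% cutoff below is an arbitrary starting point, not a recommendation:
# Relative threshold: categories covering under 1% of rows count as rare
freq = df_cat['category'].value_counts(normalize=True)
rare = freq[freq < 0.01].index.tolist()
print(rare)  # [] on this 10-row toy frame; on real data, pass the list to .replace(..., 'Other')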
Creating new features is an iterative process. You might create a feature, visualize its relationship with other variables or the target, and then refine it or try a different approach. The techniques discussed here provide a solid foundation for translating your EDA insights into more powerful inputs for subsequent analysis or modeling stages. Remember that domain knowledge often plays a significant role in suggesting meaningful feature transformations.