You've already learned how to clean, impute, encode, and scale the features provided in your raw dataset. These are essential steps for preparing data for machine learning algorithms. However, sometimes the original features, even after careful preparation, don't fully capture the underlying patterns or relationships needed for optimal model performance. This is where feature creation comes into play.
Think of the features you start with as the basic building blocks. Feature creation is the process of intelligently combining or transforming these blocks to construct new, more informative features. Why bother? Because the goal is to make the learning task easier for your model.
Machine learning models learn patterns from the data they are given. Some models, like linear regression or logistic regression, are inherently good at finding linear relationships. However, they might struggle if the true relationship between features and the target variable is more complex or non-linear.
Consider predicting house prices. A simple linear model might use `square_footage` and `number_of_bedrooms` as separate inputs. But maybe the value per square foot changes significantly depending on the size of the house: a large house might have a lower price per square foot than a small, efficiently designed one. The raw features `square_footage` and `number_of_bedrooms` don't directly represent this interaction. By creating a new feature, such as `price_per_square_foot` (if price is known during training, though often it's the target) or an interaction term like `square_footage * number_of_bedrooms`, we make this potential relationship explicit, which can help the model learn it more easily.
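To make this concrete, here is a minimal sketch of constructing such an interaction term with pandas. The DataFrame and its values are hypothetical stand-ins; only the `square_footage` and `number_of_bedrooms` column names come from the example above.

```python
import pandas as pd

# Hypothetical housing data; in practice this comes from your dataset.
df = pd.DataFrame({
    "square_footage": [850, 1200, 2400, 3100],
    "number_of_bedrooms": [2, 3, 4, 5],
})

# The interaction term makes the joint effect of size and bedroom
# count available to the model as a single explicit input.
df["sqft_x_bedrooms"] = df["square_footage"] * df["number_of_bedrooms"]
```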
Similarly, if a relationship follows a curve, like $y \approx x^2$, a linear model using only $x$ as a feature will provide a poor fit. Creating a new feature, $x_{\text{squared}} = x^2$, allows the linear model to fit the quadratic relationship effectively.
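As a sketch of one common approach, scikit-learn's `PolynomialFeatures` can generate the squared term automatically; the array below is made-up illustration data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single feature x; suppose the target follows y ≈ x^2.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# degree=2 adds x^2 alongside x; include_bias=False omits the constant column.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: [x, x^2]
```

A linear model fit on `X_poly` can now represent the quadratic curve, because the relationship is linear in the expanded feature set.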
[Figure: Flow showing how engineered features can provide more direct inputs to a machine learning model compared to relying solely on raw features.]
Often, you possess knowledge about the problem domain that isn't directly encoded in the raw data variables. Feature creation is a primary way to inject this understanding into the modeling process.
For example, if you have `transaction_timestamp` data, the raw timestamp might not be the most useful format for many models. Domain knowledge suggests that factors like the `day_of_week`, `month`, or `hour_of_day` could be significant predictors of customer behavior. Extracting these components creates new features that explicitly represent temporal patterns potentially relevant to the prediction task. Likewise, converting a `date_of_birth` feature into `age` is a common, domain-driven feature creation step.
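A brief sketch of these extractions using the pandas `.dt` accessor follows; the timestamps, the reference date, and the simple year-length approximation for `age` are all illustrative assumptions:

```python
import pandas as pd

# Hypothetical transaction timestamps.
df = pd.DataFrame({
    "transaction_timestamp": pd.to_datetime([
        "2024-01-15 09:30:00",
        "2024-03-02 18:45:00",
        "2024-07-21 23:10:00",
    ])
})

# Extract temporal components as new features.
df["day_of_week"] = df["transaction_timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
df["month"] = df["transaction_timestamp"].dt.month
df["hour_of_day"] = df["transaction_timestamp"].dt.hour

# Similarly, derive age from a date_of_birth column (approximate:
# divides elapsed days by the average year length of 365.25).
dob = pd.to_datetime(pd.Series(["1990-05-01", "1985-11-23", "2000-02-14"]))
reference = pd.Timestamp("2024-06-01")
age = ((reference - dob).dt.days / 365.25).astype(int)
```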
Sometimes, creating the right features allows you to use simpler, more interpretable models effectively. Instead of forcing a complex model (like a deep neural network or a very deep tree) to learn intricate interactions from raw features, you can engineer features that capture these interactions beforehand. This might allow a simpler model, like regularized linear regression, to achieve comparable or even better performance, often with the added benefits of faster training and easier interpretation.
In this chapter, we will look at several systematic ways to create new features, including interaction terms, polynomial features, and features derived from domain knowledge such as date and time components.
By mastering these techniques, you can often significantly enhance your model's predictive accuracy and its ability to generalize to new, unseen data. Let's begin by exploring how to generate interaction features.