As mentioned in the chapter introduction, feature engineering sits critically within the machine learning workflow. But what exactly is a feature in this context?
At its core, a feature is an individual measurable property or characteristic of the phenomenon being observed. Think of it as an input variable used by a machine learning model to make predictions or decisions. In the structured datasets typically used for machine learning (like tables or spreadsheets), features often correspond to the columns.
Consider a simple example: predicting house prices. The raw data you collect might include various pieces of information about each house. Some of this raw data might be directly usable as features, while other features might need to be created or transformed. For instance, knowing the year a house was built lets you derive an `age` feature from the `year_built` data. Similarly, having the lot size and the house square footage might allow you to create a `yard_size` feature (lot size − house square footage) or a `house_ratio` feature (house square footage / lot size).

The key idea is that features are informative representations of raw data tailored for the learning algorithm. Raw data is often messy, contains irrelevant information, or isn't in a format suitable for algorithms (e.g., text addresses, raw timestamps). Features are the refined numerical or categorical inputs derived from this raw data.
Here's a small table illustrating the distinction:
| Raw Data Point | Potential Feature(s) Derived | Feature Type |
|---|---|---|
| `transaction_timestamp` | `hour_of_day`, `day_of_week`, `is_weekend` (binary) | Numerical, Numerical, Categorical |
| `customer_address` | `zip_code`, `distance_to_store` (miles) | Categorical, Numerical |
| `product_description` (text) | `keyword_count`, `sentiment_score` | Numerical, Numerical |
| `sale_price`, `cost_price` | `profit_margin` ((sale_price − cost_price) / sale_price) | Numerical |
| `number_of_bedrooms` | `number_of_bedrooms` | Numerical |
| `color` (e.g., "Red", "Blue") | `color_encoded` (e.g., 0, 1 using encoding) | Categorical/Numerical |

Example transformations from raw data points into potential machine learning features.
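As one concrete case from the table, a raw timestamp can be decomposed into several numeric and binary features. The sketch below uses pandas datetime accessors; the sample timestamps are invented for illustration:

```python
import pandas as pd

# Hypothetical transaction timestamps (values chosen for illustration).
df = pd.DataFrame({
    "transaction_timestamp": pd.to_datetime([
        "2025-01-06 09:15:00",  # a Monday morning
        "2025-01-11 21:40:00",  # a Saturday evening
    ])
})

ts = df["transaction_timestamp"]
df["hour_of_day"] = ts.dt.hour                        # 0-23
df["day_of_week"] = ts.dt.dayofweek                   # Monday=0 .. Sunday=6
df["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int) # binary flag

print(df[["hour_of_day", "day_of_week", "is_weekend"]])
```

The raw timestamp itself is rarely useful to a model, but these derived pieces let it learn patterns tied to time of day or weekday/weekend behavior.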
The goal of feature engineering, which we will explore throughout this course, is to construct the most effective set of features from the available data. This involves selecting relevant information, transforming it into suitable formats (like converting categories to numbers), handling missing values, and sometimes creating entirely new features that capture underlying patterns more effectively than the raw data alone. The quality of these features is often more important for model performance than the choice of the model algorithm itself. Understanding what constitutes a good feature is the first step in this process.
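Two of the steps just mentioned, handling missing values and converting categories to numbers, can be sketched as follows. The column names, sample values, and the choice of median imputation and one-hot encoding are illustrative assumptions; later chapters cover the alternatives:

```python
import pandas as pd

# Hypothetical product data with a missing value and a text category.
df = pd.DataFrame({
    "color":     ["Red", "Blue", "Red", "Green"],
    "weight_kg": [1.2, None, 0.8, 1.5],
})

# Fill the missing weight with the column median (one common imputation strategy).
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].median())

# One-hot encode the color category into numeric indicator columns
# (color_Red, color_Blue, color_Green).
df = pd.get_dummies(df, columns=["color"], dtype=int)

print(df)
```

After these steps every column is numeric, which is the form most learning algorithms expect as input.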
© 2025 ApX Machine Learning