Alright, let's get our hands dirty! Theory is helpful, but applying these data preparation techniques makes them stick. In this section, we'll walk through a practical example of cleaning up a small, messy dataset to get it ready for a machine learning algorithm.
Imagine we have a dataset representing information about used cars, and we want to predict their price. Real-world data often looks less than perfect, much like this sample:
Initial Car Dataset
Mileage (km) | Age (years) | Engine Type | Condition | Price ($) |
---|---|---|---|---|
50000 | 3 | Petrol | Good | 15000 |
120000 | 8 | Diesel | Fair | 8000 |
80000 | 5 | Petrol | Good | 11000 |
NaN | 10 | Diesel | Excellent | 9500 |
65000 | 4 | NaN | Good | 13000 |
150000 | 12 | Petrol | Fair | NaN |
95000 | 6 | Hybrid | Excellent | 18000 |
We can immediately spot a few issues algorithms won't like:

- **Missing values:** `NaN` (Not a Number) entries in the `Mileage`, `Engine Type`, and `Price` columns.
- **Mixed data types:** numerical data (`Mileage`, `Age`, `Price`) alongside categorical data (`Engine Type`, `Condition`). Algorithms typically require all input features to be numerical.
- **Different scales:** `Mileage` has values in the tens or hundreds of thousands, while `Age` is in single or double digits. This difference in scale can sometimes cause problems for certain algorithms.

Let's tackle these step-by-step.
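Throughout this section, the short code sketches assume the table above lives in a small pandas DataFrame with simplified column names. This is just one hypothetical way to recreate it and confirm the three issues programmatically:

```python
import numpy as np
import pandas as pd

# Hypothetical recreation of the car dataset shown above (simplified column names)
df = pd.DataFrame({
    "Mileage": [50000, 120000, 80000, np.nan, 65000, 150000, 95000],
    "Age": [3, 8, 5, 10, 4, 12, 6],
    "Engine Type": ["Petrol", "Diesel", "Petrol", "Diesel", np.nan, "Petrol", "Hybrid"],
    "Condition": ["Good", "Fair", "Good", "Excellent", "Good", "Fair", "Excellent"],
    "Price": [15000, 8000, 11000, 9500, 13000, np.nan, 18000],
})

print(df.isna().sum())   # NaN counts per column (missing values)
print(df.dtypes)         # numeric columns vs. object (categorical) columns
print(df.describe())     # very different ranges for Mileage and Age
```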
First, we need to deal with the `NaN` entries. As discussed earlier, common strategies include removing rows/columns or imputation (filling in missing values). Since our dataset is small, removing rows might discard too much information. Let's try imputation.
**Numerical features (`Mileage`, `Price`):** We can fill missing numerical values with the mean or median of the column. The median is often preferred if the data might have outliers (extreme values) that could skew the mean. Calculating the medians from the non-missing values gives 87,500 for `Mileage` and 12,000 for `Price`, so we fill the missing `Mileage` with 87,500 and the missing `Price` with 12,000.

**Categorical features (`Engine Type`):** For categorical data, we can fill missing values with the mode (the most frequent value). 'Petrol' appears most often (three of the six known values), so we fill the missing `Engine Type` with 'Petrol'.

Our dataset after handling missing values looks like this:
Dataset After Imputation
Mileage (km) | Age (years) | Engine Type | Condition | Price ($) |
---|---|---|---|---|
50000 | 3 | Petrol | Good | 15000 |
120000 | 8 | Diesel | Fair | 8000 |
80000 | 5 | Petrol | Good | 11000 |
87500 | 10 | Diesel | Excellent | 9500 |
65000 | 4 | Petrol | Good | 13000 |
150000 | 12 | Petrol | Fair | 12000 |
95000 | 6 | Hybrid | Excellent | 18000 |
Note: The `Price` column is often our target variable (what we want to predict). Depending on the situation, if the target value is missing, we might choose to simply remove that row entirely, as imputing the target could distort the model's learning. For this example, we've imputed it to demonstrate the technique, but keep this consideration in mind.
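A minimal pandas sketch of this imputation step, using the hypothetical `df` from earlier:

```python
# Median imputation for the numerical columns
mileage_median = df["Mileage"].median()     # 87500 for this sample
price_median = df["Price"].median()         # 12000 for this sample
df["Mileage"] = df["Mileage"].fillna(mileage_median)
df["Price"] = df["Price"].fillna(price_median)

# Mode imputation for the categorical column
engine_mode = df["Engine Type"].mode()[0]   # 'Petrol', the most frequent value
df["Engine Type"] = df["Engine Type"].fillna(engine_mode)
```

In a larger project you might use scikit-learn's `SimpleImputer` instead, fitting it on the training data only so that statistics from the test set don't leak into training.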
Machine learning algorithms work with numbers. We need to convert our categorical features (`Engine Type`, `Condition`) into a numerical format. One common method is One-Hot Encoding.
- **Engine Type:** This has categories 'Petrol', 'Diesel', and 'Hybrid'. One-Hot Encoding creates a new binary (0 or 1) column for each category.
- **Condition:** This has categories 'Good', 'Fair', and 'Excellent'. Again, we create a new binary column for each. Note that `Condition` might have an inherent order (e.g., Fair < Good < Excellent). While One-Hot Encoding ignores this order, other encoding methods (like Ordinal Encoding) could preserve it if deemed important for the model. For simplicity here, we'll use One-Hot Encoding.
Applying One-Hot Encoding transforms the dataset structure:
Dataset After One-Hot Encoding (Conceptual)
Mileage (km) | Age (years) | Engine_Petrol | Engine_Diesel | Engine_Hybrid | Cond_Good | Cond_Fair | Cond_Excellent | Price ($) |
---|---|---|---|---|---|---|---|---|
50000 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 15000 |
120000 | 8 | 0 | 1 | 0 | 0 | 1 | 0 | 8000 |
80000 | 5 | 1 | 0 | 0 | 1 | 0 | 0 | 11000 |
87500 | 10 | 0 | 1 | 0 | 0 | 0 | 1 | 9500 |
65000 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 13000 |
150000 | 12 | 1 | 0 | 0 | 0 | 1 | 0 | 12000 |
95000 | 6 | 0 | 0 | 1 | 0 | 0 | 1 | 18000 |
Now, all our feature columns are numerical.
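One way to perform this transformation is pandas' `get_dummies`, again using the hypothetical `df` from the earlier sketches. The generated column names follow a `column_value` pattern rather than the shorter names in the conceptual table above:

```python
# One-Hot Encode the two categorical columns; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(df, columns=["Engine Type", "Condition"], dtype=int)

print(encoded.columns.tolist())
# e.g. ['Mileage', 'Age', 'Price',
#       'Engine Type_Diesel', 'Engine Type_Hybrid', 'Engine Type_Petrol',
#       'Condition_Excellent', 'Condition_Fair', 'Condition_Good']
```

If you wanted to preserve the order of `Condition` instead, scikit-learn's `OrdinalEncoder` with an explicit category order such as `['Fair', 'Good', 'Excellent']` would map it to 0, 1, 2.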
Look at the `Mileage` and `Age` columns. `Mileage` ranges from 50,000 to 150,000, while `Age` ranges from 3 to 12. Some algorithms are sensitive to such large differences in scale. Let's apply scaling to the numerical features (`Mileage`, `Age`). We'll use Normalization (Min-Max Scaling), which scales data to a range between 0 and 1 using the formula:

$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
Let's apply this to `Mileage` and `Age`. For example, a `Mileage` of 120,000 becomes (120,000 - 50,000) / (150,000 - 50,000) = 0.7, and an `Age` of 8 becomes (8 - 3) / (12 - 3) ≈ 0.56.
The `Price` column is our target variable. Whether you scale the target variable depends on the specific model and context. Often, it's left unscaled, and we will leave it unscaled here. The one-hot encoded features are already binary (0 or 1), so they don't typically require scaling.
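A short sketch of this scaling step, using scikit-learn's `MinMaxScaler` on the hypothetical `encoded` DataFrame (it performs the same arithmetic as the formula above, so you could equally compute it by hand with pandas):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # rescales each column to the [0, 1] range
encoded[["Mileage", "Age"]] = scaler.fit_transform(encoded[["Mileage", "Age"]])

print(encoded[["Mileage", "Age"]].round(2))
```

As with imputation, in a real project the scaler should be fit on the training data only and then applied to the test data, so no information about the test set leaks into training.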
Our final, preprocessed feature set (excluding the target `Price`) looks conceptually like this:
Dataset After Scaling (Features Only)
Mileage (Normalized) | Age (Normalized) | Engine_Petrol | Engine_Diesel | Engine_Hybrid | Cond_Good | Cond_Fair | Cond_Excellent |
---|---|---|---|---|---|---|---|
0.0 | 0.0 | 1 | 0 | 0 | 1 | 0 | 0 |
0.7 | 0.56 | 0 | 1 | 0 | 0 | 1 | 0 |
0.3 | 0.22 | 1 | 0 | 0 | 1 | 0 | 0 |
0.375 | 0.78 | 0 | 1 | 0 | 0 | 0 | 1 |
0.15 | 0.11 | 1 | 0 | 0 | 1 | 0 | 0 |
1.0 | 1.0 | 1 | 0 | 0 | 0 | 1 | 0 |
0.45 | 0.33 | 0 | 0 | 1 | 0 | 0 | 1 |
Let's visualize the effect of scaling on `Mileage` and `Age`.
Comparison of 'Mileage' vs 'Age' before (left, blue) and after (right, orange) Normalization. Notice how the scaled features now occupy a space between 0 and 1 on both axes.
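The plot itself isn't reproduced here, but a rough matplotlib sketch of such a before/after comparison (using the `df` and `encoded` DataFrames from the earlier sketches) might look like this:

```python
import matplotlib.pyplot as plt

fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(10, 4))

# Original scales: Mileage in the tens of thousands, Age in single or double digits
ax_before.scatter(df["Mileage"], df["Age"], color="tab:blue")
ax_before.set_title("Before scaling")
ax_before.set_xlabel("Mileage (km)")
ax_before.set_ylabel("Age (years)")

# After Min-Max scaling, both features lie between 0 and 1
ax_after.scatter(encoded["Mileage"], encoded["Age"], color="tab:orange")
ax_after.set_title("After Normalization")
ax_after.set_xlabel("Mileage (normalized)")
ax_after.set_ylabel("Age (normalized)")

plt.tight_layout()
plt.show()
```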
Remember from the previous section and Chapter 2 that the final step before training is typically splitting the data into a training set and a testing set. We would take our fully cleaned and prepared data (the features table above and the corresponding `Price` column) and divide it, usually reserving about 70-80% for training the model and 20-30% for testing its performance on unseen data.
We won't perform the split here, but it's the immediate next action before feeding the data into an algorithm like Linear Regression or KNN.
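As a sketch, scikit-learn's `train_test_split` handles this in one call; here the features and target are assumed to come from the hypothetical `encoded` DataFrame built above:

```python
from sklearn.model_selection import train_test_split

X = encoded.drop(columns=["Price"])  # all prepared feature columns
y = encoded["Price"]                 # target variable

# Reserve 30% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```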
In this hands-on walkthrough, we took a raw dataset with common issues and applied basic cleaning steps:

- Handling missing values with median (numerical) and mode (categorical) imputation.
- Converting categorical features into numbers with One-Hot Encoding.
- Scaling the numerical features to a common 0 to 1 range with Normalization (Min-Max Scaling).
This processed data is now in a much better format for most machine learning algorithms. While these are basic techniques, they form the foundation of data preparation, a significant part of any machine learning project.