This practical example walks through cleaning up a small, messy dataset and preparing it for a machine learning algorithm. Imagine we have a dataset describing used cars, and we want to predict their price. Real data often looks less than perfect, much like this sample:

**Initial Car Dataset**

| Mileage (km) | Age (years) | Engine Type | Condition | Price ($) |
|---|---|---|---|---|
| 50000 | 3 | Petrol | Good | 15000 |
| 120000 | 8 | Diesel | Fair | 8000 |
| 80000 | 5 | Petrol | Good | 11000 |
| NaN | 10 | Diesel | Excellent | 9500 |
| 65000 | 4 | NaN | Good | 13000 |
| 150000 | 12 | Petrol | Fair | NaN |
| 95000 | 6 | Hybrid | Excellent | 18000 |

We can immediately spot a few issues algorithms won't like:

- **Missing Values:** There are NaN (Not a Number) entries in the Mileage, Engine Type, and Price columns.
- **Mixed Data Types:** We have numerical data (Mileage, Age, Price) and categorical data (Engine Type, Condition). Algorithms typically require all input features to be numerical.
- **Different Scales:** Mileage has values in the tens or hundreds of thousands, while Age is in single or double digits. This difference in scale can cause problems for certain algorithms.

Let's tackle these step by step.

### Step 1: Handling Missing Values

First, we need to deal with the NaN entries. As discussed earlier, common strategies include removing rows/columns or imputation (filling in missing values). Since our dataset is small, removing rows would discard too much information, so let's try imputation.

**Numerical features (Mileage, Price):** We can fill missing numerical values with the mean or median of the column. The median is often preferred if the data might contain outliers (extreme values) that would skew the mean. Let's calculate the median for Mileage and Price from the non-missing values:

- Existing Mileages: 50000, 120000, 80000, 65000, 150000, 95000. Sorted: 50000, 65000, 80000, 95000, 120000, 150000. Median (average of the middle two) = (80000 + 95000) / 2 = 87500.
- Existing Prices: 15000, 8000, 11000, 9500, 13000, 18000. Sorted: 8000, 9500, 11000, 13000, 15000, 18000. Median = (11000 + 13000) / 2 = 12000.

We'll fill the missing Mileage with 87500 and the missing Price with 12000.

**Categorical features (Engine Type):** For categorical data, we can fill missing values with the mode (the most frequent value).

- Existing Engine Types: Petrol, Diesel, Petrol, Diesel, Petrol, Hybrid.
- Counts: Petrol (3), Diesel (2), Hybrid (1). The mode is 'Petrol'.

We'll fill the missing Engine Type with 'Petrol'. Our dataset after handling missing values looks like this:

**Dataset After Imputation**

| Mileage (km) | Age (years) | Engine Type | Condition | Price ($) |
|---|---|---|---|---|
| 50000 | 3 | Petrol | Good | 15000 |
| 120000 | 8 | Diesel | Fair | 8000 |
| 80000 | 5 | Petrol | Good | 11000 |
| 87500 | 10 | Diesel | Excellent | 9500 |
| 65000 | 4 | Petrol | Good | 13000 |
| 150000 | 12 | Petrol | Fair | 12000 |
| 95000 | 6 | Hybrid | Excellent | 18000 |

Note: The Price column is our target variable (what we want to predict). When the target value is missing, we would often simply remove that row, because imputing the target could distort what the model learns. For this example we've imputed it to demonstrate the technique, but keep this consideration in mind.
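As a concrete illustration, here is a minimal sketch of how this imputation could be done with pandas (the `cars` DataFrame and its column names are hypothetical stand-ins for the table above):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame mirroring the "Initial Car Dataset" table
cars = pd.DataFrame({
    "Mileage":    [50000, 120000, 80000, np.nan, 65000, 150000, 95000],
    "Age":        [3, 8, 5, 10, 4, 12, 6],
    "EngineType": ["Petrol", "Diesel", "Petrol", "Diesel", np.nan, "Petrol", "Hybrid"],
    "Condition":  ["Good", "Fair", "Good", "Excellent", "Good", "Fair", "Excellent"],
    "Price":      [15000, 8000, 11000, 9500, 13000, np.nan, 18000],
})

# Numerical columns: fill NaN with the column median (87500 for Mileage, 12000 for Price)
for col in ["Mileage", "Price"]:
    cars[col] = cars[col].fillna(cars[col].median())

# Categorical column: fill NaN with the mode, i.e. the most frequent value ('Petrol')
cars["EngineType"] = cars["EngineType"].fillna(cars["EngineType"].mode()[0])

print(cars)  # matches the "Dataset After Imputation" table
```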
### Step 2: Encoding Categorical Features

Machine learning algorithms work with numbers, so we need to convert our categorical features (Engine Type, Condition) into a numerical format. One common method is One-Hot Encoding, which creates a new binary (0 or 1) column for each category.

- **Engine Type:** has the categories 'Petrol', 'Diesel', and 'Hybrid', so it becomes three binary columns.
- **Condition:** has the categories 'Good', 'Fair', and 'Excellent', which likewise become three binary columns. Note that 'Condition' has an inherent order (Fair < Good < Excellent). One-Hot Encoding ignores this order; other encoding methods (like Ordinal Encoding) could preserve it if that matters for the model. For simplicity, we'll use One-Hot Encoding here.

Applying One-Hot Encoding transforms the dataset structure:

**Dataset After One-Hot Encoding**

| Mileage (km) | Age (years) | Engine_Petrol | Engine_Diesel | Engine_Hybrid | Cond_Good | Cond_Fair | Cond_Excellent | Price ($) |
|---|---|---|---|---|---|---|---|---|
| 50000 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 15000 |
| 120000 | 8 | 0 | 1 | 0 | 0 | 1 | 0 | 8000 |
| 80000 | 5 | 1 | 0 | 0 | 1 | 0 | 0 | 11000 |
| 87500 | 10 | 0 | 1 | 0 | 0 | 0 | 1 | 9500 |
| 65000 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 13000 |
| 150000 | 12 | 1 | 0 | 0 | 0 | 1 | 0 | 12000 |
| 95000 | 6 | 0 | 0 | 1 | 0 | 0 | 1 | 18000 |

Now all of our feature columns are numerical.

### Step 3: Feature Scaling

Look at the Mileage and Age columns. Mileage ranges from 50,000 to 150,000, while Age ranges from 3 to 12. Some algorithms are sensitive to such large differences in scale, so let's scale the numerical features (Mileage, Age). We'll use Normalization (Min-Max Scaling), which maps each value into the range 0 to 1 using the formula:

$$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $$

Applying this to Mileage and Age:

**Mileage:** Min = 50000, Max = 150000, Range = 100000.

- 50000 -> (50000 - 50000) / 100000 = 0.0
- 120000 -> (120000 - 50000) / 100000 = 0.7
- 80000 -> (80000 - 50000) / 100000 = 0.3
- 87500 -> (87500 - 50000) / 100000 = 0.375
- 65000 -> (65000 - 50000) / 100000 = 0.15
- 150000 -> (150000 - 50000) / 100000 = 1.0
- 95000 -> (95000 - 50000) / 100000 = 0.45

**Age:** Min = 3, Max = 12, Range = 9.

- 3 -> (3 - 3) / 9 = 0.0
- 8 -> (8 - 3) / 9 ≈ 0.56
- 5 -> (5 - 3) / 9 ≈ 0.22
- 10 -> (10 - 3) / 9 ≈ 0.78
- 4 -> (4 - 3) / 9 ≈ 0.11
- 12 -> (12 - 3) / 9 = 1.0
- 6 -> (6 - 3) / 9 ≈ 0.33

The Price column is our target variable. Whether you scale the target depends on the specific model and context; often it is left unscaled, and we will leave it unscaled here. The one-hot encoded features are already binary (0 or 1), so they don't typically require scaling.

Our final, preprocessed feature set (excluding the target Price) looks like this:

**Dataset After Scaling (Features Only)**

| Mileage (Normalized) | Age (Normalized) | Engine_Petrol | Engine_Diesel | Engine_Hybrid | Cond_Good | Cond_Fair | Cond_Excellent |
|---|---|---|---|---|---|---|---|
| 0.0 | 0.0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 0.7 | 0.56 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0.3 | 0.22 | 1 | 0 | 0 | 1 | 0 | 0 |
| 0.375 | 0.78 | 0 | 1 | 0 | 0 | 0 | 1 |
| 0.15 | 0.11 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1.0 | 1.0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 0.45 | 0.33 | 0 | 0 | 1 | 0 | 0 | 1 |

Let's visualize the effect of scaling on Mileage and Age.

*Figure: Effect of Normalization (Min-Max Scaling). Comparison of Mileage vs Age before (left, blue) and after (right, orange) Normalization. Notice how the scaled features now occupy a space between 0 and 1 on both axes.*
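Continuing the sketch from Step 1, the encoding and scaling steps might look roughly like this with pandas and scikit-learn (again using the hypothetical `cars` DataFrame; `pd.get_dummies` produces one binary column per category, analogous to the tables above):

```python
from sklearn.preprocessing import MinMaxScaler

# Step 2: One-Hot Encoding of the categorical columns
# (creates columns such as EngineType_Petrol, Condition_Good, ... with 0/1 values)
cars_encoded = pd.get_dummies(cars, columns=["EngineType", "Condition"], dtype=int)

# Step 3: Min-Max scaling of the numerical features, x' = (x - min) / (max - min)
scaler = MinMaxScaler()
cars_encoded[["Mileage", "Age"]] = scaler.fit_transform(cars_encoded[["Mileage", "Age"]])

# Separate the (unscaled) target from the feature matrix
features = cars_encoded.drop(columns=["Price"])
target = cars_encoded["Price"]
```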
### Step 4: Splitting Data (The Next Step)

Remember from the previous section and Chapter 2 that the final step before training is typically splitting the data into a training set and a testing set. We would take our fully cleaned and prepared data (the features table above and the corresponding Price column) and divide it, usually reserving about 70-80% of the rows for training the model and 20-30% for testing its performance on unseen data. We won't perform the split here, but it's the immediate next action before feeding the data into an algorithm like Linear Regression or KNN (a short code sketch appears after the summary).

### Summary

In this hands-on walkthrough, we took a raw dataset with common issues and applied basic cleaning steps:

- **Handled Missing Values:** Used median imputation for numerical features and mode imputation for categorical features.
- **Encoded Categorical Data:** Transformed 'Engine Type' and 'Condition' into numerical representations using One-Hot Encoding.
- **Scaled Numerical Features:** Applied Normalization (Min-Max Scaling) to 'Mileage' and 'Age' to bring them into a common range (0 to 1).

This processed data is now in a much better format for most machine learning algorithms. These are basic techniques, but they form the foundation of data preparation, which is a significant part of any machine learning project.
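For completeness, the train/test split described in Step 4 might look like this minimal sketch (using scikit-learn's `train_test_split` on the hypothetical `features` and `target` from the earlier snippets; the 80/20 ratio and `random_state` value are arbitrary choices):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # e.g. (5, 8) and (2, 8) for our 7-row example
```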