Implementing several imputation methods with Pandas and Scikit-learn on a sample dataset gives hands-on practice with Python's data science stack. Applying these methods is a fundamental part of preparing data for machine learning models.

First, let's set up our environment by importing the necessary libraries and creating a sample DataFrame with missing values.

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer, MissingIndicator
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample DataFrame with missing values
data = {
    'Age': [25, 30, np.nan, 35, 40, 45, 50, np.nan, 55],
    'Salary': [50000, 60000, 75000, np.nan, 80000, 90000, 110000, 65000, np.nan],
    'Experience': [1, 5, 3, 10, 15, 20, np.nan, 8, 30],
    'Department': ['HR', 'IT', 'Finance', 'IT', np.nan, 'HR', 'Finance', 'IT', 'Finance'],
    'Rating': [3.5, 4.0, 4.5, 3.0, np.nan, 4.2, 3.8, 4.8, 3.9]
}
df = pd.DataFrame(data)

print("Original DataFrame with Missing Values:")
print(df)
print("\nMissing values per column:")
print(df.isnull().sum())
```

Our sample df contains missing values (np.nan) in numerical (Age, Salary, Experience, Rating) and categorical (Department) columns.

Simple Imputation Strategies

As discussed earlier, simple imputation involves replacing missing values using basic statistical measures. Scikit-learn's SimpleImputer is a convenient tool for this.

Mean and Median Imputation (Numerical Features)

We typically use the mean for approximately normally distributed data and the median for skewed data or data with outliers.
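
To make the mean-versus-median guidance concrete, here is a small sketch on a made-up list of salaries (the values and the name salaries are illustrative and are not part of the sample DataFrame). It shows how one extreme value pulls the mean much further than the median, which changes what gets filled in.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: one extreme value and one missing entry
salaries = pd.Series([50_000, 55_000, 60_000, 65_000, 1_000_000, np.nan])

# Both statistics skip NaN by default (skipna=True)
mean_fill = salaries.mean()
median_fill = salaries.median()

print(f"mean fill value:   {mean_fill:,.0f}")
print(f"median fill value: {median_fill:,.0f}")

# Fill the gap with each statistic to compare the results
filled_with_mean = salaries.fillna(mean_fill)
filled_with_median = salaries.fillna(median_fill)
```

With these numbers the mean is 246,000 while the median is 60,000, so a mean fill would place the missing person far above four of the five observed salaries.
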
Let's apply both strategies to different columns for demonstration.

```python
# Impute 'Age' with mean
mean_imputer = SimpleImputer(strategy='mean')
# Double brackets select a 2D DataFrame, which SimpleImputer expects;
# ravel() flattens the (n, 1) result back into a single column
df['Age_mean_imputed'] = mean_imputer.fit_transform(df[['Age']]).ravel()

# Impute 'Salary' with median (often better for salary data)
median_imputer = SimpleImputer(strategy='median')
df['Salary_median_imputed'] = median_imputer.fit_transform(df[['Salary']]).ravel()

# Impute 'Experience' and 'Rating' together with median
num_cols_median = ['Experience', 'Rating']
median_imputer_multi = SimpleImputer(strategy='median')
# Fit on the original columns
median_imputer_multi.fit(df[num_cols_median])
# Transform and create new columns
df[['Experience_median_imputed', 'Rating_median_imputed']] = (
    median_imputer_multi.transform(df[num_cols_median])
)

print("\nDataFrame after Mean/Median Imputation:")
print(df[['Age', 'Age_mean_imputed',
          'Salary', 'Salary_median_imputed',
          'Experience', 'Experience_median_imputed',
          'Rating', 'Rating_median_imputed']].head())
```

Observe how NaN values in the original columns are replaced by the calculated mean or median in the corresponding new columns.

Mode Imputation (Categorical Features)

For categorical features like Department, the most frequent value (mode) is commonly used for imputation.

```python
# Impute 'Department' with mode
mode_imputer = SimpleImputer(strategy='most_frequent')
df['Department_mode_imputed'] = mode_imputer.fit_transform(df[['Department']]).ravel()

print("\nDataFrame after Mode Imputation:")
print(df[['Department', 'Department_mode_imputed']].head(6))  # Show row with original NaN
```

The missing department is filled with the most common department found in the column. In this sample, IT and Finance are tied at three occurrences each; when several values tie, SimpleImputer returns the smallest one, so the gap is filled with Finance.

Creating Missing Value Indicators

Sometimes, the fact that a value was missing is informative in itself. We can capture this using indicator features.
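
Conceptually, an indicator feature is just a 0/1 flag recording where a value was absent. The following is a minimal pandas-only sketch of that idea on a small hypothetical frame (the name toy and the column names Salary_was_missing and Salary_filled are illustrative assumptions, not part of the sample dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing salaries
toy = pd.DataFrame({"Salary": [50_000, np.nan, 80_000, np.nan]})

# Record missingness BEFORE imputing; afterwards the information is gone
toy["Salary_was_missing"] = toy["Salary"].isna().astype(int)

# Fill the gaps with the median of the observed values
toy["Salary_filled"] = toy["Salary"].fillna(toy["Salary"].median())

print(toy)
```

The flag column lets a downstream model treat imputed rows differently from observed ones, even though the filled column no longer contains any NaN.
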
SimpleImputer can add these indicators automatically, or we can use MissingIndicator directly.

```python
# Using SimpleImputer with add_indicator=True
median_imputer_indicator = SimpleImputer(strategy='median', add_indicator=True)
imputed_with_indicator = median_imputer_indicator.fit_transform(df[['Salary']])  # Use original Salary

# The output is a NumPy array: column 0 is the imputed data, column 1 is the indicator
df['Salary_median_imputed_si'] = imputed_with_indicator[:, 0]
# The indicator comes back as 0.0/1.0 in the float array, so convert it to int
df['Salary_missing_indicator_si'] = imputed_with_indicator[:, 1].astype(int)

# Using MissingIndicator directly
original_cols = ['Age', 'Salary', 'Experience', 'Department', 'Rating']
indicator = MissingIndicator(features='all')  # Return an indicator for every input column
missing_indicators = indicator.fit_transform(df[original_cols])

# Convert to DataFrame for clarity; name the columns after the inputs, in order
indicator_df = pd.DataFrame(missing_indicators,
                            columns=[f'{col}_missing' for col in original_cols],
                            index=df.index)

# Combine with original df (optional, for viewing)
df_with_indicators = pd.concat([df, indicator_df], axis=1)

print("\nDataFrame with Salary Imputation and Indicator (from SimpleImputer):")
print(df[['Salary', 'Salary_median_imputed_si', 'Salary_missing_indicator_si']].head())

print("\nDataFrame showing all generated Missing Indicators (from MissingIndicator):")
print(df_with_indicators[['Age', 'Age_missing',
                          'Salary', 'Salary_missing',
                          'Experience', 'Experience_missing',
                          'Department', 'Department_missing',
                          'Rating', 'Rating_missing']].head(6))
```

These binary indicator columns explicitly signal where data was originally missing, which might be useful for certain models.

Multivariate Imputation Techniques

Multivariate methods use information from other features to estimate missing values, potentially leading to more accurate imputations than simple strategies.

KNN Imputer

KNNImputer fills each missing value with the average of that feature over the $k$ nearest neighbors found in the training set. Neighbors are identified based on the features that are not missing.
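
As a worked illustration of this neighbor-averaging idea, the sketch below uses a tiny made-up array (X_toy is a hypothetical name). It prints the NaN-aware Euclidean distances, which are the default metric for KNNImputer, and then the imputed result with two neighbors.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Hypothetical data: rows are samples, columns are features
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Distances use only the coordinates present in both rows,
# rescaled to account for the coordinates that were skipped
distances = nan_euclidean_distances(X_toy)
print(np.round(distances, 2))

# Each missing entry becomes the mean of that feature over the
# 2 nearest rows that actually have a value for it
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_toy)
print(X_filled)
```

In this example the missing entry in the first row becomes 4.0, the mean of 3.0 and 5.0 from its two nearest rows, and the missing entry in the third row becomes 5.5, the mean of 3.0 and 8.0.
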
KNNImputer requires all features used for imputation to be numerical. We would first need to encode the Department column (e.g., using one-hot encoding, covered in the next chapter) or exclude it. For simplicity here, let's impute only the numerical features together.

```python
from sklearn.preprocessing import MinMaxScaler

# KNNImputer is sensitive to feature scaling, so scale first
# (MinMaxScaler ignores NaN when fitting and keeps it when transforming)
numerical_cols = ['Age', 'Salary', 'Experience', 'Rating']
df_numerical = df[numerical_cols].copy()
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_numerical),
                         columns=numerical_cols, index=df.index)

# Apply KNNImputer
knn_imputer = KNNImputer(n_neighbors=3)  # Use 3 neighbors
df_knn_imputed_scaled = pd.DataFrame(knn_imputer.fit_transform(df_scaled),
                                     columns=numerical_cols, index=df.index)

# Inverse transform to get data back to original scale
df_knn_imputed = pd.DataFrame(scaler.inverse_transform(df_knn_imputed_scaled),
                              columns=numerical_cols, index=df.index)

# Add imputed columns back to original df for comparison (optional)
for col in numerical_cols:
    df[f'{col}_knn_imputed'] = df_knn_imputed[col]

print("\nDataFrame after KNN Imputation (showing original and imputed side-by-side):")
# Display rows where original data was missing to see the imputed values
missing_rows_idx = df[df[numerical_cols].isnull().any(axis=1)].index
print(df.loc[missing_rows_idx, ['Age', 'Age_knn_imputed',
                                'Salary', 'Salary_knn_imputed',
                                'Experience', 'Experience_knn_imputed',
                                'Rating', 'Rating_knn_imputed']])
```

Note that KNN imputation requires careful consideration of the number of neighbors ($k$) and the distance metric used. Scaling features beforehand is generally recommended.

Iterative Imputer

IterativeImputer models each feature with missing values as a function of other features and uses an iterative approach to estimate the missing values.
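
To give a feel for what modeling each feature as a function of the others can look like, here is a deliberately simplified, hand-rolled sketch. It is not scikit-learn's implementation: the function name round_robin_impute, the use of LinearRegression, the fixed number of rounds, and the toy array are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def round_robin_impute(X, n_rounds=10):
    """Illustrative round-robin regression imputation (not sklearn's algorithm)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: replace every gap with its column mean
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to estimate in this column
            others = np.delete(filled, j, axis=1)  # all other features
            observed = ~missing[:, j]
            # Learn column j from the other columns on rows where j was observed
            model = LinearRegression().fit(others[observed], filled[observed, j])
            # Overwrite only the originally missing entries with predictions
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
    return filled


# Hypothetical data where the second column is roughly twice the first
X_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])
print(np.round(round_robin_impute(X_toy), 2))
```

Observed values are never changed; only the originally missing cells are re-estimated on each pass, using the current best guesses for the other columns.
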
The imputer cycles through predicting missing values for each feature based on all the others until the estimates stabilize or the iteration limit (max_iter) is reached.

```python
# IterativeImputer also usually works better with scaled data
# We can reuse the scaled data from the KNN example
iterative_imputer = IterativeImputer(max_iter=10, random_state=0)  # max_iter caps the number of rounds
df_iterative_imputed_scaled = pd.DataFrame(iterative_imputer.fit_transform(df_scaled),
                                           columns=numerical_cols, index=df.index)

# Inverse transform
df_iterative_imputed = pd.DataFrame(scaler.inverse_transform(df_iterative_imputed_scaled),
                                    columns=numerical_cols, index=df.index)

# Add imputed columns back to original df
for col in numerical_cols:
    df[f'{col}_iterative_imputed'] = df_iterative_imputed[col]

print("\nDataFrame after Iterative Imputation (showing original and imputed side-by-side):")
print(df.loc[missing_rows_idx, ['Age', 'Age_iterative_imputed',
                                'Salary', 'Salary_iterative_imputed',
                                'Experience', 'Experience_iterative_imputed',
                                'Rating', 'Rating_iterative_imputed']])
```

IterativeImputer can model richer relationships between features than KNNImputer, but it can also be more computationally intensive.

Comparing Imputation Methods

The choice of imputation method depends on the data characteristics, the mechanism of missingness (if known), and the specific requirements of the machine learning model.

Let's visualize the distribution of the 'Salary' feature before and after different imputations to see the impact.

```python
# Prepare data for plotting
salary_data = pd.DataFrame({
    'Original': df['Salary'],
    'Median Imputed': df['Salary_median_imputed'],
    'KNN Imputed': df['Salary_knn_imputed'],
    'Iterative Imputed': df['Salary_iterative_imputed']
})

# Melt the DataFrame for Seaborn plotting
salary_melted = salary_data.melt(var_name='Imputation Method', value_name='Salary')

# Create the plot (rows with NaN in 'Original' are dropped by the density estimate)
plt.figure(figsize=(12, 6))
sns.kdeplot(data=salary_melted, x='Salary', hue='Imputation Method',
            fill=True, common_norm=False, palette="viridis")
plt.title('Distribution of Salary After Different Imputation Methods')
plt.xlabel('Salary')
plt.ylabel('Density')
plt.show()
```

Figure: Salary Distribution Comparison. Comparison of Salary distributions using violin plots after different imputation methods. Original data includes nulls. Median imputation adds points at the median value. KNN and Iterative imputation provide potentially more detailed estimates based on other features.

Simple Imputation (Mean/Median/Mode): Fast and easy to implement. Doesn't use relationships between features. Can distort variance and correlations. The mean is sensitive to outliers; the median is more robust. The mode is suitable for categorical data.

Indicator Features: Retain information about missingness. Can be used alongside any imputation method.

KNN Imputer: Considers feature relationships.
More computationally expensive than simple methods. Sensitive to scaling and the choice of $k$. Requires numerical data.

Iterative Imputer: Often provides accurate imputations by modeling each feature from the others. Can handle different data types if the underlying estimator supports them (though the default BayesianRidge works on numerical data). Can be computationally intensive. Sensitive to scaling.

The best approach often involves experimentation and evaluating the impact on downstream model performance. Consider the trade-offs between imputation accuracy, computational cost, and the potential distortions introduced into your dataset. Remember to fit imputers only on the training data and use the fitted imputer to transform both training and testing datasets to prevent data leakage. This is often best managed using Scikit-learn Pipelines.
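
The leakage advice above can be sketched with a small Pipeline. The data here is synthetic and purely illustrative: the feature names, the coefficients used to generate the target, and the choice of LinearRegression are all assumptions. The point is only that calling fit on the training split learns the imputation statistics from training rows alone.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data: predict a salary-like target from two features
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Age": rng.integers(22, 60, size=40).astype(float),
    "Experience": rng.integers(0, 30, size=40).astype(float),
})
y = 30_000 + 1_500 * X["Experience"] + rng.normal(0, 5_000, size=40)

# Knock out 20% of the Age values to simulate missing data
X.loc[X.sample(frac=0.2, random_state=0).index, "Age"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("regress", LinearRegression()),
])

# fit() learns medians and scaling from the training rows only
model.fit(X_train, y_train)

# predict() reuses those training statistics on the unseen test rows
predictions = model.predict(X_test)

print("Medians learned from training data:", model.named_steps["impute"].statistics_)
```

Because the imputer lives inside the pipeline, the same object can be passed to cross-validation utilities, which refit it on each training fold so that no fold's statistics leak into its own validation data.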