Let's put the theory into practice. Knowing how to identify and handle missing data is fundamental, and the best way to learn is by doing. We'll use Python with the popular Pandas library, a standard tool in data science, to work through a small example dataset. If you haven't used Pandas before, don't worry. We'll guide you step-by-step.First, imagine we have collected some data about students, but some information is missing. Let's create this data in a Pandas DataFrame. We use numpy.nan to represent the missing values.import pandas as pd import numpy as np # Sample data with missing values data = { 'StudentID': [101, 102, 103, 104, 105, 106], 'Score': [85, np.nan, 70, 95, 88, 78], 'Grade': ['B', 'C', np.nan, 'A', 'B', np.nan], 'Attendance': [90, 80, 85, np.nan, 95, 88] } df = pd.DataFrame(data) print("Original DataFrame:") print(df)This code creates a DataFrame named df. Printing it shows us the data, including the NaN entries representing missing information.Detecting Missing ValuesAs discussed earlier, the first step is to find out where and how much data is missing. We can use the isnull() method combined with sum() in Pandas. isnull() returns a DataFrame of boolean values (True if missing, False otherwise), and sum() then counts the True values per column.# Count missing values in each column missing_counts = df.isnull().sum() print("\nMissing Value Counts:") print(missing_counts)The output will show: StudentID 0 Score 1 Grade 2 Attendance 1 dtype: int64This tells us we have one missing Score, two missing Grade entries, and one missing Attendance percentage. StudentID has no missing values.Visualizing Missing DataSometimes, a visual inspection helps understand the extent of missing data, especially in larger datasets. A simple bar chart showing the count of missing values per column can be effective.{"layout": {"title": "Count of Missing Values per Column", "xaxis": {"title": "Column Name"}, "yaxis": {"title": "Number of Missing Values"}, "template": "simple_white", "bargap": 0.2}, "data": [{"type": "bar", "x": ["Score", "Grade", "Attendance"], "y": [1, 2, 1], "marker": {"color": ["#339af0", "#748ffc", "#339af0"]}, "name": "Missing Count"}]}Bar chart displaying the number of missing entries found in the 'Score', 'Grade', and 'Attendance' columns.This visualization confirms our findings from isnull().sum(): 'Grade' has the most missing values in this small dataset.Strategy 1: Deleting Rows (Listwise Deletion)One way to handle missing data is to remove any row containing at least one missing value. This is called listwise deletion. Pandas provides the dropna() method for this.# Create a copy to avoid modifying the original DataFrame df_dropped_rows = df.copy() # Drop rows with any NaN values df_dropped_rows.dropna(inplace=True) # inplace=True modifies the DataFrame directly print("\nDataFrame after dropping rows with missing values:") print(df_dropped_rows) print(f"\nOriginal shape: {df.shape}") print(f"Shape after dropping rows: {df_dropped_rows.shape}")You'll notice that rows with index 1, 2, and 3 (corresponding to StudentIDs 102, 103, and 104) have been removed because they each had at least one NaN. The resulting DataFrame df_dropped_rows only contains complete records.Consideration: This approach is simple but can lead to significant data loss, especially if missing values are widespread. Losing three out of six students (50% of the data) is substantial. This is generally only advisable if the number of rows with missing data is very small compared to the total dataset size.Strategy 2: Deleting ColumnsIf a column has a very large proportion of missing values, it might provide little useful information and could be dropped entirely. We can use dropna() again, but specify axis=1 to target columns. We can also set a thresh (threshold) parameter, which specifies the minimum number of non-missing values required for a column to be kept.Let's say we decide to drop any column that has fewer than 5 non-missing values (our DataFrame has 6 rows total).# Create another copy df_dropped_cols = df.copy() # Drop columns with less than 5 non-missing values threshold = 5 df_dropped_cols.dropna(axis=1, thresh=threshold, inplace=True) print("\nDataFrame after dropping columns with many missing values:") print(df_dropped_cols) print(f"\nOriginal shape: {df.shape}") print(f"Shape after dropping columns: {df_dropped_cols.shape}")In our example, 'Grade' has only 4 non-missing values (6 total rows - 2 missing = 4). Since 4 is less than our threshold of 5, the 'Grade' column is removed. The 'Score' and 'Attendance' columns each have 5 non-missing values, meeting the threshold, so they are kept.Consideration: Removing columns means losing that feature entirely. Only do this if the column is deemed non-essential or if the proportion of missing data is so high that imputation would be unreliable.Strategy 3: Basic ImputationInstead of deleting data, we can fill the gaps. This is called imputation. Simple methods involve replacing missing values with the mean, median, or mode of the respective column.Mean/Median: Typically used for numerical data. The median is often preferred over the mean if the data has outliers, as it's less sensitive to extreme values.Mode: Typically used for categorical data (like 'Grade'). The mode is the most frequent value in the column.Let's impute the missing values in our original df.# Create a copy for imputation df_imputed = df.copy() # 1. Impute 'Score' (numerical) with the median median_score = df_imputed['Score'].median() print(f"\nMedian score for imputation: {median_score}") df_imputed['Score'].fillna(median_score, inplace=True) # 2. Impute 'Attendance' (numerical) with the mean mean_attendance = df_imputed['Attendance'].mean() print(f"Mean attendance for imputation: {mean_attendance:.2f}") # Format to 2 decimal places df_imputed['Attendance'].fillna(mean_attendance, inplace=True) # 3. Impute 'Grade' (categorical) with the mode # Note: mode() can return multiple values if they have the same frequency. We take the first one [0]. mode_grade = df_imputed['Grade'].mode()[0] print(f"Mode grade for imputation: {mode_grade}") df_imputed['Grade'].fillna(mode_grade, inplace=True) print("\nDataFrame after imputation:") print(df_imputed) # Verify that no missing values remain print("\nMissing values count after imputation:") print(df_imputed.isnull().sum())The output shows the calculated median score, mean attendance, and mode grade. The final imputed DataFrame has all NaN values replaced:The missing Score (for StudentID 102) is replaced by the median score (which is 86.5, calculated from [85, 70, 95, 88, 78]).The missing Attendance (for StudentID 104) is replaced by the mean attendance (which is 87.6, calculated from [90, 80, 85, 95, 88]).The two missing Grade values (for StudentIDs 103 and 106) are replaced by the mode grade ('B', as it appears most frequently).Finally, checking isnull().sum() confirms that there are no missing values left in df_imputed.Consideration: Simple imputation preserves the dataset size but can distort relationships between variables and reduce variance (especially mean imputation). Using the median for skewed numerical data and the mode for categorical data are common starting points.Choosing the Right StrategyAs you've seen, there's no single "best" way to handle missing data. The choice depends on:Amount of Missing Data: Small amounts might be deletable (rows); large amounts in a column might warrant column deletion or require more sophisticated imputation.Nature of the Data: Is it numerical or categorical? Are there outliers?Underlying Cause (if known): Why is the data missing? Randomly, or systematically? (Systematic missingness often requires more careful thought).Goal of Analysis: Will simple imputation affect your model's performance or analytical conclusions?In this practical exercise, you've applied the fundamental techniques: detecting, visualizing, deleting (rows/columns), and basic imputing (mean/median/mode). These methods form the foundation for preparing data for analysis. As you progress, you'll encounter more advanced imputation techniques, but mastering these basics is an essential first step.