Sometimes, missing data isn't just scattered across a few rows; it can heavily affect entire columns (features) in your dataset. In the previous section, we looked at removing rows with missing values. Now, we'll consider a different, sometimes necessary, approach: removing entire columns.
This strategy is usually reserved for situations where a specific column has a very high percentage of missing entries. Why? Because if most of the data in a column is missing, that feature might not provide much useful information for your analysis or machine learning model. Trying to fill in a vast number of missing values (imputation, which we'll cover next) in such a column could introduce significant bias or noise, potentially doing more harm than good.
There isn't a strict rule, but a common practice is to consider dropping a column if a large fraction of its values are missing. What constitutes "large"? This often depends on the context, the size of your dataset, and the importance of the feature. Some analysts use thresholds like 50%, 60%, or even 70% missing values as a starting point for considering column deletion.
Imagine you have a dataset with information about customers, including a column for 'Fax Number'. In today's world, very few customers might provide this, leading to perhaps 95% missing values in that column. For most analyses (like predicting purchase behavior), this column is unlikely to be helpful and could be a good candidate for removal.
Let's visualize this. Suppose we calculate the percentage of missing values for several columns in a dataset:
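As a concrete sketch, here is how such percentages might be computed with pandas. The dataset below is a small, hypothetical one constructed only for illustration (the column names and missing counts are made up to match the example that follows):

```python
import pandas as pd

# Hypothetical customer dataset; the missing counts are illustrative only.
n = 20
df = pd.DataFrame({
    "Customer_ID": range(n),
    "Age": [25, 31, None, 42, 29, None, 38, 51, 27, 33,
            45, None, 36, 22, 48, 30, None, 41, 34, 26],
    "Last_Login_Device": [None] * 17 + ["mobile", "desktop", "tablet"],
})

# Fraction of missing values per column, expressed as a percentage.
missing_percentages = df.isnull().mean() * 100
print(missing_percentages.round(1))
# Last_Login_Device comes out at 85.0% missing; Age at 20.0%.
```

Here, `isnull()` marks each cell as missing or not, and taking the column-wise mean of those True/False values gives the fraction of missing entries directly.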
In this example, the 'Last_Login_Device' column has 85% missing data. This high percentage makes it a strong candidate for removal.
Most data analysis libraries provide straightforward functions to drop columns. For instance, with the popular pandas library in Python, you would typically identify the names of the columns you wish to remove and pass them to the drop() method, specifying that you are dropping columns (as opposed to rows).
# Example using pandas (Python)
# Assume 'df' is your DataFrame and you want to drop 'Column_X'
# Calculate the percentage of missing values in each column
missing_percentages = df.isnull().mean() * 100
# Identify columns to drop (e.g., threshold of 70%)
columns_to_drop = missing_percentages[missing_percentages > 70].index
# Drop the identified columns
df_cleaned = df.drop(columns=columns_to_drop)
print(f"Original columns: {df.columns.tolist()}")
print(f"Columns after dropping: {df_cleaned.columns.tolist()}")
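As a variant of the threshold approach above, pandas also offers a more direct route: dropna() with axis=1 and the thresh parameter drops any column with fewer than thresh non-missing values. The sketch below mirrors the 70% cutoff from the example; the two-column DataFrame is purely illustrative:

```python
import pandas as pd

# Toy DataFrame for illustration: one mostly complete column, one sparse one.
df = pd.DataFrame({
    "kept": [1, 2, None, 4, 5, 6, 7, 8, 9, 10],  # 10% missing
    "dropped": [None] * 8 + [1.0, 2.0],           # 80% missing
})

# To drop columns that are more than 70% missing, require that at least
# 30% of each column's values be non-missing.
min_non_missing = int(len(df) * 0.3)
df_cleaned = df.dropna(axis=1, thresh=min_non_missing)

print(df_cleaned.columns.tolist())  # ['kept']
```

This achieves the same result as the explicit filtering shown earlier, but in a single call; the explicit version is often preferable when you want to log or inspect which columns were removed.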
Deleting columns is a trade-off: you simplify the dataset but risk losing potentially useful information, so weigh the decision carefully before dropping a column.
Choosing to delete a column is a more drastic step than deleting individual rows. It's often employed when the proportion of missing data in a feature is so substantial that the feature itself is unlikely to contribute meaningfully to the analysis, or when imputation seems too unreliable.
© 2025 ApX Machine Learning