Let's put the techniques discussed in this chapter into practice. We'll work through a common workflow: identifying issues in a raw dataset, cleaning them up, and modifying the structure to make it more suitable for analysis.
First, we need some data to work with. We'll create a small DataFrame representing product information, intentionally including some common problems like missing values and a column name containing a space. Make sure you have Pandas imported, typically using import pandas as pd, and NumPy using import numpy as np.
import pandas as pd
import numpy as np
# Create the initial DataFrame
data = {'ProductID': ['P101', 'P102', 'P103', 'P104', 'P105', 'P106'],
        'Category': ['A', 'B', 'A', 'C', 'B', np.nan],
        'Sales': [150, 200, np.nan, 300, 250, 180],
        'Inventory Count': [25, 0, 10, 5, np.nan, 15],
        'Region': ['North', 'South', 'North', 'West', 'South', 'East']}
df_store = pd.DataFrame(data)
print("Original DataFrame:")
print(df_store)
This DataFrame has information about products, including their category, sales figures, inventory count, and region. Notice the np.nan values, which represent missing data.
Before we can handle missing data, we need to find it. The isnull() method returns a boolean DataFrame indicating True where data is missing. Combining it with sum() gives us a count of missing values per column.
# Check for missing values
print("\nMissing values per column:")
print(df_store.isnull().sum())
Output:
Missing values per column:
ProductID          0
Category           1
Sales              1
Inventory Count    1
Region             0
dtype: int64
This tells us we have one missing value each in the Category, Sales, and Inventory Count columns.
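Counting tells us how much data is missing; it can also help to see which rows are affected. One simple way, sketched below, is to build a boolean mask with any(axis=1):
# Show the rows that contain at least one missing value
rows_with_nan = df_store[df_store.isnull().any(axis=1)]
print(rows_with_nan)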
We have several options for dealing with NaN values. If we decide that any row with missing data is unusable, we can drop those rows using dropna().
# Drop rows containing any NaN values
df_dropped_rows = df_store.dropna()
print("\nDataFrame after dropping rows with NaN:")
print(df_dropped_rows)
Notice that rows with index 2, 4, and 5 (where Sales, Inventory Count, and Category were missing, respectively) have been removed. Be careful with this approach, as dropping rows can lead to significant data loss if missing values are widespread.
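If dropping every incomplete row is too aggressive, dropna() accepts parameters for finer control: subset restricts the check to particular columns, and thresh keeps any row with at least that many non-missing values. A brief sketch:
# Drop a row only if its 'Sales' value is missing
df_sales_required = df_store.dropna(subset=['Sales'])
# Keep rows with at least 4 non-missing values (here, every row qualifies)
df_thresh = df_store.dropna(thresh=4)
print(df_sales_required)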
Often, it's better to fill, or impute, missing values. We can fill with a fixed value or a calculated one.
Let's work with the original df_store again. We might decide to fill the missing Category with a placeholder like 'Unknown' and the missing Sales and Inventory Count with the mean of their respective columns.
# Create a copy to avoid modifying the original df_store directly in this step
df_filled = df_store.copy()
# Fill missing Category with 'Unknown'
df_filled['Category'] = df_filled['Category'].fillna('Unknown')
# Calculate mean sales (excluding NaN) and fill missing Sales
mean_sales = df_filled['Sales'].mean()
df_filled['Sales'] = df_filled['Sales'].fillna(mean_sales)
# Calculate mean inventory (excluding NaN) and fill missing Inventory Count
mean_inventory = df_filled['Inventory Count'].mean()
df_filled['Inventory Count'] = df_filled['Inventory Count'].fillna(mean_inventory)
print("\nDataFrame after filling NaN values:")
print(df_filled)
print("\nMissing values after filling:")
print(df_filled.isnull().sum())
Assigning the result of fillna() back to each column replaces the missing values. (Calling fillna() with inplace=True on a single selected column is deprecated in recent Pandas versions, which is why we use plain assignment instead.) Now, all missing values have been replaced. Filling with the mean is a common technique, but the best strategy depends on the specific data and context. Sometimes filling with the median or mode might be more appropriate, especially if the data has outliers or is categorical.
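As a quick sketch of those alternatives on this dataset, median imputation for a numeric column and mode imputation for a categorical one would look like this:
# Median is more robust to outliers than the mean
median_sales = df_store['Sales'].median()
df_alt_filled = df_store.copy()
df_alt_filled['Sales'] = df_alt_filled['Sales'].fillna(median_sales)
# mode() returns a Series of the most frequent values; take the first
mode_category = df_store['Category'].mode()[0]
df_alt_filled['Category'] = df_alt_filled['Category'].fillna(mode_category)
print(df_alt_filled)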
Sometimes, entire columns or specific rows are irrelevant or problematic. Let's assume the Region column isn't needed for our analysis in the df_filled DataFrame.
# Drop the 'Region' column
df_no_region = df_filled.drop(columns=['Region'])
print("\nDataFrame after dropping the 'Region' column:")
print(df_no_region)
We can also drop rows by their index label. Let's say product 'P104' (index 3) is discontinued and should be removed.
# Drop the row with index 3
df_dropped_row_3 = df_no_region.drop(index=3)
print("\nDataFrame after dropping row with index 3:")
print(df_dropped_row_3)
We often need to create new columns based on existing data or modify existing ones.
Let's add a 'SalesPerInventory' column, calculated by dividing 'Sales' by 'Inventory Count'. We need to handle rows where inventory is 0: Pandas doesn't raise an error on division by zero, but it produces an infinite value, which we'd rather avoid.
# Use the DataFrame before dropping row 3 for this example
df_to_modify = df_no_region.copy()
# Add a new column 'SalesPerInventory'
# Replace 0 inventory with NaN so the division yields NaN instead of inf
df_to_modify['SalesPerInventory'] = df_to_modify['Sales'] / df_to_modify['Inventory Count'].replace(0, np.nan)
# Fill the resulting NaN (from the replaced zero) with 0, assuming a 0 ratio in that case
df_to_modify['SalesPerInventory'] = df_to_modify['SalesPerInventory'].fillna(0)
print("\nDataFrame with added 'SalesPerInventory' column:")
print(df_to_modify)
Suppose we want to convert the 'ProductID' column to uppercase for consistency.
# Convert 'ProductID' to uppercase
df_to_modify['ProductID'] = df_to_modify['ProductID'].str.upper()
print("\nDataFrame with 'ProductID' converted to uppercase:")
print(df_to_modify)
Here, we used the .str accessor to apply the string method upper() to the 'ProductID' column.
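The .str accessor exposes most of Python's familiar string methods. As a small illustrative sketch (separate from the running example), we could produce a lowercase version of the IDs or build a boolean mask from a prefix test:
# Illustrative .str operations; these do not modify df_to_modify
print(df_to_modify['ProductID'].str.lower())
starts_with_p10 = df_to_modify['ProductID'].str.startswith('P10')
print(df_to_modify[starts_with_p10])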
Column names might be unclear or contain characters (like spaces) that are inconvenient for coding. Let's rename 'Inventory Count' to something simpler like 'Stock'.
# Rename 'Inventory Count' column
df_renamed = df_to_modify.rename(columns={'Inventory Count': 'Stock'})
print("\nDataFrame with renamed column:")
print(df_renamed)
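When many column names need attention, an alternative is to normalize all of them at once through the columns attribute, which also supports the .str accessor. A sketch of that approach, applied to a copy so the running example is unchanged:
# Normalize every column name: lowercase, spaces to underscores
df_clean_cols = df_to_modify.copy()
df_clean_cols.columns = df_clean_cols.columns.str.lower().str.replace(' ', '_')
print(df_clean_cols.columns.tolist())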
Organizing data by sorting can make it easier to understand.
Let's sort the DataFrame by 'Sales' in descending order.
# Sort by 'Sales' descending
df_sorted_sales = df_renamed.sort_values(by='Sales', ascending=False)
print("\nDataFrame sorted by Sales (descending):")
print(df_sorted_sales)
We can also sort by multiple columns. Let's sort by 'Category' (ascending) and then by 'Stock' (descending) within each category.
# Sort by 'Category' (ascending) then 'Stock' (descending)
df_sorted_multi = df_renamed.sort_values(by=['Category', 'Stock'], ascending=[True, False])
print("\nDataFrame sorted by Category (asc) and Stock (desc):")
print(df_sorted_multi)
If needed, you can also sort by the DataFrame's index using sort_index().
# Sort by index
df_sorted_index = df_sorted_multi.sort_index()
print("\nDataFrame sorted back by index:")
print(df_sorted_index)
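Finally, note that many of these operations return new DataFrames, so they can be combined into a single method chain. As a closing sketch, here is one way to express several of this section's steps in that style:
# One possible chained version of the cleaning steps above
df_pipeline = (
    df_store
    .fillna({'Category': 'Unknown',
             'Sales': df_store['Sales'].mean(),
             'Inventory Count': df_store['Inventory Count'].mean()})
    .drop(columns=['Region'])
    .rename(columns={'Inventory Count': 'Stock'})
    .sort_values(by='Sales', ascending=False)
)
print(df_pipeline)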
This practical session walked through applying the core data cleaning and modification techniques covered in this chapter: finding and handling missing data, adding, removing, and renaming columns, and sorting data. These are fundamental steps in preparing almost any dataset for meaningful analysis. As you work with more complex data, you'll combine these techniques and explore more advanced Pandas features, but this foundation is essential.