Let's put the techniques discussed in this chapter into practice. We'll work through a common workflow: identifying issues in a raw dataset, cleaning them up, and modifying the structure to make it more suitable for analysis.
First, we need some data to work with. We'll create a small DataFrame representing product information, intentionally including some common problems like missing values and a column name containing a space. Make sure you have Pandas imported, typically using import pandas as pd, and NumPy using import numpy as np.
import pandas as pd
import numpy as np
# Create the initial DataFrame
data = {'ProductID': ['P101', 'P102', 'P103', 'P104', 'P105', 'P106'],
        'Category': ['A', 'B', 'A', 'C', 'B', np.nan],
        'Sales': [150, 200, np.nan, 300, 250, 180],
        'Inventory Count': [25, 0, 10, 5, np.nan, 15],
        'Region': ['North', 'South', 'North', 'West', 'South', 'East']}
df_store = pd.DataFrame(data)
print("Original DataFrame:")
print(df_store)
This DataFrame has information about products, including their category, sales figures, inventory count, and region. Notice the np.nan values, which represent missing data.
Before we can handle missing data, we need to find it. The isnull() method returns a boolean DataFrame indicating True where data is missing. Combining it with sum() gives us a count of missing values per column.
# Check for missing values
print("\nMissing values per column:")
print(df_store.isnull().sum())
Output:
Missing values per column:
ProductID          0
Category           1
Sales              1
Inventory Count    1
Region             0
dtype: int64
This tells us we have one missing value each in the Category, Sales, and Inventory Count columns.
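Counting tells us how much data is missing; it can also help to see which rows are affected. One simple way, sketched below, is to build a boolean mask with any(axis=1):
# Show the rows that contain at least one missing value
rows_with_nan = df_store[df_store.isnull().any(axis=1)]
print(rows_with_nan)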
We have several options for dealing with NaN values. If we decide that any row with missing data is unusable, we can drop those rows using dropna().
# Drop rows containing any NaN values
df_dropped_rows = df_store.dropna()
print("\nDataFrame after dropping rows with NaN:")
print(df_dropped_rows)
Notice that rows with index 2, 4, and 5 (where Sales, Inventory Count, and Category were missing, respectively) have been removed. Be careful with this approach, as dropping rows can lead to significant data loss if missing values are widespread.
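If dropping every incomplete row is too aggressive, dropna() accepts parameters for finer control: subset restricts the check to particular columns, and thresh keeps any row with at least that many non-missing values. A brief sketch:
# Drop a row only if its 'Sales' value is missing
df_sales_required = df_store.dropna(subset=['Sales'])
# Keep rows with at least 4 non-missing values (here, every row qualifies)
df_thresh = df_store.dropna(thresh=4)
print(df_sales_required)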
Often, it's better to fill, or impute, missing values. We can fill with a fixed value or a calculated one.
Let's work with the original df_store again. We might decide to fill the missing Category with a placeholder like 'Unknown' and the missing Sales and Inventory Count with the mean of their respective columns.
# Create a copy to avoid modifying the original df_store directly in this step
df_filled = df_store.copy()
# Fill missing Category with 'Unknown'
df_filled['Category'] = df_filled['Category'].fillna('Unknown')
# Calculate mean sales (excluding NaN) and fill missing Sales
mean_sales = df_filled['Sales'].mean()
df_filled['Sales'] = df_filled['Sales'].fillna(mean_sales)
# Calculate mean inventory (excluding NaN) and fill missing Inventory Count
mean_inventory = df_filled['Inventory Count'].mean()
df_filled['Inventory Count'] = df_filled['Inventory Count'].fillna(mean_inventory)
print("\nDataFrame after filling NaN values:")
print(df_filled)
print("\nMissing values after filling:")
print(df_filled.isnull().sum())
Assigning the result of fillna() back to each column replaces the missing values. (Calling fillna() with inplace=True on a single selected column is deprecated in recent Pandas versions, which is why we use plain assignment instead.) Now, all missing values have been replaced. Filling with the mean is a common technique, but the best strategy depends on the specific data and context. Sometimes filling with the median or mode might be more appropriate, especially if the data has outliers or is categorical.
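As a quick sketch of those alternatives on this dataset, median imputation for a numeric column and mode imputation for a categorical one would look like this:
# Median is more robust to outliers than the mean
median_sales = df_store['Sales'].median()
df_alt_filled = df_store.copy()
df_alt_filled['Sales'] = df_alt_filled['Sales'].fillna(median_sales)
# mode() returns a Series of the most frequent values; take the first
mode_category = df_store['Category'].mode()[0]
df_alt_filled['Category'] = df_alt_filled['Category'].fillna(mode_category)
print(df_alt_filled)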
Sometimes, entire columns or specific rows are irrelevant or problematic. Let's assume the Region column isn't needed for our analysis in the df_filled DataFrame.
# Drop the 'Region' column
df_no_region = df_filled.drop(columns=['Region'])
print("\nDataFrame after dropping the 'Region' column:")
print(df_no_region)
We can also drop rows by their index label. Let's say product 'P104' (index 3) is discontinued and should be removed.
# Drop the row with index 3
df_dropped_row_3 = df_no_region.drop(index=3)
print("\nDataFrame after dropping row with index 3:")
print(df_dropped_row_3)
We often need to create new columns based on existing data or modify existing ones.
Let's add a 'SalesPerInventory' column, calculated by dividing 'Sales' by 'Inventory Count'. We need to handle rows where inventory is 0: Pandas doesn't raise an error on division by zero, but it produces an infinite value, which we'd rather avoid.
# Use the DataFrame before dropping row 3 for this example
df_to_modify = df_no_region.copy()
# Add a new column 'SalesPerInventory'
# Replace 0 inventory with NaN so the division yields NaN instead of inf
df_to_modify['SalesPerInventory'] = df_to_modify['Sales'] / df_to_modify['Inventory Count'].replace(0, np.nan)
# Fill the resulting NaN (from the replaced zero) with 0, assuming a 0 ratio in that case
df_to_modify['SalesPerInventory'] = df_to_modify['SalesPerInventory'].fillna(0)
print("\nDataFrame with added 'SalesPerInventory' column:")
print(df_to_modify)
Suppose we want to convert the 'ProductID' column to uppercase for consistency.
# Convert 'ProductID' to uppercase
df_to_modify['ProductID'] = df_to_modify['ProductID'].str.upper()
print("\nDataFrame with 'ProductID' converted to uppercase:")
print(df_to_modify)
Here, we used the .str accessor to apply the string method upper() to the 'ProductID' column.
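The .str accessor exposes most of Python's familiar string methods. As a small illustrative sketch (separate from the running example), we could produce a lowercase version of the IDs or build a boolean mask from a prefix test:
# Illustrative .str operations; these do not modify df_to_modify
print(df_to_modify['ProductID'].str.lower())
starts_with_p10 = df_to_modify['ProductID'].str.startswith('P10')
print(df_to_modify[starts_with_p10])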
Column names might be unclear or contain characters (like spaces) that are inconvenient for coding. Let's rename 'Inventory Count' to something simpler like 'Stock'.
# Rename 'Inventory Count' column
df_renamed = df_to_modify.rename(columns={'Inventory Count': 'Stock'})
print("\nDataFrame with renamed column:")
print(df_renamed)
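When many column names need attention, an alternative is to normalize all of them at once through the columns attribute, which also supports the .str accessor. A sketch of that approach, applied to a copy so the running example is unchanged:
# Normalize every column name: lowercase, spaces to underscores
df_clean_cols = df_to_modify.copy()
df_clean_cols.columns = df_clean_cols.columns.str.lower().str.replace(' ', '_')
print(df_clean_cols.columns.tolist())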
Organizing data by sorting can make it easier to understand.
Let's sort the DataFrame by 'Sales' in descending order.
# Sort by 'Sales' descending
df_sorted_sales = df_renamed.sort_values(by='Sales', ascending=False)
print("\nDataFrame sorted by Sales (descending):")
print(df_sorted_sales)
We can also sort by multiple columns. Let's sort by 'Category' (ascending) and then by 'Stock' (descending) within each category.
# Sort by 'Category' (ascending) then 'Stock' (descending)
df_sorted_multi = df_renamed.sort_values(by=['Category', 'Stock'], ascending=[True, False])
print("\nDataFrame sorted by Category (asc) and Stock (desc):")
print(df_sorted_multi)
If needed, you can also sort by the DataFrame's index using sort_index().
# Sort by index
df_sorted_index = df_sorted_multi.sort_index()
print("\nDataFrame sorted back by index:")
print(df_sorted_index)
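Finally, note that many of these operations return new DataFrames, so they can be combined into a single method chain. As a closing sketch, here is one way to express several of this section's steps in that style:
# One possible chained version of the cleaning steps above
df_pipeline = (
    df_store
    .fillna({'Category': 'Unknown',
             'Sales': df_store['Sales'].mean(),
             'Inventory Count': df_store['Inventory Count'].mean()})
    .drop(columns=['Region'])
    .rename(columns={'Inventory Count': 'Stock'})
    .sort_values(by='Sales', ascending=False)
)
print(df_pipeline)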
This practical session walked through applying the core data cleaning and modification techniques covered in this chapter: finding and handling missing data, adding, removing, and renaming columns, and sorting data. These are fundamental steps in preparing almost any dataset for meaningful analysis. As you work with more complex data, you'll combine these techniques and explore more advanced Pandas features, but this foundation is essential.