We'll work through a common workflow: identifying issues in a raw dataset, cleaning them up, and modifying the structure to make it more suitable for analysis.

## Setting Up Our Example DataFrame

First, we need some data to work with. We'll create a small DataFrame representing product information, intentionally including some common problems like missing values and inconsistent naming. Make sure you have Pandas imported, typically using `import pandas as pd`, and NumPy using `import numpy as np`.

```python
import pandas as pd
import numpy as np

# Create the initial DataFrame
data = {'ProductID': ['P101', 'P102', 'P103', 'P104', 'P105', 'P106'],
        'Category': ['A', 'B', 'A', 'C', 'B', np.nan],
        'Sales': [150, 200, np.nan, 300, 250, 180],
        'Inventory Count': [25, 0, 10, 5, np.nan, 15],
        'Region': ['North', 'South', 'North', 'West', 'South', 'East']}
df_store = pd.DataFrame(data)

print("Original DataFrame:")
print(df_store)
```

This DataFrame has information about products, including their category, sales figures, inventory count, and region. Notice the `np.nan` values, which represent missing data.

## Identifying Missing Data

Before we can handle missing data, we need to find it. The `isnull()` method returns a boolean DataFrame indicating `True` where data is missing. Combining it with `sum()` gives us a count of missing values per column.

```python
# Check for missing values
print("\nMissing values per column:")
print(df_store.isnull().sum())
```

Output:

```
Missing values per column:
ProductID          0
Category           1
Sales              1
Inventory Count    1
Region             0
dtype: int64
```

This tells us we have one missing value each in the Category, Sales, and Inventory Count columns.

## Handling Missing Data

We have several options for dealing with `NaN` values.

### Option 1: Dropping Rows with Missing Data

If we decide that any row with missing data is unusable, we can drop those rows using `dropna()`.

```python
# Drop rows containing any NaN values
df_dropped_rows = df_store.dropna()

print("\nDataFrame after dropping rows with NaN:")
print(df_dropped_rows)
```

Notice that rows with index 2, 4, and 5 (where Sales, Inventory Count, and Category were missing, respectively) have been removed. Be careful with this approach, as dropping rows can lead to significant data loss if missing values are widespread.

### Option 2: Filling Missing Data (Imputation)

Often, it's better to fill, or impute, missing values. We can fill with a fixed value or a calculated one.

Let's work with the original `df_store` again. We might decide to fill the missing Category with a placeholder like 'Unknown' and the missing Sales and Inventory Count with the mean of their respective columns.

```python
# Create a copy to avoid modifying the original df_store directly in this step
df_filled = df_store.copy()

# Fill missing Category with 'Unknown'
df_filled['Category'] = df_filled['Category'].fillna('Unknown')

# Calculate mean sales (excluding NaN) and fill missing Sales
mean_sales = df_filled['Sales'].mean()
df_filled['Sales'] = df_filled['Sales'].fillna(mean_sales)

# Calculate mean inventory (excluding NaN) and fill missing Inventory Count
mean_inventory = df_filled['Inventory Count'].mean()
df_filled['Inventory Count'] = df_filled['Inventory Count'].fillna(mean_inventory)

print("\nDataFrame after filling NaN values:")
print(df_filled)

print("\nMissing values after filling:")
print(df_filled.isnull().sum())
```

Assigning the result of `fillna()` back to each column replaces the missing values. (Avoid calling `fillna(..., inplace=True)` on a single column: that pattern relies on chained assignment, which recent versions of Pandas warn about and newer versions no longer support.) Now, all missing values have been replaced. Filling with the mean is a common technique, but the best strategy depends on the specific data and context. Sometimes filling with the median or mode is more appropriate, especially if the data has outliers or is categorical.
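For example, here's a minimal sketch of median and mode imputation on a fresh copy of `df_store` (the variable name `df_filled_alt` is just for illustration):

```python
# Alternative imputation: median for a numeric column, mode for a categorical one
df_filled_alt = df_store.copy()

# The median is robust to outliers in skewed numeric data
df_filled_alt['Sales'] = df_filled_alt['Sales'].fillna(df_filled_alt['Sales'].median())

# mode() returns a Series (there can be ties), so take the first value
df_filled_alt['Category'] = df_filled_alt['Category'].fillna(df_filled_alt['Category'].mode()[0])

print(df_filled_alt)
```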
## Dropping Columns or Rows

Sometimes, entire columns or specific rows are irrelevant or problematic. Let's assume the Region column isn't needed for our analysis in the `df_filled` DataFrame.

```python
# Drop the 'Region' column
df_no_region = df_filled.drop(columns=['Region'])

print("\nDataFrame after dropping the 'Region' column:")
print(df_no_region)
```

We can also drop rows by their index label. Let's say product 'P104' (index 3) is discontinued and should be removed.

```python
# Drop the row with index label 3
df_dropped_row_3 = df_no_region.drop(index=3)

print("\nDataFrame after dropping row with index 3:")
print(df_dropped_row_3)
```

## Adding and Modifying Columns

We often need to create new columns based on existing data or modify existing ones.

### Adding a New Column

Let's add a 'SalesPerInventory' column, calculated by dividing 'Sales' by 'Inventory Count'. We need to handle the case where inventory is 0, since dividing by zero would produce an unhelpful infinite value.

```python
# Use the DataFrame before dropping row 3 for this example
df_to_modify = df_no_region.copy()

# Add a new column 'SalesPerInventory'
# Replace 0 inventory with NaN so the division yields NaN instead of infinity
df_to_modify['SalesPerInventory'] = (df_to_modify['Sales']
                                     / df_to_modify['Inventory Count'].replace(0, np.nan))

# Fill any resulting NaN with 0, assuming a 0 ratio in those cases
df_to_modify['SalesPerInventory'] = df_to_modify['SalesPerInventory'].fillna(0)

print("\nDataFrame with added 'SalesPerInventory' column:")
print(df_to_modify)
```

### Modifying an Existing Column

Suppose we want to convert the 'ProductID' column to uppercase for consistency.

```python
# Convert 'ProductID' to uppercase
df_to_modify['ProductID'] = df_to_modify['ProductID'].str.upper()

print("\nDataFrame with 'ProductID' converted to uppercase:")
print(df_to_modify)
```

Here, we used the `.str` accessor to apply the string method `upper()` to the 'ProductID' column.

## Renaming Columns

Column names might be unclear or contain characters (like spaces) that are inconvenient for coding. Let's rename 'Inventory Count' to something simpler like 'Stock'.

```python
# Rename the 'Inventory Count' column
df_renamed = df_to_modify.rename(columns={'Inventory Count': 'Stock'})

print("\nDataFrame with renamed column:")
print(df_renamed)
```
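When several column names need the same cleanup, the `.str` accessor shown earlier also works on the columns index itself. Here's a minimal sketch, applied to a copy so the DataFrames above are untouched:

```python
# Normalize all column names at once: strip whitespace,
# lowercase, and replace spaces with underscores
df_clean_cols = df_to_modify.copy()
df_clean_cols.columns = (df_clean_cols.columns
                         .str.strip()
                         .str.lower()
                         .str.replace(' ', '_'))

print(df_clean_cols.columns.tolist())
# ['productid', 'category', 'sales', 'inventory_count', 'salesperinventory']
```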
## Sorting Data

Organizing data by sorting can make it easier to understand.

### Sorting by Values

Let's sort the DataFrame by 'Sales' in descending order.

```python
# Sort by 'Sales' descending
df_sorted_sales = df_renamed.sort_values(by='Sales', ascending=False)

print("\nDataFrame sorted by Sales (descending):")
print(df_sorted_sales)
```

We can also sort by multiple columns. Let's sort by 'Category' (ascending) and then by 'Stock' (descending) within each category.

```python
# Sort by 'Category' (ascending) then 'Stock' (descending)
df_sorted_multi = df_renamed.sort_values(by=['Category', 'Stock'],
                                         ascending=[True, False])

print("\nDataFrame sorted by Category (asc) and Stock (desc):")
print(df_sorted_multi)
```

### Sorting by Index

If needed, you can also sort by the DataFrame's index using `sort_index()`.

```python
# Sort by index
df_sorted_index = df_sorted_multi.sort_index()

print("\nDataFrame sorted back by index:")
print(df_sorted_index)
```

This practical session walked through applying the core data cleaning and modification techniques covered in this chapter: finding and handling missing data; adding, removing, and renaming columns; and sorting data. These are fundamental steps in preparing almost any dataset for meaningful analysis. As you work with more complex data, you'll combine these techniques and explore more advanced Pandas features, but this foundation is essential.
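As a closing illustration, several of the steps from this session can be combined into a single method chain. The pipeline below is one possible sketch over the original `df_store`, not the only reasonable ordering:

```python
# One possible end-to-end cleaning pipeline, chaining the steps from this session
df_pipeline = (
    df_store
    .rename(columns={'Inventory Count': 'Stock'})
    .assign(Category=lambda d: d['Category'].fillna('Unknown'),
            Sales=lambda d: d['Sales'].fillna(d['Sales'].mean()),
            Stock=lambda d: d['Stock'].fillna(d['Stock'].mean()))
    .drop(columns=['Region'])
    .sort_values(by='Sales', ascending=False)
)

print(df_pipeline)
```

Renaming before `assign()` lets us refer to the inventory column by the keyword `Stock`; the original name 'Inventory Count' contains a space and couldn't be used as a keyword argument.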