A typical workflow for data processing involves loading a dataset, performing initial checks, and addressing common issues like missing values and duplicates using Pandas. We'll assume you have a basic Python environment set up with Pandas installed, as discussed in Chapter 1.

### Setting Up and Loading Data

First, ensure you have Pandas imported. The conventional alias is `pd`:

```python
import pandas as pd
import numpy as np  # Often useful alongside Pandas

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
```

For this exercise, imagine we have a dataset named `orders.csv` containing simplified customer order information. Let's load it into a Pandas DataFrame. If you don't have this exact file, you can create a similar one or adapt the code for a CSV file you do have.

```python
# Assume 'orders.csv' is in the same directory, or provide the full path
try:
    df_orders = pd.read_csv('orders.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'orders.csv' not found. Please ensure the file exists or update the path.")
    # As a fallback, create a sample DataFrame for demonstration
    data = {
        'OrderID': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1003, 1008, 1009, 1010],
        'CustomerID': ['CUST-A', 'CUST-B', 'CUST-A', 'CUST-C', 'CUST-B', 'CUST-D',
                       'CUST-E', 'CUST-A', 'CUST-F', np.nan, 'CUST-G'],
        'Product': ['WidgetA', 'WidgetB', 'WidgetA', 'Gadget', 'WidgetB', 'WidgetA',
                    None, 'WidgetA', 'Gadget', 'WidgetC', 'WidgetA'],
        'Quantity': [2, 1, 1, 3, 2, 1, 1, 1, 2, 1, 3],
        'Price': [10.0, 25.5, 10.0, 55.0, 25.5, 10.0, 10.0, 10.0, 55.0, 30.0, np.nan]
    }
    df_orders = pd.DataFrame(data)
    print("Sample DataFrame created for demonstration.")
```

### Initial Data Inspection

Now that the data is loaded into `df_orders`, let's perform the initial checks we discussed.

**Shape:** How many rows and columns does the dataset have?

```python
print(f"Dataset dimensions (rows, columns): {df_orders.shape}")
```

**Head and Tail:** Look at the first few and last few rows to get a feel for the data values and structure.

```python
print("First 5 rows:")
print(df_orders.head())

print("\nLast 5 rows:")
print(df_orders.tail())
```

**Data Types and Non-Null Counts:** Use `.info()` for a concise summary, including data types and non-null counts per column.

```python
print("\nDataFrame Info:")
df_orders.info()
```

Pay attention to the `Dtype` column (are the types appropriate, e.g., numerical columns as `int64` or `float64`, categorical as `object` or `category`?) and the `Non-Null Count` (does it match the total number of rows from `.shape`? If not, there are missing values).

### Identifying and Handling Missing Data

The `.info()` output likely highlighted columns with missing values. Let's get a precise count.

**Count Missing Values:** Use `.isnull().sum()` to count `NaN` or `None` values in each column.

```python
print("\nMissing values per column:")
print(df_orders.isnull().sum())
```

This confirms which columns need attention (`CustomerID`, `Product`, and `Price` in our sample) and how many values are missing in each.

**Strategies for Handling Missing Data:** As discussed previously, we can delete rows/columns or impute values. The best strategy depends on the context and the amount of missing data.

**Deletion:** If only a very small percentage of rows have missing values, and deleting them won't significantly bias the data, dropping them might be acceptable.
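To gauge whether deletion is a reasonable option, it can help to look first at the share of missing values in each column. The following is a minimal sketch building on the `df_orders` loaded above; the ~5% figure in the comment is only an illustrative assumption, not a rule from this chapter.

```python
# Share of missing values per column, expressed as a percentage
missing_pct = df_orders.isnull().mean() * 100
print(missing_pct.round(1).sort_values(ascending=False))

# Illustrative assumption: if a column is missing in well under ~5% of rows,
# dropping those rows may be tolerable; larger gaps usually call for imputation.
```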
Let's see how to drop rows with any missing value:

```python
# .dropna() returns a new DataFrame by default, so assigning it to a new
# variable leaves the original df_orders untouched during exploration
df_dropped = df_orders.dropna()
print(f"\nShape after dropping rows with any NaN: {df_dropped.shape}")
# Note: This might remove too much data if NaNs are widespread.
```

**Imputation (Numerical):** For the `Price` column (numerical), we could impute with the mean or median. The median is often preferred if outliers are suspected.

```python
median_price = df_orders['Price'].median()
print(f"\nMedian price for imputation: {median_price}")

# Impute missing prices using the median - operating on the original df now
df_orders['Price'].fillna(median_price, inplace=True)

# Verify imputation
print("\nMissing values after Price imputation:")
print(df_orders.isnull().sum())
```

Note: The `inplace=True` argument modifies the DataFrame directly. Be cautious when using it; it's often safer to assign the result back, like `df_orders['Price'] = df_orders['Price'].fillna(median_price)`, and newer versions of Pandas may warn about in-place modification of a single selected column.

**Imputation (Categorical):** For `CustomerID` and `Product` (categorical), we could use the mode (most frequent value) or introduce a new category like `'Unknown'`. The mode is less suitable when there isn't a clearly dominant category. Let's use `'Unknown'` for `Product` and demonstrate mode imputation for `CustomerID` (though using `'Unknown'` might also be valid here).

```python
# Impute Product with 'Unknown'
df_orders['Product'].fillna('Unknown', inplace=True)

# Impute CustomerID with the mode
mode_customer = df_orders['CustomerID'].mode()[0]  # .mode() can return multiple values if tied
print(f"\nMode CustomerID for imputation: {mode_customer}")
df_orders['CustomerID'].fillna(mode_customer, inplace=True)

# Final check for missing values
print("\nMissing values after all imputations:")
print(df_orders.isnull().sum())
```

### Detecting and Handling Duplicate Records

Duplicate rows can skew analysis. Let's check for and handle them.

**Identify Duplicates:** Count the number of fully duplicate rows.

```python
duplicate_count = df_orders.duplicated().sum()
print(f"\nNumber of duplicate rows found: {duplicate_count}")
```

**Inspect Duplicates (Optional):** If duplicates are found, you might want to see them.

```python
if duplicate_count > 0:
    print("\nDuplicate rows:")
    print(df_orders[df_orders.duplicated(keep=False)])  # Show all occurrences
```

Note: `keep=False` marks all occurrences of a duplicate as `True`, making them easy to filter and view.

**Remove Duplicates:** Use `.drop_duplicates()`. The `keep` argument determines which duplicate to retain (`'first'` is the default; `'last'` keeps the last occurrence, and `False` drops all occurrences). We'll keep the first occurrence.

```python
# Keep the first occurrence and remove subsequent duplicates
original_rows = df_orders.shape[0]
df_orders.drop_duplicates(inplace=True)
print(f"\nShape after dropping duplicates: {df_orders.shape}")
print(f"Number of rows removed: {original_rows - df_orders.shape[0]}")

# Verify duplicates are gone
print(f"\nNumber of duplicate rows after removal: {df_orders.duplicated().sum()}")
```

### Final Check

After loading, inspecting, handling missing values, and removing duplicates, let's take one last look at the cleaned data.

```python
print("\nFinal DataFrame Info after initial cleanup:")
df_orders.info()

print("\nFirst 5 rows of cleaned data:")
print(df_orders.head())
```

The data should now have consistent data types (though you might refine these later, e.g., converting `object` to `category`), no missing values, and no duplicate rows. It's now in a much better state for the univariate and bivariate analyses covered in the upcoming chapters.
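As an optional follow-up to that dtype remark, the refinement could look like the sketch below, converting the repetitive text columns to the more memory-efficient `category` dtype. The choice of columns here is just an assumption based on our sample data.

```python
# Optional refinement (illustrative): convert low-cardinality text columns to 'category'
for col in ['CustomerID', 'Product']:
    df_orders[col] = df_orders[col].astype('category')

df_orders.info()  # Dtype for these columns should now read 'category'
```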
This practical exercise demonstrates the essential first steps in preparing almost any dataset for exploration. Remember that the specific imputation strategies might change based on deeper domain knowledge or the goals of your analysis.
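For instance, if domain knowledge told us that prices vary by product, a group-wise median could be a more faithful fill value than the single global median used above. The sketch below is illustrative only and assumes a `df_orders` whose `Price` column still contains missing values (i.e., before the imputation step earlier in this section).

```python
# Illustrative alternative: fill missing prices with the median price of the same product
price_by_product = df_orders.groupby('Product')['Price'].transform('median')
df_orders['Price'] = df_orders['Price'].fillna(price_by_product)

# Fall back to the overall median for products with no observed prices at all
df_orders['Price'] = df_orders['Price'].fillna(df_orders['Price'].median())
```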