Alright, let's put the concepts from this chapter into practice. Theory is essential, but applying these techniques to actual data solidifies understanding. In this section, we'll work through a typical workflow: loading a dataset, performing initial checks, and addressing common issues like missing values and duplicates using Pandas.
We'll assume you have a basic Python environment set up with Pandas installed, as discussed in Chapter 1.
First, ensure you have Pandas imported. The conventional alias is pd:
import pandas as pd
import numpy as np # Often useful alongside Pandas
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
For this exercise, imagine we have a dataset named orders.csv containing simplified customer order information. Let's load it into a Pandas DataFrame. If you don't have this exact file, you can create a similar one or adapt the code for a CSV file you do have.
# Assume 'orders.csv' is in the same directory or provide the full path
try:
    df_orders = pd.read_csv('orders.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'orders.csv' not found. Please ensure the file exists or update the path.")
    # As a fallback, create a sample DataFrame for demonstration
    data = {
        'OrderID': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1003, 1008, 1009, 1010],
        'CustomerID': ['CUST-A', 'CUST-B', 'CUST-A', 'CUST-C', 'CUST-B', 'CUST-D', 'CUST-E', 'CUST-A', 'CUST-F', np.nan, 'CUST-G'],
        'Product': ['WidgetA', 'WidgetB', 'WidgetA', 'Gadget', 'WidgetB', 'WidgetA', None, 'WidgetA', 'Gadget', 'WidgetC', 'WidgetA'],
        'Quantity': [2, 1, 1, 3, 2, 1, 1, 1, 2, 1, 3],
        'Price': [10.0, 25.5, 10.0, 55.0, 25.5, 10.0, 10.0, 10.0, 55.0, 30.0, np.nan]
    }
    df_orders = pd.DataFrame(data)
    print("Sample DataFrame created for demonstration.")
Now that the data is loaded into df_orders, let's perform the initial checks we discussed.
Shape: How many rows and columns does the dataset have?
print(f"Dataset dimensions (rows, columns): {df_orders.shape}")
Head and Tail: Look at the first few and last few rows to get a feel for the data values and structure.
print("First 5 rows:")
print(df_orders.head())
print("\nLast 5 rows:")
print(df_orders.tail())
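Besides the first and last rows, a random sample can expose values that only appear in the middle of the file. A minimal sketch using Pandas' standard sample method (the sample size of 5 and the random_state are arbitrary choices):

# A random peek at 5 rows drawn from anywhere in the DataFrame
print("\nRandom sample of 5 rows:")
print(df_orders.sample(5, random_state=42))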
Data Types and Non-Null Counts: Use .info() for a concise summary that includes each column's data type and its count of non-null values.
print("\nDataFrame Info:")
df_orders.info()
Pay attention to the Dtype column (are the types appropriate, e.g., numerical columns as int64 or float64, categorical columns as object or category?) and the Non-Null Count (does it match the total number of rows from .shape? If not, that column has missing values).
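If you prefer to check the types programmatically rather than reading the .info() output, the dtypes attribute returns the same information as a Series, and select_dtypes is one way to pull out just the object-typed columns for a closer look. A minimal sketch on df_orders:

# Inspect column types directly
print(df_orders.dtypes)
# List the object (string-like) columns that may later become 'category'
object_cols = df_orders.select_dtypes(include='object').columns.tolist()
print(f"Object-typed columns: {object_cols}")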
The .info() output likely highlighted columns with missing values. Let's get a precise count.
Count Missing Values: Use .isnull().sum() to count NaN or None values in each column.
print("\nMissing values per column:")
print(df_orders.isnull().sum())
This confirms which columns need attention (CustomerID, Product, and Price in our sample) and how many values are missing in each.
Strategies for Handling Missing Data: As discussed previously, we can delete rows/columns or impute values. The best strategy depends on the context and the amount of missing data.
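One way to ground that decision is to look at the share of missing values per column rather than raw counts. A quick sketch (the 5% threshold is purely illustrative, not a rule):

# Percentage of missing values per column
missing_pct = df_orders.isnull().mean() * 100
print(missing_pct.round(1))
# Columns above an illustrative 5% threshold might call for imputation rather than row deletion
print(missing_pct[missing_pct > 5.0])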
Deletion: If only a very small percentage of rows have missing values, and deleting them won't significantly bias the data, dropping them might be acceptable. Let's see how to drop rows with any missing value:
# Create a copy to avoid modifying the original DataFrame directly during exploration
df_dropped = df_orders.dropna()
print(f"\nShape after dropping rows with any NaN: {df_dropped.shape}")
# Note: This might remove too much data if NaNs are widespread.
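dropna also accepts parameters that make deletion more targeted than the all-or-nothing call above. A sketch of two common variants, again on a copy-style result rather than the original df_orders (the thresh value of 4 is illustrative):

# Drop rows only when a specific key column is missing
df_subset_drop = df_orders.dropna(subset=['CustomerID'])
print(f"Shape after dropping rows missing CustomerID: {df_subset_drop.shape}")
# Keep rows that have at least 4 non-null values
df_thresh_drop = df_orders.dropna(thresh=4)
print(f"Shape after thresh-based drop: {df_thresh_drop.shape}")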
Imputation (Numerical): For the Price column (numerical), we could impute with the mean or median. The median is often preferred if outliers are suspected.
median_price = df_orders['Price'].median()
print(f"\nMedian price for imputation: {median_price}")
# Impute missing prices using the median, assigning the result back to the original df
df_orders['Price'] = df_orders['Price'].fillna(median_price)
# Verify imputation
print("\nMissing values after Price imputation:")
print(df_orders.isnull().sum())
Note: You may also see this written as df_orders['Price'].fillna(median_price, inplace=True). Be cautious with that form: calling fillna with inplace=True on a single selected column is a chained operation that recent Pandas versions warn about and, with copy-on-write enabled, may not modify df_orders at all. Assigning the result back, as shown above, is the safer pattern.
Imputation (Categorical): For CustomerID and Product (categorical), we could use the mode (most frequent value) or introduce a new category like 'Unknown'. Using the mode might be less suitable if there isn't a dominant category. Let's use 'Unknown' for Product and demonstrate mode imputation for CustomerID (though using 'Unknown' might also be valid here).
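Before settling on the mode, it can be worth checking how concentrated the CustomerID values actually are; value_counts makes that quick. A minimal sketch on the sample data:

# How dominant is the most frequent CustomerID?
print(df_orders['CustomerID'].value_counts().head())
print(df_orders['CustomerID'].value_counts(normalize=True).head())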
# Impute Product with 'Unknown'
df_orders['Product'] = df_orders['Product'].fillna('Unknown')
# Impute CustomerID with the mode
mode_customer = df_orders['CustomerID'].mode()[0] # .mode() can return multiple values if tied
print(f"\nMode CustomerID for imputation: {mode_customer}")
df_orders['CustomerID'] = df_orders['CustomerID'].fillna(mode_customer)
# Final check for missing values
print("\nMissing values after all imputations:")
print(df_orders.isnull().sum())
Duplicate rows can skew analysis. Let's check for and handle them.
Identify Duplicates: Count the number of fully duplicate rows.
duplicate_count = df_orders.duplicated().sum()
print(f"\nNumber of duplicate rows found: {duplicate_count}")
Inspect Duplicates (Optional): If duplicates are found, you might want to see them.
if duplicate_count > 0:
    print("\nDuplicate rows:")
    print(df_orders[df_orders.duplicated(keep=False)]) # Show all occurrences
Note: keep=False marks all duplicates as True, making them easy to filter and view.
Remove Duplicates: Use .drop_duplicates(). The keep argument determines which duplicate to retain ('first' is the default; 'last' keeps the last occurrence, and False drops all duplicates). We'll keep the first occurrence.
# Keep the first occurrence and remove subsequent duplicates
original_rows = df_orders.shape[0]
df_orders.drop_duplicates(inplace=True)
print(f"\nShape after dropping duplicates: {df_orders.shape}")
print(f"Number of rows removed: {original_rows - df_orders.shape[0]}")
# Verify duplicates are gone
print(f"\nNumber of duplicate rows after removal: {df_orders.duplicated().sum()}")
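Full-row duplicates are not the only kind worth checking. If a column such as OrderID should uniquely identify a row, duplicates restricted to that column deserve a look as well; a sketch (resetting the index afterwards is optional but keeps row labels contiguous after the removals above):

# Check for rows that share an OrderID even if other columns differ
key_duplicates = df_orders.duplicated(subset=['OrderID']).sum()
print(f"Rows with a repeated OrderID: {key_duplicates}")
# Optionally rebuild a contiguous 0..n-1 index after dropping rows
df_orders = df_orders.reset_index(drop=True)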
After loading, inspecting, handling missing values, and removing duplicates, let's take one last look at the cleaned data's information.
print("\nFinal DataFrame Info after initial cleanup:")
df_orders.info()
print("\nFirst 5 rows of cleaned data:")
print(df_orders.head())
The data should now have consistent data types (though you might refine these later, e.g., converting object columns to category), no missing values, and no duplicate rows. It's now in a much better state for the univariate and bivariate analyses covered in the upcoming chapters. This practical exercise demonstrates the essential first steps in preparing almost any dataset for exploration. Remember that the specific imputation strategies might change based on deeper domain knowledge or the goals of your analysis.
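As a preview of that dtype refinement, converting a low-cardinality text column to the category dtype is a one-liner. A sketch using the Product column (whether it is worth doing depends on the column's cardinality and how you use it downstream):

# Convert Product to a categorical dtype to save memory and signal intent
df_orders['Product'] = df_orders['Product'].astype('category')
print(df_orders.dtypes)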