Now that you've learned the fundamental operations for manipulating data with Pandas, it's time to apply these skills in a practical scenario. Data wrangling, the process of cleaning, transforming, and organizing raw data, is a significant part of any data analysis or machine learning project. This section provides hands-on exercises to solidify your understanding of loading, cleaning, transforming, grouping, and merging data using Pandas.
We will work through a series of tasks on a sample dataset, simulating a common workflow you might encounter.
Let's start by creating a sample DataFrame representing product sales data. This dataset includes common issues like missing values and inconsistent data types that we need to address.
import pandas as pd
import numpy as np
# Sample Sales Data
data = {
'ProductID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006', 'P007', 'P008', 'P009', 'P010'],
'Category': ['Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing', 'Electronics', 'Home Goods', None, 'Clothing', 'Electronics'],
'Price': [599.99, 49.95, 1200.00, 25.50, 35.00, 750.80, None, 15.00, 55.00, 1150.50],
'UnitsSold': [50, 120, 15, 200, 150, 25, 80, 300, 110, None],
'Discount': ['5%', '10%', '5%', 'None', '15%', '7%', '10%', '5%', '12%', '8%']
}
sales_df = pd.DataFrame(data)
# Display initial DataFrame
print("Initial Sales Data:")
print(sales_df)
# Display data types and non-null counts
print("\nInitial DataFrame Info:")
sales_df.info()
First, examine the DataFrame. Use .info() (as shown above) and .describe() to understand its structure, data types, and summary statistics. Identify potential issues such as missing values (NaN or None) and incorrect data types.
# Get summary statistics for numerical columns
print("\nSummary Statistics:")
print(sales_df.describe())
# Check for missing values
print("\nMissing Values per Column:")
print(sales_df.isnull().sum())
You'll notice missing values in Category, Price, and UnitsSold. The Discount column is also stored as an object (string) rather than a numerical value, which prevents direct calculation.
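The mismatch is easy to demonstrate on a tiny standalone example (made-up values, not the sales frame above): arithmetic between a numeric Series and an object-dtype Series of percentage strings raises a TypeError, which is why the conversion step below is needed.

```python
import pandas as pd

price = pd.Series([100.0, 200.0])
discount = pd.Series(['5%', '10%'])  # object dtype, not numeric

# Elementwise 100.0 * '5%' is not defined for a float and a string,
# so pandas propagates the TypeError.
try:
    price * discount
except TypeError as exc:
    print(f"arithmetic on object dtype failed: {exc}")
```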
Address the missing values identified in Task 1. Different strategies might be appropriate for different columns.
- Since Category is categorical, we can fill the missing value with a placeholder like 'Unknown' or with the mode (most frequent category). Let's use 'Unknown'.
- For Price, filling with the mean or median is common. Given the potential for outliers (like P003), the median might be more robust.
- We'll fill UnitsSold with the mean number of units sold, for simplicity in this exercise.
# Fill missing Category with 'Unknown'
# Assign the result back rather than using inplace=True, which is
# deprecated for this chained pattern in recent pandas versions
sales_df['Category'] = sales_df['Category'].fillna('Unknown')
# Calculate median price (excluding NaN) and fill missing Price
median_price = sales_df['Price'].median()
sales_df['Price'] = sales_df['Price'].fillna(median_price)
# Calculate mean units sold (excluding NaN) and fill missing UnitsSold
mean_units_sold = sales_df['UnitsSold'].mean()
sales_df['UnitsSold'] = sales_df['UnitsSold'].fillna(mean_units_sold)
# Verify missing values are handled
print("\nMissing Values After Handling:")
print(sales_df.isnull().sum())
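A global median fill is a reasonable default, but when a column varies strongly across groups, a group-wise fill is often more faithful. As a standalone sketch (made-up numbers, not the sales data above), groupby(...).transform('median') produces a Series aligned to the original index that fillna can use directly:

```python
import pandas as pd

# Prices differ sharply by category, so a per-category median
# is a better fill value than the overall median.
df = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Clothing'],
    'Price': [1000.0, None, 40.0, None],
})

# transform returns each row's group median, index-aligned with df
group_median = df.groupby('Category')['Price'].transform('median')
df['Price'] = df['Price'].fillna(group_median)

print(df['Price'].tolist())  # [1000.0, 1000.0, 40.0, 40.0]
```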
Now, let's clean up existing columns and create new, potentially useful ones.
- Convert the Discount column from a string percentage to a numerical float representing the discount factor (e.g., '5%' becomes 0.05). Handle the 'None' value appropriately, perhaps converting it to a 0% discount.
- Create a TotalRevenue column, calculated as Price * UnitsSold.
- Create DiscountedRevenue using the TotalRevenue and the numerical Discount factor.
- Since UnitsSold represents counts, convert it to an integer type after handling missing values.
# Convert Discount string to a numerical float
# Replace 'None' with '0%', remove '%', convert to float, and divide by 100
sales_df['Discount'] = sales_df['Discount'].replace('None', '0%')
sales_df['DiscountFactor'] = sales_df['Discount'].str.replace('%', '').astype(float) / 100.0
# Convert UnitsSold to integer type (round first: the mean fill leaves a fractional value)
sales_df['UnitsSold'] = sales_df['UnitsSold'].round().astype(int)
# Create TotalRevenue column
sales_df['TotalRevenue'] = sales_df['Price'] * sales_df['UnitsSold']
# Create DiscountedRevenue column
sales_df['DiscountedRevenue'] = sales_df['TotalRevenue'] * (1 - sales_df['DiscountFactor'])
# Display the transformed DataFrame structure and sample data
print("\nDataFrame After Transformations:")
sales_df.info()  # .info() prints its summary directly and returns None
print("\nSample Data After Transformations:")
print(sales_df.head())
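The replace/str.replace/astype chain above works when every entry matches the 'N%' pattern. If you're less sure of the data, pd.to_numeric with errors='coerce' turns anything unparseable into NaN instead of raising, and you can then fill those gaps. A minimal sketch on a standalone Series (the 'n/a' entry is a made-up example of messy input):

```python
import pandas as pd

discount = pd.Series(['5%', '10%', 'None', '7%', 'n/a'])

# rstrip('%') drops a trailing percent sign; errors='coerce' maps
# unparseable strings ('None', 'n/a') to NaN instead of raising
factor = pd.to_numeric(discount.str.rstrip('%'), errors='coerce') / 100.0

# Treat anything unparseable as a 0% discount
factor = factor.fillna(0.0)
print(factor.tolist())  # [0.05, 0.1, 0.0, 0.07, 0.0]
```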
Use the groupby() method to gain insights by aggregating data based on categories.
# Average price per category
avg_price_per_category = sales_df.groupby('Category')['Price'].mean()
print("\nAverage Price per Category:")
print(avg_price_per_category)
# Total units sold per category
total_units_per_category = sales_df.groupby('Category')['UnitsSold'].sum()
print("\nTotal Units Sold per Category:")
print(total_units_per_category)
# Multiple aggregations: Total revenue and average discount per category
category_summary = sales_df.groupby('Category').agg(
TotalRevenue=('TotalRevenue', 'sum'),
AverageDiscount=('DiscountFactor', 'mean')
)
print("\nCategory Summary (Total Revenue & Avg Discount):")
print(category_summary)
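A common follow-up to grouped totals is each category's share of the whole. As a self-contained sketch (made-up revenue figures, not the values computed above), dividing the grouped column by its own sum gives a proportion, and sorting makes the ranking obvious:

```python
import pandas as pd

summary = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Home Goods'],
    'TotalRevenue': [60000.0, 30000.0, 10000.0],
}).set_index('Category')

# Each category's fraction of the grand total
summary['RevenueShare'] = summary['TotalRevenue'] / summary['TotalRevenue'].sum()

# Sort so the largest contributor comes first
summary = summary.sort_values('RevenueShare', ascending=False)
print(summary)
```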
Imagine you have another dataset containing store information for each product. Let's create a small DataFrame for this and merge it with our sales_df.
# Sample Store Data
store_data = {
'StoreID': ['S1', 'S2', 'S1', 'S3', 'S2', 'S3', 'S1', 'S2', 'S3', 'S1'],
'ProductID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006', 'P007', 'P008', 'P009', 'P010']
}
store_df = pd.DataFrame(store_data)
print("\nStore Information:")
print(store_df)
# Merge sales_df with store_df based on ProductID
full_df = pd.merge(sales_df, store_df, on='ProductID', how='left')
print("\nMerged DataFrame (Sales + Store Info):")
print(full_df.head())
This merge adds the StoreID to each product row based on the common ProductID column.
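With StoreID attached, the same groupby pattern works at the store level. The sketch below uses tiny made-up frames (not the full sales data); it also shows merge's indicator=True flag, which adds a _merge column useful for confirming that every sales row found a matching store:

```python
import pandas as pd

sales = pd.DataFrame({
    'ProductID': ['P001', 'P002', 'P003'],
    'TotalRevenue': [100.0, 200.0, 50.0],
})
stores = pd.DataFrame({
    'ProductID': ['P001', 'P002', 'P003'],
    'StoreID': ['S1', 'S2', 'S1'],
})

# indicator=True records each row's provenance: 'both', 'left_only', or 'right_only'
merged = pd.merge(sales, stores, on='ProductID', how='left', indicator=True)
assert (merged['_merge'] == 'both').all()  # every sale matched a store

# Aggregate revenue per store, just like the per-category groupby earlier
revenue_per_store = merged.groupby('StoreID')['TotalRevenue'].sum()
print(revenue_per_store)
```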
Let's create a simple visualization using the grouped data. We can plot the total units sold per category using Plotly Express (often imported as px). This gives a quick visual summary of our aggregated results.
import plotly.express as px
# Prepare data for plotting (using the previously calculated series)
plot_data = total_units_per_category.reset_index() # Convert Series to DataFrame
# Create bar chart
fig = px.bar(plot_data, x='Category', y='UnitsSold',
title='Total Units Sold per Product Category',
color='Category',
color_discrete_map={ # Optional: Use course color palette
'Electronics': '#1c7ed6',
'Clothing': '#d6336c',
'Home Goods': '#0ca678',
'Unknown': '#adb5bd'
},
labels={'UnitsSold': 'Total Units Sold'})
# Show plot (in a notebook/compatible environment) or display JSON
# fig.show() # Uncomment this line if running in an environment that supports rendering plots
# Display Plotly JSON for web rendering
plotly_json = fig.to_json()
print("\nPlotly Chart JSON:")
print(plotly_json)
Bar chart displaying the sum of units sold grouped by product category.
Through these tasks, you have practiced:
- Handling missing data (fillna).
- Converting data types (.astype()).
- Using groupby() to summarize information by category.
- Combining DataFrames with pd.merge().
These steps represent a common sequence in data preparation. Mastering these Pandas operations is fundamental for preparing data effectively for subsequent analysis or machine learning model training. Feel free to modify the sample data or apply these techniques to other datasets you encounter.
© 2025 ApX Machine Learning