Now that you've learned the fundamental operations for manipulating data with Pandas, it's time to apply these skills in a practical scenario. Data wrangling, the process of cleaning, transforming, and organizing raw data, is a significant part of any data analysis or machine learning project. This section provides hands-on exercises to solidify your understanding of loading, cleaning, transforming, grouping, and merging data using Pandas.
We will work through a series of tasks on a sample dataset, simulating a common workflow you might encounter.
Let's start by creating a sample DataFrame representing product sales data. This dataset includes common issues like missing values and inconsistent data types that we need to address.
import pandas as pd
import numpy as np
# Sample Sales Data
data = {
'ProductID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006', 'P007', 'P008', 'P009', 'P010'],
'Category': ['Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing', 'Electronics', 'Home Goods', None, 'Clothing', 'Electronics'],
'Price': [599.99, 49.95, 1200.00, 25.50, 35.00, 750.80, None, 15.00, 55.00, 1150.50],
'UnitsSold': [50, 120, 15, 200, 150, 25, 80, 300, 110, None],
'Discount': ['5%', '10%', '5%', 'None', '15%', '7%', '10%', '5%', '12%', '8%']
}
sales_df = pd.DataFrame(data)
# Display initial DataFrame
print("Initial Sales Data:")
print(sales_df)
# Display data types and non-null counts
print("\nInitial DataFrame Info:")
sales_df.info()
First, examine the DataFrame. Use .info() (as shown above) and .describe() to understand its structure, data types, and summary statistics. Identify potential issues such as missing values (NaN or None) and incorrect data types.
# Get summary statistics for numerical columns
print("\nSummary Statistics:")
print(sales_df.describe())
# Check for missing values
print("\nMissing Values per Column:")
print(sales_df.isnull().sum())
You'll notice missing values in Category, Price, and UnitsSold. The Discount column is also stored as an object (string) rather than a numerical value, which prevents direct calculation.
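The mismatch is easy to demonstrate on a tiny standalone example (made-up values, not the sales frame above): arithmetic between a numeric Series and an object-dtype Series of percentage strings raises a TypeError, which is why the conversion step below is needed.

```python
import pandas as pd

price = pd.Series([100.0, 200.0])
discount = pd.Series(['5%', '10%'])  # object dtype, not numeric

# Elementwise 100.0 * '5%' is not defined for a float and a string,
# so pandas propagates the TypeError.
try:
    price * discount
except TypeError as exc:
    print(f"arithmetic on object dtype failed: {exc}")
```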
Address the missing values identified in Task 1. Different strategies might be appropriate for different columns.
- Since Category is categorical, we can fill the missing value with a placeholder like 'Unknown' or with the mode (most frequent category). Let's use 'Unknown'.
- For Price, filling with the mean or median is common. Given the potential for outliers (like P003), the median might be more robust.
- We'll fill UnitsSold with the mean number of units sold, for simplicity in this exercise.
# Fill missing Category with 'Unknown'
# Assign the result back rather than using inplace=True, which is
# deprecated for this chained pattern in recent pandas versions
sales_df['Category'] = sales_df['Category'].fillna('Unknown')
# Calculate median price (excluding NaN) and fill missing Price
median_price = sales_df['Price'].median()
sales_df['Price'] = sales_df['Price'].fillna(median_price)
# Calculate mean units sold (excluding NaN) and fill missing UnitsSold
mean_units_sold = sales_df['UnitsSold'].mean()
sales_df['UnitsSold'] = sales_df['UnitsSold'].fillna(mean_units_sold)
# Verify missing values are handled
print("\nMissing Values After Handling:")
print(sales_df.isnull().sum())
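A global median fill is a reasonable default, but when a column varies strongly across groups, a group-wise fill is often more faithful. As a standalone sketch (made-up numbers, not the sales data above), groupby(...).transform('median') produces a Series aligned to the original index that fillna can use directly:

```python
import pandas as pd

# Prices differ sharply by category, so a per-category median
# is a better fill value than the overall median.
df = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Clothing', 'Clothing'],
    'Price': [1000.0, None, 40.0, None],
})

# transform returns each row's group median, index-aligned with df
group_median = df.groupby('Category')['Price'].transform('median')
df['Price'] = df['Price'].fillna(group_median)

print(df['Price'].tolist())  # [1000.0, 1000.0, 40.0, 40.0]
```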
Now, let's clean up existing columns and create new, potentially useful ones.
- Convert the Discount column from a string percentage to a numerical float representing the discount factor (e.g., '5%' becomes 0.05). Handle the 'None' value appropriately, perhaps converting it to a 0% discount.
- Create a TotalRevenue column, calculated as Price * UnitsSold.
- Create DiscountedRevenue using the TotalRevenue and the numerical Discount factor.
- Since UnitsSold represents counts, convert it to an integer type after handling missing values.
# Convert Discount string to a numerical float
# Replace 'None' with '0%', remove '%', convert to float, and divide by 100
sales_df['Discount'] = sales_df['Discount'].replace('None', '0%')
sales_df['DiscountFactor'] = sales_df['Discount'].str.replace('%', '').astype(float) / 100.0
# Convert UnitsSold to integer type (round first: the mean fill leaves a fractional value)
sales_df['UnitsSold'] = sales_df['UnitsSold'].round().astype(int)
# Create TotalRevenue column
sales_df['TotalRevenue'] = sales_df['Price'] * sales_df['UnitsSold']
# Create DiscountedRevenue column
sales_df['DiscountedRevenue'] = sales_df['TotalRevenue'] * (1 - sales_df['DiscountFactor'])
# Display the transformed DataFrame structure and sample data
print("\nDataFrame After Transformations:")
sales_df.info()  # .info() prints its summary directly and returns None
print("\nSample Data After Transformations:")
print(sales_df.head())
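The replace/str.replace/astype chain above works when every entry matches the 'N%' pattern. If you're less sure of the data, pd.to_numeric with errors='coerce' turns anything unparseable into NaN instead of raising, and you can then fill those gaps. A minimal sketch on a standalone Series (the 'n/a' entry is a made-up example of messy input):

```python
import pandas as pd

discount = pd.Series(['5%', '10%', 'None', '7%', 'n/a'])

# rstrip('%') drops a trailing percent sign; errors='coerce' maps
# unparseable strings ('None', 'n/a') to NaN instead of raising
factor = pd.to_numeric(discount.str.rstrip('%'), errors='coerce') / 100.0

# Treat anything unparseable as a 0% discount
factor = factor.fillna(0.0)
print(factor.tolist())  # [0.05, 0.1, 0.0, 0.07, 0.0]
```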
Use the groupby() method to gain insights by aggregating data based on categories.
# Average price per category
avg_price_per_category = sales_df.groupby('Category')['Price'].mean()
print("\nAverage Price per Category:")
print(avg_price_per_category)
# Total units sold per category
total_units_per_category = sales_df.groupby('Category')['UnitsSold'].sum()
print("\nTotal Units Sold per Category:")
print(total_units_per_category)
# Multiple aggregations: Total revenue and average discount per category
category_summary = sales_df.groupby('Category').agg(
TotalRevenue=('TotalRevenue', 'sum'),
AverageDiscount=('DiscountFactor', 'mean')
)
print("\nCategory Summary (Total Revenue & Avg Discount):")
print(category_summary)
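A common follow-up to grouped totals is each category's share of the whole. As a self-contained sketch (made-up revenue figures, not the values computed above), dividing the grouped column by its own sum gives a proportion, and sorting makes the ranking obvious:

```python
import pandas as pd

summary = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Home Goods'],
    'TotalRevenue': [60000.0, 30000.0, 10000.0],
}).set_index('Category')

# Each category's fraction of the grand total
summary['RevenueShare'] = summary['TotalRevenue'] / summary['TotalRevenue'].sum()

# Sort so the largest contributor comes first
summary = summary.sort_values('RevenueShare', ascending=False)
print(summary)
```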
Imagine you have another dataset containing store information for each product. Let's create a small DataFrame for this and merge it with our sales_df.
# Sample Store Data
store_data = {
'StoreID': ['S1', 'S2', 'S1', 'S3', 'S2', 'S3', 'S1', 'S2', 'S3', 'S1'],
'ProductID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006', 'P007', 'P008', 'P009', 'P010']
}
store_df = pd.DataFrame(store_data)
print("\nStore Information:")
print(store_df)
# Merge sales_df with store_df based on ProductID
full_df = pd.merge(sales_df, store_df, on='ProductID', how='left')
print("\nMerged DataFrame (Sales + Store Info):")
print(full_df.head())
This merge adds the StoreID to each product row based on the common ProductID column.
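With StoreID attached, the same groupby pattern works at the store level. The sketch below uses tiny made-up frames (not the full sales data); it also shows merge's indicator=True flag, which adds a _merge column useful for confirming that every sales row found a matching store:

```python
import pandas as pd

sales = pd.DataFrame({
    'ProductID': ['P001', 'P002', 'P003'],
    'TotalRevenue': [100.0, 200.0, 50.0],
})
stores = pd.DataFrame({
    'ProductID': ['P001', 'P002', 'P003'],
    'StoreID': ['S1', 'S2', 'S1'],
})

# indicator=True records each row's provenance: 'both', 'left_only', or 'right_only'
merged = pd.merge(sales, stores, on='ProductID', how='left', indicator=True)
assert (merged['_merge'] == 'both').all()  # every sale matched a store

# Aggregate revenue per store, just like the per-category groupby earlier
revenue_per_store = merged.groupby('StoreID')['TotalRevenue'].sum()
print(revenue_per_store)
```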
Let's create a simple visualization using the grouped data. We can plot the total units sold per category using Plotly Express (often imported as px). This gives a quick visual summary of our aggregated results.
import plotly.express as px
# Prepare data for plotting (using the previously calculated series)
plot_data = total_units_per_category.reset_index() # Convert Series to DataFrame
# Create bar chart
fig = px.bar(plot_data, x='Category', y='UnitsSold',
title='Total Units Sold per Product Category',
color='Category',
color_discrete_map={ # Optional: Use course color palette
'Electronics': '#1c7ed6',
'Clothing': '#d6336c',
'Home Goods': '#0ca678',
'Unknown': '#adb5bd'
},
labels={'UnitsSold': 'Total Units Sold'})
# Show plot (in a notebook/compatible environment) or display JSON
# fig.show() # Uncomment this line if running in an environment that supports rendering plots
# Display Plotly JSON for web rendering
plotly_json = fig.to_json()
print("\nPlotly Chart JSON:")
print(plotly_json)
Bar chart displaying the sum of units sold grouped by product category.
Through these tasks, you have practiced:
- Handling missing data (fillna).
- Converting data types (.astype()).
- Using groupby() to summarize information by category.
- Combining DataFrames with pd.merge().
These steps represent a common sequence in data preparation. Mastering these Pandas operations is fundamental for preparing data effectively for subsequent analysis or machine learning model training. Feel free to modify the sample data or apply these techniques to other datasets you encounter.
© 2025 ApX Machine Learning