Data wrangling, the process of cleaning, transforming, and organizing raw data, is a core part of any data analysis or machine learning project. In this section you will practice applying data manipulation operations with Pandas in practical scenarios. Hands-on exercises help solidify your understanding of loading, cleaning, transforming, grouping, and merging data with Pandas. We will work through a series of tasks on a sample dataset, simulating a common workflow you might encounter.

Setting Up the Practice DataFrame

Let's start by creating a sample DataFrame representing product sales data. This dataset includes common issues like missing values and inconsistent data types that we need to address.

```python
import pandas as pd
import numpy as np

# Sample sales data with deliberate quality issues (missing values, string percentages)
data = {
    'ProductID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006', 'P007', 'P008', 'P009', 'P010'],
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing',
                 'Electronics', 'Home Goods', None, 'Clothing', 'Electronics'],
    'Price': [599.99, 49.95, 1200.00, 25.50, 35.00, 750.80, None, 15.00, 55.00, 1150.50],
    'UnitsSold': [50, 120, 15, 200, 150, 25, 80, 300, 110, None],
    'Discount': ['5%', '10%', '5%', 'None', '15%', '7%', '10%', '5%', '12%', '8%']
}

sales_df = pd.DataFrame(data)

# Display initial DataFrame
print("Initial Sales Data:")
print(sales_df)

# Display data types and non-null counts
print("\nInitial DataFrame Info:")
sales_df.info()
```

Task 1: Initial Inspection and Basic Cleaning

First, examine the DataFrame. Use .info() (as shown above) and .describe() to understand its structure, data types, and summary statistics. Identify potential issues like missing values (NaN or None) and incorrect data types.

```python
# Get summary statistics for numerical columns
print("\nSummary Statistics:")
print(sales_df.describe())

# Check for missing values
print("\nMissing Values per Column:")
print(sales_df.isnull().sum())
```

You'll notice missing values in Category, Price, and UnitsSold. The Discount column is also stored as an object (string) instead of a numerical value, which prevents direct calculation.

Task 2: Handling Missing Data

Address the missing values identified in Task 1. Different strategies might be appropriate for different columns.

- Category: Since Category is categorical, we can fill the missing value with a placeholder like 'Unknown' or the mode (most frequent category). Let's use 'Unknown' here; a mode-based alternative is sketched after the code below.
- Price: For numerical columns like Price, filling with the mean or median is common. Given the potential for outliers (like P003), the median might be more appropriate.
- UnitsSold: Let's fill missing UnitsSold with the mean number of units sold, for simplicity in this exercise.

```python
# Fill missing Category with 'Unknown'
# (assigning the result back avoids the chained-assignment warnings that
#  sales_df['Category'].fillna(..., inplace=True) triggers in recent pandas versions)
sales_df['Category'] = sales_df['Category'].fillna('Unknown')

# Calculate median price (excluding NaN) and fill missing Price
median_price = sales_df['Price'].median()
sales_df['Price'] = sales_df['Price'].fillna(median_price)

# Calculate mean units sold (excluding NaN) and fill missing UnitsSold
mean_units_sold = sales_df['UnitsSold'].mean()
sales_df['UnitsSold'] = sales_df['UnitsSold'].fillna(mean_units_sold)

# Verify missing values are handled
print("\nMissing Values After Handling:")
print(sales_df.isnull().sum())
```
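For the Category column, the text above also mentions filling with the mode. Here is a minimal, self-contained sketch of that alternative. The tiny raw_df below is a hypothetical stand-in for the unfilled data (sales_df has already been imputed at this point), so adapt it as needed.

```python
# Hypothetical mini-DataFrame standing in for the raw, unfilled data
raw_df = pd.DataFrame({'Category': ['Electronics', 'Clothing', 'Electronics', None]})

# Series.mode() returns all modes (ties are possible), so take the first entry
most_common = raw_df['Category'].mode().iloc[0]          # 'Electronics'
raw_df['Category'] = raw_df['Category'].fillna(most_common)

print(raw_df['Category'].tolist())  # ['Electronics', 'Clothing', 'Electronics', 'Electronics']
```

Mode imputation keeps the column restricted to categories that actually occur, at the cost of slightly inflating the most frequent one.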
Task 3: Data Transformation and Feature Engineering

Now, let's clean up existing columns and create new, potentially useful ones.

- Convert Discount: Transform the Discount column from a string percentage to a numerical float representing the discount factor (e.g., '5%' becomes 0.05). Handle the 'None' value appropriately, perhaps converting it to a 0% discount.
- Calculate Total Revenue: Create a new column TotalRevenue calculated as Price * UnitsSold.
- Calculate Discounted Revenue: Create another column DiscountedRevenue using TotalRevenue and the numerical discount factor.
- Convert UnitsSold to Integer: Since UnitsSold represents counts, convert it to an integer type after handling missing values.

```python
# Convert Discount string to a numerical float:
# replace 'None' with '0%', remove '%', convert to float, and divide by 100
sales_df['Discount'] = sales_df['Discount'].replace('None', '0%')
sales_df['DiscountFactor'] = sales_df['Discount'].str.replace('%', '').astype(float) / 100.0

# Convert UnitsSold to integer type
sales_df['UnitsSold'] = sales_df['UnitsSold'].astype(int)

# Create TotalRevenue column
sales_df['TotalRevenue'] = sales_df['Price'] * sales_df['UnitsSold']

# Create DiscountedRevenue column (revenue after applying the discount)
sales_df['DiscountedRevenue'] = sales_df['TotalRevenue'] * (1 - sales_df['DiscountFactor'])

# Display the transformed DataFrame structure and sample data
print("\nDataFrame After Transformations:")
sales_df.info()
print("\nSample Data After Transformations:")
print(sales_df.head())
```

Task 4: Grouping and Aggregation

Use the groupby() method to gain insights by aggregating data based on categories.

- Calculate Average Price per Category: Find the mean price for products in each category.
- Calculate Total Units Sold per Category: Sum the number of units sold for each category.
- Multiple Aggregations: Calculate the total revenue and the average discount factor per category simultaneously.

```python
# Average price per category
avg_price_per_category = sales_df.groupby('Category')['Price'].mean()
print("\nAverage Price per Category:")
print(avg_price_per_category)

# Total units sold per category
total_units_per_category = sales_df.groupby('Category')['UnitsSold'].sum()
print("\nTotal Units Sold per Category:")
print(total_units_per_category)

# Multiple aggregations: total revenue and average discount per category
category_summary = sales_df.groupby('Category').agg(
    TotalRevenue=('TotalRevenue', 'sum'),
    AverageDiscount=('DiscountFactor', 'mean')
)
print("\nCategory Summary (Total Revenue & Avg Discount):")
print(category_summary)
```

Task 5: Merging DataFrames

Imagine you have another dataset containing store information for each product. Let's create a small DataFrame for this and merge it with our sales_df.

```python
# Sample store data: which store carries each product
store_data = {
    'StoreID': ['S1', 'S2', 'S1', 'S3', 'S2', 'S3', 'S1', 'S2', 'S3', 'S1'],
    'ProductID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006', 'P007', 'P008', 'P009', 'P010']
}
store_df = pd.DataFrame(store_data)

print("\nStore Information:")
print(store_df)

# Merge sales_df with store_df based on ProductID (a left join keeps every sales row)
full_df = pd.merge(sales_df, store_df, on='ProductID', how='left')

print("\nMerged DataFrame (Sales + Store Info):")
print(full_df.head())
```

This merge adds the StoreID to each product row based on the common ProductID column.
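As a side note beyond the main workflow, pd.merge offers two optional parameters that help verify a join behaved as expected. The sketch below reuses the sales_df and store_df defined above and only inspects match quality; it does not change full_df.

```python
# indicator=True adds a '_merge' column labelling each row 'both', 'left_only', or 'right_only'
check_df = pd.merge(sales_df, store_df, on='ProductID', how='left', indicator=True)
print(check_df['_merge'].value_counts())
# With this sample data every ProductID has a store record, so all rows report 'both'.

# validate='one_to_one' raises an error if ProductID were duplicated on either side,
# a useful guard when merging real-world datasets
checked_df = pd.merge(sales_df, store_df, on='ProductID', how='left', validate='one_to_one')
```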
Task 6: Quick Visualization

Let's create a simple visualization using the grouped data. We can plot the total units sold per category using Plotly Express (often imported as px). This gives a quick visual summary of our aggregated results.

```python
import plotly.express as px

# Prepare data for plotting (convert the previously calculated Series to a DataFrame)
plot_data = total_units_per_category.reset_index()

# Create bar chart
fig = px.bar(plot_data,
             x='Category',
             y='UnitsSold',
             title='Total Units Sold per Product Category',
             color='Category',
             color_discrete_map={  # Optional: use course color palette
                 'Electronics': '#1c7ed6',
                 'Clothing': '#d6336c',
                 'Home Goods': '#0ca678',
                 'Unknown': '#adb5bd'
             },
             labels={'UnitsSold': 'Total Units Sold'})

# Show plot (in a notebook/compatible environment) or export its JSON for web rendering
# fig.show()  # Uncomment this line if running in an environment that supports rendering plots
plotly_json = fig.to_json()
```

[Bar chart displaying the sum of units sold grouped by product category.]

Practice Summary

Through these tasks, you have practiced:

- Inspecting a DataFrame to understand its contents and identify issues.
- Handling missing data using different imputation strategies (fillna).
- Cleaning data types and transforming columns (e.g., string manipulation, type conversion with .astype()).
- Creating new features from existing data (feature engineering).
- Aggregating data using groupby() to summarize information by category.
- Combining datasets using pd.merge().
- Generating a simple visualization from processed data.

These steps represent a common sequence in data preparation. Mastering these Pandas operations is fundamental for preparing data effectively for subsequent analysis or machine learning model training. Feel free to modify the sample data or apply these techniques to other datasets you encounter.