Let's put the concepts you've learned about grouping and aggregation into practice. We'll work through several examples using a sample dataset to solidify your understanding of the split-apply-combine pattern with Pandas `groupby()`.
First, make sure you have Pandas imported. If you're following along in a Jupyter Notebook or Python script, start with this:
```python
import pandas as pd
import numpy as np  # Often useful alongside Pandas
```
Now, let's create a simple DataFrame representing sales data for different product categories across various regions. This will be our data for the exercises.
```python
# Create the sample data
data = {'Category': ['Electronics', 'Clothing', 'Electronics', 'Groceries',
                     'Clothing', 'Groceries', 'Electronics', 'Clothing'],
        'Region': ['North', 'South', 'North', 'East', 'West', 'East', 'West', 'South'],
        'Sales': [1200, 300, 1500, 250, 450, 200, 1800, 350],
        'Quantity': [2, 3, 1, 5, 4, 4, 3, 4]}
sales_df = pd.DataFrame(data)

print("Sample Sales Data:")
print(sales_df)
```
This should output:
```
Sample Sales Data:
      Category Region  Sales  Quantity
0  Electronics  North   1200         2
1     Clothing  South    300         3
2  Electronics  North   1500         1
3    Groceries   East    250         5
4     Clothing   West    450         4
5    Groceries   East    200         4
6  Electronics   West   1800         3
7     Clothing  South    350         4
```
Our first task is to find the total sales for each product `Category`. This involves splitting the DataFrame by `Category`, applying the `sum()` function to the `Sales` column within each group, and combining the results.
```python
# Group by 'Category' and calculate the sum of 'Sales' for each category
category_sales_sum = sales_df.groupby('Category')['Sales'].sum()

print("\nTotal Sales per Category:")
print(category_sales_sum)
```
Output:
```
Total Sales per Category:
Category
Clothing       1100
Electronics    4500
Groceries       450
Name: Sales, dtype: int64
```
Notice that the result is a Pandas Series, where the index holds the `Category` values we grouped by and the values are the summed `Sales`.
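If you'd rather get a regular DataFrame back, you can reset the index afterwards, or pass `as_index=False` to `groupby()`. A quick sketch using the same sample data (only the two relevant columns are rebuilt here):

```python
import pandas as pd

# Same 'Category' and 'Sales' values as the sample data above
sales_df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Groceries',
                 'Clothing', 'Groceries', 'Electronics', 'Clothing'],
    'Sales': [1200, 300, 1500, 250, 450, 200, 1800, 350],
})

# Option 1: aggregate to a Series, then turn the index into a column
by_reset = sales_df.groupby('Category')['Sales'].sum().reset_index()

# Option 2: keep 'Category' as a regular column from the start
by_flag = sales_df.groupby('Category', as_index=False)['Sales'].sum()

print(by_reset)
```

Both options produce the same two-column DataFrame, which is often more convenient for merging or plotting.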
What if we want the average quantity sold per category?
```python
# Group by 'Category' and calculate the mean of 'Quantity'
category_quantity_avg = sales_df.groupby('Category')['Quantity'].mean()

print("\nAverage Quantity per Category:")
print(category_quantity_avg)
```
Output:
```
Average Quantity per Category:
Category
Clothing       3.666667
Electronics    2.000000
Groceries      4.500000
Name: Quantity, dtype: float64
```
You can apply other aggregation functions similarly, such as `count()`, `size()`, `min()`, `max()`, `std()` (standard deviation), and so on. Remember that `count()` excludes missing values (NaN), while `size()` includes them.
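The distinction only shows up when a group contains missing values. Here is a tiny illustrative frame (not part of the sales data) that makes it visible:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Group': ['A', 'A', 'B'],
                   'Value': [1.0, np.nan, 3.0]})

counts = df.groupby('Group')['Value'].count()  # ignores the NaN in group A
sizes = df.groupby('Group')['Value'].size()    # counts every row, NaN included

print(counts['A'], sizes['A'])  # 1 2
```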
Often, you'll want to calculate several summary statistics for each group at once. The `.agg()` method is perfect for this. Let's calculate the total sales (`sum`) and the average quantity (`mean`) for each `Category`.
```python
# Group by 'Category' and apply multiple aggregations
category_summary = sales_df.groupby('Category').agg(
    Total_Sales=('Sales', 'sum'),
    Average_Quantity=('Quantity', 'mean'),
    Number_of_Sales=('Sales', 'count')  # Count non-missing sales entries
)

print("\nSummary Statistics per Category:")
print(category_summary)
```
Output:
```
Summary Statistics per Category:
             Total_Sales  Average_Quantity  Number_of_Sales
Category
Clothing            1100          3.666667                3
Electronics         4500          2.000000                3
Groceries            450          4.500000                2
```
Here, we used named aggregation: each keyword argument passed to `.agg()` ('Total_Sales', 'Average_Quantity', 'Number_of_Sales') becomes a new column name in the resulting DataFrame. Each value is a tuple whose first element is the original column to aggregate ('Sales' or 'Quantity') and whose second element is the aggregation function ('sum', 'mean', 'count'). This provides a clear and organized summary.
Alternatively, you can pass a list of functions if you want to apply the same set of aggregations to all numerical columns (or a selected column):
```python
# Calculate min and max sales for each category
category_sales_min_max = sales_df.groupby('Category')['Sales'].agg(['min', 'max'])

print("\nMin and Max Sales per Category:")
print(category_sales_min_max)
```
Output:
```
Min and Max Sales per Category:
              min   max
Category
Clothing      300   450
Electronics  1200  1800
Groceries     200   250
```
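Applied across several columns at once, the same list of functions produces hierarchical column labels: one level for the original column and one for the function. A sketch with the same sample data:

```python
import pandas as pd

# Same sample data as above (the 'Region' column is omitted for brevity)
sales_df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Groceries',
                 'Clothing', 'Groceries', 'Electronics', 'Clothing'],
    'Sales': [1200, 300, 1500, 250, 450, 200, 1800, 350],
    'Quantity': [2, 3, 1, 5, 4, 4, 3, 4],
})

# min and max of both numeric columns per category
summary = sales_df.groupby('Category')[['Sales', 'Quantity']].agg(['min', 'max'])
print(summary)
# Columns form a MultiIndex: ('Sales', 'min'), ('Sales', 'max'), ...
```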
We can create more granular groups by passing a list of column names to `groupby()`. Let's find the total sales for each combination of `Region` and `Category`.
```python
# Group by both 'Region' and 'Category', then sum 'Sales'
region_category_sales = sales_df.groupby(['Region', 'Category'])['Sales'].sum()

print("\nTotal Sales per Region and Category:")
print(region_category_sales)
```
Output:
```
Total Sales per Region and Category:
Region  Category
East    Groceries       450
North   Electronics    2700
South   Clothing        650
West    Clothing        450
        Electronics    1800
Name: Sales, dtype: int64
```
The result is a Series with a `MultiIndex` (`Region`, `Category`). This hierarchical index represents the combinations of the groups.
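A MultiIndex result can be indexed with tuples via `.loc`, or reshaped so one level becomes columns with `.unstack()`. A short sketch, rebuilding the relevant columns of the sample data:

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Groceries',
                 'Clothing', 'Groceries', 'Electronics', 'Clothing'],
    'Region': ['North', 'South', 'North', 'East', 'West', 'East', 'West', 'South'],
    'Sales': [1200, 300, 1500, 250, 450, 200, 1800, 350],
})

totals = sales_df.groupby(['Region', 'Category'])['Sales'].sum()

# Look up a single (Region, Category) combination
west_electronics = totals.loc[('West', 'Electronics')]

# Pivot the 'Category' level into columns; absent combinations become NaN
table = totals.unstack('Category')
print(table)
```

`unstack()` is handy when you want a compact Region-by-Category table rather than a long hierarchical Series.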
You can apply multiple aggregations here as well:
```python
# Group by 'Region' and 'Category', calculate sum of Sales and mean Quantity
region_category_summary = sales_df.groupby(['Region', 'Category']).agg(
    Total_Sales=('Sales', 'sum'),
    Average_Quantity=('Quantity', 'mean')
)

print("\nSummary per Region and Category:")
print(region_category_summary)
```
Output:
```
Summary per Region and Category:
                    Total_Sales  Average_Quantity
Region Category
East   Groceries            450               4.5
North  Electronics         2700               1.5
South  Clothing             650               3.5
West   Clothing             450               4.0
       Electronics         1800               3.0
```
This gives us a DataFrame with the `MultiIndex` on the rows.
Aggregated data is often best understood visually. Let's create a simple bar chart of the total sales per category using Plotly.
*Figure: Total sales generated by each product category. Electronics clearly dominates in this small dataset.*
Now, try these on your own using the `sales_df` DataFrame:

1. Calculate the total `Sales` for each `Region`.
2. Count the number of sales entries (using `.size()`) for each `Region`.
3. Find the maximum quantity sold within each `Region` and `Category` combination.
4. Compute both the average `Sales` and the sum of `Quantity` for each `Region`. Use the `.agg()` method.

These exercises reinforce the different ways you can apply `groupby()`
and aggregation functions to extract meaningful summaries from your data. Mastering grouping is a significant step in data analysis with Pandas.
© 2025 ApX Machine Learning