Often, analyzing data requires segmenting it based on combinations of categories rather than just a single one. For instance, you might want to calculate average sales not just per region, but per product within each region. Pandas groupby()
makes this straightforward by allowing you to group by multiple columns simultaneously.
Grouping by a single column gives you summaries for broad categories. Grouping by multiple columns allows for a more detailed, hierarchical analysis. It helps answer questions like:
This multi-level grouping provides finer-grained insights compared to single-level aggregation.
To group by multiple columns, you simply pass a list of column names to the groupby()
method instead of a single string.
Let's set up a sample DataFrame to illustrate:
import pandas as pd
import numpy as np
# Sample sales data
data = {'Region': ['North', 'South', 'North', 'South', 'North', 'South', 'North', 'South'],
'Product': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
'Sales': [100, 150, 200, 50, 120, 180, 210, 80],
'Quantity': [10, 15, 20, 5, 12, 18, 21, 8]}
df_sales = pd.DataFrame(data)
print(df_sales)
Output:
Region Product Sales Quantity
0 North A 100 10
1 South A 150 15
2 North B 200 20
3 South B 50 5
4 North A 120 12
5 South A 180 18
6 North B 210 21
7 South B 80 8
Now, let's group by both 'Region' and 'Product' and calculate the total sales for each combination:
# Group by 'Region' and 'Product'
grouped_multi = df_sales.groupby(['Region', 'Product'])
# Calculate total sales for each group
total_sales_multi = grouped_multi['Sales'].sum()
print(total_sales_multi)
Output:
Region Product
North A 220
B 410
South A 330
B 130
Name: Sales, dtype: int64
Notice the output of the aggregation (total_sales_multi
). The index is no longer a simple list of labels. Instead, it has two levels: 'Region' and 'Product'. This is called a MultiIndex
(or hierarchical index) in Pandas. It represents the unique combinations of the grouping columns.
The GroupBy
object created by df_sales.groupby(['Region', 'Product'])
partitions the original DataFrame based on the unique pairs of ('North', 'A'), ('North', 'B'), ('South', 'A'), and ('South', 'B'). The sum()
aggregation is then applied to the 'Sales' column within each of these partitions.
Just like with single-column grouping, you can apply aggregations to multiple columns or use multiple aggregation functions when grouping by multiple columns.
For example, let's find the total 'Sales' and the average 'Quantity' for each Region-Product combination:
# Apply different aggregations to different columns
agg_results = grouped_multi.agg(
Total_Sales=('Sales', 'sum'),
Average_Quantity=('Quantity', 'mean')
)
print(agg_results)
Output:
Total_Sales Average_Quantity
Region Product
North A 220 11.0
B 410 20.5
South A 330 16.5
B 130 6.5
The result is a DataFrame where the index is the MultiIndex
('Region', 'Product'), and the columns represent the results of the specified aggregations ('Total_Sales', 'Average_Quantity').
You can also apply multiple aggregation functions to the same column:
# Apply multiple aggregations to the 'Sales' column
sales_stats = grouped_multi['Sales'].agg(['sum', 'mean', 'count'])
print(sales_stats)
Output:
sum mean count
Region Product
North A 220 110.0 2
B 410 205.0 2
South A 330 165.0 2
B 130 65.0 2
This gives us the sum, mean, and count of sales for each unique ('Region', 'Product') pair.
Grouping by multiple columns is a powerful technique for performing detailed categorical analysis, allowing you to summarize your data across combinations of factors. The resulting MultiIndex
structure efficiently represents these hierarchical groupings.
© 2025 ApX Machine Learning