Often, analyzing data requires segmenting it based on combinations of categories rather than just a single one. For instance, you might want to calculate average sales not just per region, but per product within each region. Pandas groupby() makes this straightforward by allowing you to group by multiple columns simultaneously.

Why Group by More Than One Column?

Grouping by a single column gives you summaries for broad categories. Grouping by multiple columns allows for a more detailed, hierarchical analysis. It helps answer questions like:

What is the average rainfall per month and per city?
How many units of each product were sold in each specific store?
What is the maximum score achieved by students in each grade level within each school?

This multi-level grouping provides finer-grained insights compared to single-level aggregation.

The Syntax

To group by multiple columns, you simply pass a list of column names to the groupby() method instead of a single string.

Let's set up a sample DataFrame to illustrate:

import pandas as pd
import numpy as np

# Sample sales data
data = {'Region': ['North', 'South', 'North', 'South', 'North', 'South', 'North', 'South'],
        'Product': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        'Sales': [100, 150, 200, 50, 120, 180, 210, 80],
        'Quantity': [10, 15, 20, 5, 12, 18, 21, 8]}
df_sales = pd.DataFrame(data)

print(df_sales)

Output:

  Region Product  Sales  Quantity
0  North       A    100        10
1  South       A    150        15
2  North       B    200        20
3  South       B     50         5
4  North       A    120        12
5  South       A    180        18
6  North       B    210        21
7  South       B     80         8

Now, let's group by both 'Region' and 'Product' and calculate the total sales for each combination:

# Group by 'Region' and 'Product'
grouped_multi = df_sales.groupby(['Region', 'Product'])

# Calculate total sales for each group
total_sales_multi = grouped_multi['Sales'].sum()

print(total_sales_multi)

Output:

Region  Product
North   A          220
        B          410
South   A          330
        B          130
Name: Sales, dtype: int64

Understanding the Result: MultiIndex

Notice the output of the aggregation (total_sales_multi). The index is no longer a simple list of labels. Instead, it has two levels: 'Region' and 'Product'. This is called a MultiIndex (or hierarchical index) in Pandas. It represents the unique combinations of the grouping columns.

The GroupBy object created by df_sales.groupby(['Region', 'Product']) partitions the original DataFrame based on the unique pairs of ('North', 'A'), ('North', 'B'), ('South', 'A'), and ('South', 'B'). The sum() aggregation is then applied to the 'Sales' column within each of these partitions.

Aggregating Multiple Columns or Using Multiple Functions

Just like with single-column grouping, you can apply aggregations to multiple columns or use multiple aggregation functions when grouping by multiple columns.

For example, let's find the total 'Sales' and the average 'Quantity' for each Region-Product combination:

# Apply different aggregations to different columns
agg_results = grouped_multi.agg(
    Total_Sales=('Sales', 'sum'),
    Average_Quantity=('Quantity', 'mean')
)

print(agg_results)

Output:

                Total_Sales  Average_Quantity
Region Product
North  A                220              11.0
       B                410              20.5
South  A                330              16.5
       B                130               6.5

The result is a DataFrame where the index is the MultiIndex ('Region', 'Product'), and the columns represent the results of the specified aggregations ('Total_Sales', 'Average_Quantity').

You can also apply multiple aggregation functions to the same column:

# Apply multiple aggregations to the 'Sales' column
sales_stats = grouped_multi['Sales'].agg(['sum', 'mean', 'count'])

print(sales_stats)

Output:

                sum   mean  count
Region Product
North  A        220  110.0      2
       B        410  205.0      2
South  A        330  165.0      2
       B        130   65.0      2

This gives us the sum, mean, and count of sales for each unique ('Region', 'Product') pair.

Grouping by multiple columns is a powerful technique for performing detailed categorical analysis, allowing you to summarize your data across combinations of factors. The resulting MultiIndex structure efficiently represents these hierarchical groupings.