Often, when analyzing grouped data, calculating a single summary statistic like the mean or sum isn't enough. You might want to see several different summaries for each group simultaneously. For instance, for each product category, you might want to know both the total sales and the average sale amount. Pandas provides a flexible way to achieve this using the agg()
method on a GroupBy object.
Let's start with a sample DataFrame representing some sales data:
import pandas as pd
import numpy as np
data = {'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Groceries', 'Electronics', 'Groceries'],
'Product': ['Laptop', 'T-Shirt', 'Mouse', 'Jeans', 'Apples', 'Keyboard', 'Bananas'],
'Sales': [1200, 25, 20, 50, 5, 75, 3],
'Quantity': [1, 2, 1, 1, 10, 1, 8]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Group by Category
grouped = df.groupby('Category')
Running this will show our starting data:
Original DataFrame:
Category Product Sales Quantity
0 Electronics Laptop 1200 1
1 Clothing T-Shirt 25 2
2 Electronics Mouse 20 1
3 Clothing Jeans 50 1
4 Groceries Apples 5 10
5 Electronics Keyboard 75 1
6 Groceries Bananas 3 8
The simplest way to apply multiple aggregation functions is to pass a list of function names (as strings) to the agg()
method. Pandas will apply each function in the list to every numeric column in the grouped selection.
# Calculate sum and mean for Sales and Quantity within each Category
multi_agg_list = grouped[['Sales', 'Quantity']].agg(['sum', 'mean'])
print("\nMultiple aggregations using a list:")
print(multi_agg_list)
The output shows the sum and mean calculated for both the 'Sales' and 'Quantity' columns within each category:
Multiple aggregations using a list:
Sales Quantity
sum mean sum mean
Category
Clothing 75 37.50 3 1.5
Electronics 1295 431.67 3 1.0
Groceries 8 4.00 18 9.0
Notice the result has hierarchical column labels (MultiIndex). The top level indicates the original column ('Sales', 'Quantity'), and the second level indicates the aggregation function ('sum', 'mean').
What if you want to apply different functions to different columns? For example, maybe you want the total sales (sum
) but the average quantity (mean
) per category. You can achieve this by passing a dictionary to agg()
. The dictionary keys should be the column names you want to aggregate, and the values should be the function (or list of functions) to apply to that specific column.
# Calculate sum of Sales and mean of Quantity per Category
multi_agg_dict = grouped.agg({
'Sales': 'sum', # Apply sum to the 'Sales' column
'Quantity': 'mean' # Apply mean to the 'Quantity' column
})
print("\nMultiple aggregations using a dictionary:")
print(multi_agg_dict)
This gives a cleaner output without hierarchical columns, as each specified aggregation results in a single output column:
Multiple aggregations using a dictionary:
Sales Quantity
Category
Clothing 75 1.5
Electronics 1295 1.0
Groceries 8 9.0
You can also apply multiple functions to a specific column using a list within the dictionary:
# Calculate sum and mean for Sales, and just the sum for Quantity
multi_agg_dict_list = grouped.agg({
'Sales': ['sum', 'mean'], # Apply sum and mean to 'Sales'
'Quantity': 'sum' # Apply sum to 'Quantity'
})
print("\nDictionary aggregation with a list of functions:")
print(multi_agg_dict_list)
The output now has hierarchical columns only for 'Sales', where multiple functions were applied:
Dictionary aggregation with a list of functions:
Sales Quantity
sum mean sum
Category
Clothing 75 37.50 3
Electronics 1295 431.67 3
Groceries 8 4.00 18
While the dictionary method is useful, managing potentially complex hierarchical column names can become cumbersome. A more modern and often clearer approach is to use named aggregations. This lets you explicitly define the names of the output columns.
You pass keyword arguments to agg()
, where the keyword is your desired output column name. The value associated with each keyword is a tuple containing (column_name_to_aggregate, aggregation_function)
.
# Calculate Total Sales, Average Quantity, and Product Count per Category
named_agg = grouped.agg(
Total_Sales = pd.NamedAgg(column='Sales', aggfunc='sum'),
Avg_Quantity = pd.NamedAgg(column='Quantity', aggfunc='mean'),
Num_Products = pd.NamedAgg(column='Product', aggfunc='count') # Can aggregate non-numeric too
)
print("\nNamed aggregations for clearer output columns:")
print(named_agg)
This produces a DataFrame with clearly named columns reflecting the specific aggregations performed:
Named aggregations for clearer output columns:
Total_Sales Avg_Quantity Num_Products
Category
Clothing 75 1.5 2
Electronics 1295 1.0 3
Groceries 8 9.0 2
This named aggregation syntax (pd.NamedAgg
) is generally recommended when applying multiple, potentially different, aggregations, as it leads to more readable code and predictable output structures.
You can even apply custom functions defined using def
or lambda expressions within agg()
, although for standard statistics, the built-in function names (as strings) are usually sufficient and more efficient.
Being able to calculate multiple summary statistics per group using agg()
is a very common and useful pattern in data analysis, allowing you to quickly generate insightful summaries from your grouped data.
© 2025 ApX Machine Learning