Once you have created a GroupBy object by calling groupby() on a DataFrame (as discussed in the previous section, "Grouping Data with groupby()"), the next step is to apply a function to each group to calculate a summary statistic. This is the "apply" part of the "split-apply-combine" pattern.
Pandas GroupBy objects come equipped with several built-in aggregation methods that work very similarly to their counterparts on Series and DataFrames. These methods automatically operate on each group independently and then combine the results into a new Series or DataFrame.
Let's consider a simple DataFrame representing sales data for different products across various regions:
import pandas as pd
import numpy as np
# Sample data
data = {'Region': ['North', 'South', 'North', 'South', 'East', 'East', 'North'],
'Product': ['A', 'A', 'B', 'B', 'A', 'C', 'B'],
'Sales': [100, 150, 200, 50, 120, 80, 180],
'Quantity': [10, 15, 20, 5, 12, 8, 15]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Original DataFrame:
Region Product Sales Quantity
0 North A 100 10
1 South A 150 15
2 North B 200 20
3 South B 50 5
4 East A 120 12
5 East C 80 8
6 North B 180 15
Now, let's group this data by 'Region':
grouped_by_region = df.groupby('Region')
The grouped_by_region object now holds the separated groups, but we haven't computed anything yet.
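If you want to peek at what the split produced before computing anything, you can inspect the GroupBy object directly. This optional check just confirms that grouping is lazy:
# The GroupBy object itself only describes the split; no computation has happened
print(grouped_by_region)

# Mapping of each group label to the row labels it contains
print(grouped_by_region.groups)

# Pull out a single group as an ordinary DataFrame
print(grouped_by_region.get_group('North'))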
You can apply common aggregation functions directly to the GroupBy object. Pandas applies the function to the appropriate columns (usually the numeric ones) within each group.
Sum (.sum()): Calculates the sum of values for each group.
# Calculate total sales and quantity per region
region_totals = grouped_by_region.sum(numeric_only=True)
print("\nTotal Sales and Quantity per Region:")
print(region_totals)
Total Sales and Quantity per Region:
Sales Quantity
Region
East 200 20
North 480 45
South 200 20
Notice the output is a new DataFrame where the index is the grouping key ('Region'), and the columns are the numeric columns from the original DataFrame ('Sales', 'Quantity') containing the summed values for each region. The numeric_only=True argument excludes the non-numeric 'Product' column. Older pandas versions dropped such "nuisance" columns automatically; since pandas 2.0 you need to exclude them explicitly, otherwise .sum() will try to concatenate the 'Product' strings and .mean() will raise an error.
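If you prefer not to pass numeric_only=True, selecting the numeric columns from the GroupBy object before aggregating gives the same result:
# Equivalent: keep only the numeric columns, then sum each group
region_totals_alt = grouped_by_region[['Sales', 'Quantity']].sum()
print(region_totals_alt)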
Mean (.mean()): Computes the average value for each group.
# Calculate average sales and quantity per region
region_means = grouped_by_region.mean(numeric_only=True)
print("\nAverage Sales and Quantity per Region:")
print(region_means)
Average Sales and Quantity per Region:
Sales Quantity
Region
East 100.000000 10.000000
North 160.000000 15.000000
South 100.000000 10.000000
Again, the output index is 'Region', and the values represent the mean of 'Sales' and 'Quantity' for rows belonging to each region.
Count (.count()): Counts the number of non-null entries in each column for each group.
# Count the number of entries per region for each column
region_counts = grouped_by_region.count()
print("\nCount of Entries per Region:")
print(region_counts)
Count of Entries per Region:
Product Sales Quantity
Region
East 2 2 2
North 3 3 3
South 2 2 2
Here, count() includes the 'Product' column because it counts any non-missing value, regardless of data type. Since this DataFrame has no missing values, the counts simply show how many records contributed to each region's group.
Size (.size()): Returns the total number of rows in each group, counting rows with missing values as well (unlike count()).
# Get the total number of rows (size) per region
region_sizes = grouped_by_region.size()
print("\nSize of Each Region Group:")
print(region_sizes)
Size of Each Region Group:
Region
East 2
North 3
South 2
dtype: int64
The output of size() is a Pandas Series, where the index is the grouping key ('Region') and the values are the number of rows belonging to that group.
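The difference between count() and size() only shows up when a group contains missing values. The short sketch below modifies a copy of the sample data, setting one 'North' Sales value to NaN purely to illustrate the distinction:
# Copy the data and introduce a missing Sales value in the 'North' group
df_missing = df.copy()
df_missing['Sales'] = df_missing['Sales'].astype('float64')  # allow NaN in an integer column
df_missing.loc[0, 'Sales'] = np.nan

grouped_missing = df_missing.groupby('Region')
print(grouped_missing['Sales'].count())  # North drops to 2: the NaN is not counted
print(grouped_missing.size())            # North is still 3: every row is counted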
Minimum (.min()) and Maximum (.max()): Find the minimum or maximum value within each group for each applicable column.
# Find the minimum sales value in each region
region_min_sales = grouped_by_region['Sales'].min() # Apply to a specific column
print("\nMinimum Sales per Region:")
print(region_min_sales)
# Find the maximum quantity in each region
region_max_quantity = grouped_by_region['Quantity'].max()
print("\nMaximum Quantity per Region:")
print(region_max_quantity)
Minimum Sales per Region:
Region
East 80
North 100
South 50
Name: Sales, dtype: int64
Maximum Quantity per Region:
Region
East 12
North 20
South 15
Name: Quantity, dtype: int64
In these examples, we first selected a specific column (['Sales'] or ['Quantity']) from the GroupBy object before applying the aggregation. This results in a Series where the index is the group key ('Region') and the values are the min/max from the selected column for that group. If you apply .min() or .max() directly to the GroupBy object without selecting a column first, it will compute the min/max for every column that supports comparison (here that includes the string 'Product' column, whose values are compared lexicographically), similar to .sum() or .mean().
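For instance, restricting the GroupBy object to the two numeric columns returns both minima in a single DataFrame:
# Minimum Sales and Quantity per region, computed in one call
print(grouped_by_region[['Sales', 'Quantity']].min())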
As seen with .min() and .max(), you can apply aggregation functions to specific columns after grouping. This is useful when you only need summaries for certain features.
# Calculate the total sales per region
total_sales_per_region = df.groupby('Region')['Sales'].sum()
print("\nTotal Sales per Region (Specific Column):")
print(total_sales_per_region)
# Calculate the average quantity per product
avg_quantity_per_product = df.groupby('Product')['Quantity'].mean()
print("\nAverage Quantity per Product:")
print(avg_quantity_per_product)
Total Sales per Region (Specific Column):
Region
East 200
North 480
South 200
Name: Sales, dtype: int64
Average Quantity per Product:
Product
A 12.333333
B 13.333333
C 8.000000
Name: Quantity, dtype: float64
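The same column-selection pattern works with the other built-in aggregations, such as the standard deviation and the median:
# Standard deviation of sales within each region
print(df.groupby('Region')['Sales'].std())

# Median quantity for each product
print(df.groupby('Product')['Quantity'].median())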
Selecting the column before the aggregation (df.groupby('Region')['Sales'].sum()) is generally more efficient than calculating aggregations for all columns and then selecting the one you need (df.groupby('Region').sum()['Sales']), especially with large datasets.
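You can verify this on your own machine with a quick, informal timing comparison. The sketch below uses a synthetic DataFrame whose column names and size are made up for illustration:
import timeit

# Build a larger synthetic DataFrame (one million rows, arbitrary values)
rng = np.random.default_rng(0)
n = 1_000_000
big = pd.DataFrame({
    'Region': rng.choice(['North', 'South', 'East', 'West'], size=n),
    'Sales': rng.integers(0, 500, size=n),
    'Quantity': rng.integers(0, 50, size=n),
    'Extra': rng.random(n),
})

# Aggregate only the column we need...
t_selected = timeit.timeit(lambda: big.groupby('Region')['Sales'].sum(), number=10)
# ...versus aggregating every column and then selecting one
t_all = timeit.timeit(lambda: big.groupby('Region').sum()['Sales'], number=10)

print(f"Select column first: {t_selected:.3f} s")
print(f"Aggregate all, then select: {t_all:.3f} s")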
These basic aggregation functions (sum, mean, count, size, min, max, std, var, median, etc.) cover many common data summarization tasks. They form the foundation for understanding group-wise operations in Pandas. In the next section, we will explore how to apply multiple aggregation functions at once.