After loading, cleaning, and selecting data with Pandas, a frequent next step in data analysis and preparation for machine learning involves summarizing data based on certain categories or criteria. This is where grouping and aggregation operations become indispensable. Pandas provides a powerful and flexible groupby
mechanism that allows you to efficiently perform these tasks.
The core idea behind grouping operations often follows a pattern known as split-apply-combine:

- Split: the data is divided into groups based on one or more keys.
- Apply: a function (such as an aggregation or transformation) is computed on each group independently.
- Combine: the per-group results are merged back into a single Series or DataFrame.

Let's explore how this works in Pandas.
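For intuition, the split-apply-combine pattern can be sketched in plain Python before turning to pandas. This is a toy illustration only; pandas implements the same idea far more efficiently.

```python
# Toy illustration of split-apply-combine in plain Python
data = [
    ("Electronics", 1200), ("Clothing", 850), ("Electronics", 1500),
    ("Groceries", 300), ("Clothing", 700),
]

# Split: partition the rows by key
groups = {}
for category, sales in data:
    groups.setdefault(category, []).append(sales)

# Apply: compute a summary statistic per group
totals = {category: sum(values) for category, values in groups.items()}

# Combine: the per-group results form one mapping
print(totals)
# {'Electronics': 2700, 'Clothing': 1550, 'Groceries': 300}
```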
The groupby() Method

The primary tool for grouping in Pandas is the .groupby() method. When called on a DataFrame, it doesn't immediately compute anything; instead, it returns a GroupBy object. This object contains all the information needed to apply operations to each group.
Consider a sample DataFrame representing sales data:
import pandas as pd
import numpy as np
# Sample sales data
data = {
'Category': ['Electronics', 'Clothing', 'Electronics', 'Groceries', 'Clothing', 'Groceries', 'Electronics', 'Clothing'],
'Region': ['North', 'South', 'North', 'West', 'North', 'South', 'West', 'South'],
'Sales': [1200, 850, 1500, 300, 700, 450, 1800, 950],
'Quantity': [10, 25, 8, 50, 15, 60, 12, 30]
}
df = pd.DataFrame(data)
print(df)
# Output:
# Category Region Sales Quantity
# 0 Electronics North 1200 10
# 1 Clothing South 850 25
# 2 Electronics North 1500 8
# 3 Groceries West 300 50
# 4 Clothing North 700 15
# 5 Groceries South 450 60
# 6 Electronics West 1800 12
# 7 Clothing South 950 30
To group this data by Category, you would use:
grouped_by_category = df.groupby('Category')
print(grouped_by_category)
# Output: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>
This grouped_by_category object is now ready for the "apply" step. You can inspect the groups it has identified:
# See the indices belonging to each group
print(grouped_by_category.groups)
# Output:
# {'Clothing': [1, 4, 7], 'Electronics': [0, 2, 6], 'Groceries': [3, 5]}
# Get a specific group as a DataFrame
print(grouped_by_category.get_group('Electronics'))
# Output:
# Category Region Sales Quantity
# 0 Electronics North 1200 10
# 2 Electronics North 1500 8
# 6 Electronics West 1800 12
Aggregation involves computing a summary statistic (like sum, mean, or count) for each group. Once you have a GroupBy object, you can apply aggregation functions directly.
To calculate the total sales for each category:
total_sales_per_category = grouped_by_category['Sales'].sum()
print(total_sales_per_category)
# Output:
# Category
# Clothing 2500
# Electronics 4500
# Groceries 750
# Name: Sales, dtype: int64
Here, we first selected the Sales column from the GroupBy object and then applied the sum() aggregation. Pandas automatically aligns the results with the group names.
You can apply various built-in aggregation functions:
- count(): Number of non-null entries in each group.
- size(): Total number of entries (including nulls) in each group. Returns a Series.
- sum(): Sum of values.
- mean(): Average of values.
- median(): Median of values.
- min(): Minimum value.
- max(): Maximum value.
- std(): Standard deviation.
- var(): Variance.
- first(): First value.
- last(): Last value.

If you apply an aggregation function directly to the GroupBy object without selecting a column first, Pandas will attempt to apply it to all numeric columns:
# Calculate mean sales and quantity per category
mean_values_per_category = grouped_by_category.mean(numeric_only=True)
print(mean_values_per_category)
# Output:
# Sales Quantity
# Category
# Clothing 833.333333 23.333333
# Electronics 1500.000000 10.000000
# Groceries 375.000000 55.000000
Note the use of numeric_only=True to avoid warnings or errors on non-numeric columns like 'Region'.
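Among the built-in aggregations listed above, count() and size() differ in a subtle way that is easy to miss: count() skips missing values, while size() counts every row. A small sketch (with made-up data containing a NaN) makes the distinction concrete:

```python
import pandas as pd
import numpy as np

# Small frame with one missing value to contrast count() and size()
df = pd.DataFrame({
    "Category": ["A", "A", "B", "B"],
    "Sales": [100, np.nan, 200, 300],
})

g = df.groupby("Category")
print(g["Sales"].count())  # non-null entries: A -> 1, B -> 2
print(g.size())            # all rows per group:  A -> 2, B -> 2
```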
Often, you need to compute multiple summary statistics for each group. The .agg() method provides a flexible way to do this.
To calculate the sum, mean, and count of sales per category:
sales_summary = grouped_by_category['Sales'].agg(['sum', 'mean', 'count'])
print(sales_summary)
# Output:
# sum mean count
# Category
# Clothing 2500 833.333333 3
# Electronics 4500 1500.000000 3
# Groceries 750 375.000000 2
You can also apply different aggregation functions to different columns by passing a dictionary to .agg(). The keys are the column names, and the values are the aggregation functions (or lists of functions) to apply to those columns.
# Calculate total sales and average quantity per category
custom_summary = grouped_by_category.agg({
'Sales': 'sum',
'Quantity': 'mean'
})
print(custom_summary)
# Output:
# Sales Quantity
# Category
# Clothing 2500 23.333333
# Electronics 4500 10.000000
# Groceries 750 55.000000
You can even rename the resulting aggregated columns within the .agg() call:
# Calculate total sales and average quantity, renaming columns
named_summary = grouped_by_category.agg(
TotalSales=('Sales', 'sum'),
AverageQuantity=('Quantity', 'mean')
)
print(named_summary)
# Output:
# TotalSales AverageQuantity
# Category
# Clothing 2500 23.333333
# Electronics 4500 10.000000
# Groceries 750 55.000000
You can group by more than one column by passing a list of column names to .groupby(). This creates a hierarchical index (MultiIndex) in the result.
# Group by both Category and Region
grouped_multi = df.groupby(['Category', 'Region'])
# Calculate mean sales for each category-region combination
mean_sales_multi = grouped_multi['Sales'].mean()
print(mean_sales_multi)
# Output:
# Category Region
# Clothing North 700.0
# South 900.0
# Electronics North 1350.0
# West 1800.0
# Groceries South 450.0
# West 300.0
# Name: Sales, dtype: float64
To work more easily with the result, you can reset the index:
mean_sales_multi_flat = grouped_multi['Sales'].mean().reset_index()
print(mean_sales_multi_flat)
# Output:
# Category Region Sales
# 0 Clothing North 700.0
# 1 Clothing South 900.0
# 2 Electronics North 1350.0
# 3 Electronics West 1800.0
# 4 Groceries South 450.0
# 5 Groceries West 300.0
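Instead of flattening with reset_index, another common option is to pivot one level of the MultiIndex into columns with unstack(); category-region combinations that never occur become NaN. A self-contained sketch using the same sales data:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Groceries',
                 'Clothing', 'Groceries', 'Electronics', 'Clothing'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'South', 'West', 'South'],
    'Sales': [1200, 850, 1500, 300, 700, 450, 1800, 950],
})

mean_sales = df.groupby(['Category', 'Region'])['Sales'].mean()

# Move the 'Region' index level into columns, yielding a 2D table
table = mean_sales.unstack('Region')
print(table)
# Rows: Category; columns: North, South, West; e.g. Clothing/North is
# 700.0, while missing combinations such as Clothing/West are NaN.
```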
The apply Method

While aggregation summarizes group data, sometimes you need to perform more complex, custom operations on each group. The .apply() method on a GroupBy object takes a function and applies it to each group (passed as a DataFrame). The results are then combined.
For example, let's define a function to calculate the range (max - min) of sales within each category:
def sales_range(group):
return group['Sales'].max() - group['Sales'].min()
range_per_category = grouped_by_category.apply(sales_range, include_groups=False)
print(range_per_category)
# Output:
# Category
# Clothing 250
# Electronics 600
# Groceries 150
# dtype: int64
Note: The include_groups=False argument excludes the grouping columns from the DataFrame passed to your function, so the function only sees the remaining columns. Since pandas 2.2, including the grouping columns in apply is deprecated, so it's good practice to be explicit.
apply is very flexible but can be slower than built-in aggregations or transformations because it often involves Python-level loops rather than optimized C implementations.
Beyond aggregation and general application, Pandas offers specialized methods:
- .transform(): Applies a function group-wise but returns an object (Series or DataFrame) that has the same index as the original input. This is useful for operations like standardizing data within groups (e.g., calculating z-scores relative to the group mean and standard deviation).
- .filter(): Allows you to discard entire groups based on a group-level computation. For example, you could keep only the categories whose average sales exceed a certain threshold.

We won't cover these in detail here, but knowing they exist is useful for more advanced data manipulation tasks.
Grouping and aggregation are fundamental operations for understanding and summarizing structured data. They allow you to derive insights from categorical data, compare characteristics across groups, and prepare features for machine learning models (e.g., creating aggregate features based on user ID or time windows). Mastering groupby significantly enhances your ability to manipulate and analyze data effectively with Pandas.
© 2025 ApX Machine Learning