As introduced in the chapter overview, the "split-apply-combine" strategy is a powerful pattern for data analysis. The first step in this process is splitting the data into groups based on some criteria. In Pandas, this is primarily achieved using the groupby() method.
Think of groupby() as a way of telling Pandas: "Take this DataFrame and separate the rows into different buckets, where each bucket contains rows that have the same value in a specific column (or columns)." This operation doesn't immediately change the DataFrame or display the separated data. Instead, it creates a special intermediate object called a GroupBy object. This object holds all the necessary information about the groups and is ready for the next step: applying a function (like calculating a sum or mean) to each group.
The groupby() Method
The basic syntax is straightforward: you call the groupby() method on your DataFrame and pass the name of the column (or a list of column names) you want to group by.
Let's illustrate with an example. Imagine we have a small dataset tracking sales for different products across various regions.
import pandas as pd

# Sample sales data
data = {'Region': ['North', 'South', 'North', 'South', 'East', 'East', 'North'],
        'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A'],
        'Sales': [100, 150, 200, 50, 120, 80, 90]}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
  Region Product  Sales
0  North       A    100
1  South       A    150
2  North       B    200
3  South       B     50
4   East       A    120
5   East       B     80
6  North       A     90
Now, let's group this DataFrame by the 'Region' column:
# Group by the 'Region' column
grouped_by_region = df.groupby('Region')
print("\nResult of groupby('Region'):")
print(grouped_by_region)
Output:
Result of groupby('Region'):
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>
Notice that the output isn't the grouped data itself. It's a DataFrameGroupBy object. Conceptually, you can think of this object as holding multiple mini-DataFrames, one for each unique value in the 'Region' column ('North', 'South', 'East').
Figure: Representation of applying groupby('Region') to the DataFrame. The method creates a GroupBy object which holds references to the subsets of the original DataFrame for each unique region.
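To make the idea of "references to subsets" more concrete, the sketch below inspects a GroupBy object directly. It rebuilds the same sample data so that it runs on its own; the variable names (n_groups, group_labels, north, group_sizes) are chosen here for illustration only.

```python
import pandas as pd

# Same sample sales data as in the example above
df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South', 'East', 'East', 'North'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 50, 120, 80, 90],
})

grouped_by_region = df.groupby('Region')

# Number of groups: one per unique value in 'Region'
n_groups = grouped_by_region.ngroups
print(n_groups)

# Map each group key to the row labels that belong to that group
group_labels = {key: list(idx) for key, idx in grouped_by_region.groups.items()}
print(group_labels)

# Pull out the rows of a single group as an ordinary DataFrame
north = grouped_by_region.get_group('North')
print(north)

# Iterating yields (key, sub-DataFrame) pairs
group_sizes = {key: len(sub_df) for key, sub_df in grouped_by_region}
print(group_sizes)
```

Inspecting groups this way is mostly useful for checking that the split did what you expected before moving on to the 'apply' step.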
This GroupBy object is the cornerstone of the split-apply-combine process. While it doesn't show much on its own, it's ready for the 'apply' step. You can perform various operations on it, such as calculating a sum or mean for each group.
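As a minimal sketch of the 'apply' step, the example below computes the total and mean of 'Sales' for each region. It rebuilds the sample data so the block is self-contained; total_sales and mean_sales are names introduced here for illustration.

```python
import pandas as pd

# Same sample sales data as in the example above
df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South', 'East', 'East', 'North'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 50, 120, 80, 90],
})

grouped_by_region = df.groupby('Region')

# Select the numeric 'Sales' column, then aggregate within each region
total_sales = grouped_by_region['Sales'].sum()
mean_sales = grouped_by_region['Sales'].mean()

print(total_sales)
print(mean_sales)
```

Selecting 'Sales' before aggregating restricts the calculation to that one numeric column, so the text values in 'Product' are not involved. Each result is a Series indexed by the region names.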
For now, the important takeaway is that df.groupby('column_name') efficiently splits your DataFrame based on the specified column(s) and creates a GroupBy object, setting the stage for subsequent analysis on each group.
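Grouping by more than one column works the same way, except that you pass a list of column names. The sketch below, again using the sample sales data and illustrative variable names, groups by both 'Region' and 'Product' so that each distinct combination of the two becomes its own group.

```python
import pandas as pd

# Same sample sales data as in the example above
df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South', 'East', 'East', 'North'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 50, 120, 80, 90],
})

# Pass a list of column names to group by each (Region, Product) pair
grouped_by_region_product = df.groupby(['Region', 'Product'])

# One group per distinct combination that appears in the data
n_groups = grouped_by_region_product.ngroups
print(n_groups)

# size() counts the rows in each group; the result is indexed by the key pairs
sizes = grouped_by_region_product.size()
print(sizes)
```

Here the two rows for product 'A' in the 'North' region fall into the same group, so the seven rows are split into six groups.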
© 2025 ApX Machine Learning