Grouping and aggregation are fundamental data manipulation concepts. We will work through several examples using a sample dataset to demonstrate the split-apply-combine pattern with Pandas groupby().Setting Up Our Example DataFirst, make sure you have Pandas imported. If you're following along in a Jupyter Notebook or Python script, start with this:import pandas as pd import numpy as np # Often useful alongside PandasNow, let's create a simple DataFrame representing sales data for different product categories across various regions. This will be our data for the exercises.# Create the sample data data = {'Category': ['Electronics', 'Clothing', 'Electronics', 'Groceries', 'Clothing', 'Groceries', 'Electronics', 'Clothing'], 'Region': ['North', 'South', 'North', 'East', 'West', 'East', 'West', 'South'], 'Sales': [1200, 300, 1500, 250, 450, 200, 1800, 350], 'Quantity': [2, 3, 1, 5, 4, 4, 3, 4]} sales_df = pd.DataFrame(data) print("Sample Sales Data:") print(sales_df)This should output:Sample Sales Data: Category Region Sales Quantity 0 Electronics North 1200 2 1 Clothing South 300 3 2 Electronics North 1500 1 3 Groceries East 250 5 4 Clothing West 450 4 5 Groceries East 200 4 6 Electronics West 1800 3 7 Clothing South 350 4Grouping by a Single ColumnOur first task is to find the total sales for each product Category. This involves splitting the DataFrame by Category, applying the sum() function to the Sales column within each group, and combining the results.# Group by 'Category' and calculate the sum of 'Sales' for each category category_sales_sum = sales_df.groupby('Category')['Sales'].sum() print("\nTotal Sales per Category:") print(category_sales_sum)Output:Total Sales per Category: Category Clothing 1100 Electronics 4500 Groceries 450 Name: Sales, dtype: int64Notice the result is a Pandas Series, where the index is the Category we grouped by, and the values are the summed Sales.What if we want the average quantity sold per category?# Group by 'Category' and calculate the mean of 'Quantity' category_quantity_avg = sales_df.groupby('Category')['Quantity'].mean() print("\nAverage Quantity per Category:") print(category_quantity_avg)Output:Average Quantity per Category: Category Clothing 3.666667 Electronics 2.000000 Groceries 4.500000 Name: Quantity, dtype: float64You can apply other aggregation functions similarly, such as count(), size(), min(), max(), std() (standard deviation), etc. Remember count() excludes missing values (NaN), while size() includes them.Applying Multiple AggregationsOften, you'll want to calculate several summary statistics for each group at once. The .agg() method is perfect for this.Let's calculate the total sales (sum) and the average quantity (mean) for each Category.# Group by 'Category' and apply multiple aggregations category_summary = sales_df.groupby('Category').agg( Total_Sales=('Sales', 'sum'), Average_Quantity=('Quantity', 'mean'), Number_of_Sales=('Sales', 'count') # Count non-missing sales entries ) print("\nSummary Statistics per Category:") print(category_summary)Output:Summary Statistics per Category: Total_Sales Average_Quantity Number_of_Sales Category Clothing 1100 3.666667 3 Electronics 4500 2.000000 3 Groceries 450 4.500000 2Here, we passed a dictionary to .agg(). The keys of the dictionary ('Total_Sales', 'Average_Quantity', 'Number_of_Sales') become the new column names in the resulting DataFrame. The values are tuples where the first element is the original column to aggregate ('Sales' or 'Quantity') and the second element is the aggregation function ('sum', 'mean', 'count'). This provides a clear and organized summary.Alternatively, you can pass a list of functions if you want to apply the same set of aggregations to all numerical columns (or a selected column):# Calculate min and max sales for each category category_sales_min_max = sales_df.groupby('Category')['Sales'].agg(['min', 'max']) print("\nMin and Max Sales per Category:") print(category_sales_min_max)Output:Min and Max Sales per Category: min max Category Clothing 300 450 Electronics 1200 1800 Groceries 200 250Grouping by Multiple ColumnsWe can create more granular groups by passing a list of column names to groupby(). Let's find the total sales for each combination of Region and Category.# Group by both 'Region' and 'Category', then sum 'Sales' region_category_sales = sales_df.groupby(['Region', 'Category'])['Sales'].sum() print("\nTotal Sales per Region and Category:") print(region_category_sales)Output:Total Sales per Region and Category: Region Category East Groceries 450 North Electronics 2700 South Clothing 650 West Clothing 450 Electronics 1800 Name: Sales, dtype: int64The result is a Series with a MultiIndex (Region, Category). This hierarchical index represents the combinations of the groups.You can apply multiple aggregations here as well:# Group by 'Region' and 'Category', calculate sum of Sales and mean Quantity region_category_summary = sales_df.groupby(['Region', 'Category']).agg( Total_Sales=('Sales', 'sum'), Average_Quantity=('Quantity', 'mean') ) print("\nSummary per Region and Category:") print(region_category_summary)Output:Summary per Region and Category: Total_Sales Average_Quantity Region Category East Groceries 450 4.5 North Electronics 2700 1.5 South Clothing 650 3.5 West Clothing 450 4.0 Electronics 1800 3.0This gives us a DataFrame with the MultiIndex on the rows.Visualizing Grouped DataAggregated data is often best understood visually. Let's create a simple bar chart of the total sales per category using Plotly.{"data":[{"type":"bar","x":["Clothing","Electronics","Groceries"],"y":[1100,4500,450],"marker":{"color":["#228be6","#ae3ec9","#74b816"]}}],"layout":{"title":{"text":"Total Sales per Product Category"},"xaxis":{"title":{"text":"Category"}},"yaxis":{"title":{"text":"Total Sales ($)"}},"template":"plotly_white","width":600,"height":400}}Total sales generated by each product category. Electronics clearly dominates in this small dataset.Practice TasksNow, try these on your own using the sales_df DataFrame:Find the average Sales for each Region.Count the number of sales records (size()) for each Region.Find the max quantity sold within each Region and Category combination.Calculate the sum of Sales and the sum of Quantity for each Region. Use the .agg() method.These exercises reinforce the different ways you can apply groupby() and aggregation functions to extract meaningful summaries from your data. Mastering grouping is a significant step in data analysis with Pandas.