You've learned how to use aggregate functions like COUNT()
, SUM()
, and AVG()
to compute summary statistics across all rows returned by your query. While getting a single number like the total number of orders or the average product price across the entire catalog is useful, often you need more granular summaries. For instance, instead of the overall average product price, you might want the average price for each product category. Or, instead of the total number of customers, you might want the number of customers in each city.
This is where the GROUP BY
clause comes in. It allows you to divide the rows of your table into smaller groups based on the values in one or more columns. The aggregate functions are then applied independently to each of these groups.
Think of GROUP BY
as a process that reorganizes your data before aggregation happens:
FROM
and WHERE
clauses into groups. All rows within a group have the same value in the column(s) specified in the GROUP BY
clause.SELECT
clause (like COUNT()
, AVG()
, SUM()
, MIN()
, MAX()
) are then applied to each group separately.The basic syntax looks like this:
SELECT
column_to_group_by,
aggregate_function(column_to_aggregate)
FROM
table_name
WHERE
-- Optional: filter rows BEFORE grouping
condition
GROUP BY
column_to_group_by
ORDER BY
-- Optional: sort the grouped results
column_to_group_by; -- or aggregate_function result
Notice the order: GROUP BY
comes after FROM
and WHERE
, but before ORDER BY
.
Let's use a hypothetical Orders
table containing information about customer orders:
order_id | customer_id | order_date | order_total |
---|---|---|---|
101 | 1 | 2023-10-01 | 50.00 |
102 | 2 | 2023-10-01 | 120.50 |
103 | 1 | 2023-10-05 | 75.25 |
104 | 3 | 2023-10-06 | 30.00 |
105 | 2 | 2023-10-08 | 80.00 |
106 | 1 | 2023-10-10 | 45.75 |
Suppose we want to find the total amount spent by each customer. We need to group the rows by customer_id
and then apply the SUM()
function to the order_total
for each group.
SELECT
customer_id,
SUM(order_total) AS total_spent
FROM
Orders
GROUP BY
customer_id;
Here's how SQL processes this:
Orders
table.customer_id
:
customer_id
= 1 (orders 101, 103, 106)customer_id
= 2 (orders 102, 105)customer_id
= 3 (order 104)SUM(order_total)
for each group:
customer_id | total_spent |
---|---|
1 | 171.00 |
2 | 200.50 |
3 | 30.00 |
We could also find the number of orders per customer using COUNT(*)
:
SELECT
customer_id,
COUNT(*) AS number_of_orders
FROM
Orders
GROUP BY
customer_id;
This would return:
customer_id | number_of_orders |
---|---|
1 | 3 |
2 | 2 |
3 | 1 |
SELECT
List Rule with GROUP BY
There's an important rule when using GROUP BY
: Any column in the SELECT
list that is not an aggregate function must be included in the GROUP BY
clause.
Why? Consider the query SELECT customer_id, SUM(order_total) FROM Orders GROUP BY customer_id;
. This works because for each group (defined by a single customer_id
), there's a single customer_id
value and a single SUM(order_total)
result.
Now, imagine you tried this:
-- This query is usually INVALID in standard SQL
SELECT
customer_id,
order_date, -- Not aggregated, not in GROUP BY
SUM(order_total) AS total_spent
FROM
Orders
GROUP BY
customer_id;
For customer_id
1, there are multiple order_date
values ('2023-10-01', '2023-10-05', '2023-10-10'). Since the query collapses these three rows into a single output row for customer_id
1, SQL doesn't know which order_date
to display. To avoid this ambiguity, SQL requires any non-aggregated column in the SELECT
list to also be in the GROUP BY
clause. If you put order_date
in the GROUP BY
, you'd be grouping by the combination of customer_id
and order_date
, which is a different calculation.
You are not limited to grouping by just one column. You can specify multiple columns in the GROUP BY
clause to create finer-grained groups based on the unique combination of values in those columns.
Let's add a product_category
column to our Orders
table (assuming each order is for one category for simplicity):
order_id | customer_id | product_category | order_total |
---|---|---|---|
101 | 1 | Books | 50.00 |
102 | 2 | Electronics | 120.50 |
103 | 1 | Groceries | 75.25 |
104 | 3 | Books | 30.00 |
105 | 2 | Groceries | 80.00 |
106 | 1 | Books | 45.75 |
Now, let's find the total spent by each customer within each product category:
SELECT
customer_id,
product_category,
SUM(order_total) AS category_total_spent
FROM
Orders
GROUP BY
customer_id, product_category -- Group by the combination
ORDER BY
customer_id, product_category; -- Optional sorting
SQL now forms groups based on unique pairs of (customer_id, product_category)
:
The result would be:
customer_id | product_category | category_total_spent |
---|---|---|
1 | Books | 95.75 |
1 | Groceries | 75.25 |
2 | Electronics | 120.50 |
2 | Groceries | 80.00 |
3 | Books | 30.00 |
The diagram illustrates how
GROUP BY customer_id, product_category
partitions the original rows based on unique combinations of customer and category, allowingSUM(order_total)
to be calculated for each distinct pair.
WHERE
Remember that the WHERE
clause filters individual rows before the GROUP BY
operation takes place. If you want to calculate aggregates only for a subset of your original data, use WHERE
. For example, to find the total spent per customer, but only considering orders placed after '2023-10-04':
SELECT
customer_id,
SUM(order_total) AS total_spent_recent
FROM
Orders
WHERE
order_date > '2023-10-04' -- Filter rows first
GROUP BY
customer_id; -- Group remaining rows
This query first removes orders 101 and 102, and then groups the remaining rows (103, 104, 105, 106) by customer_id
before summing the totals.
The GROUP BY
clause is fundamental for creating meaningful summaries from your data. It works hand-in-hand with aggregate functions to collapse multiple rows into informative summary statistics for distinct groups within your dataset. Next, we'll look at how to filter these grouped results using the HAVING
clause.
© 2025 ApX Machine Learning