Let’s motivate grouping sets with an example. Both the dataset and prompt are sourced from pgexercises - an excellent resource for learning and practicing Postgres-flavoured SQL. I’ll also be using DuckDB along the way (it too is Postgres-flavoured).
We’ve got a country club which has a couple of facilities such as tennis courts
and pool tables. We’re using a database to keep track of members and bookings.
Members (and guests, at a slightly higher rate) can book facilities. The
duration for a given booking is kept track of using a starttime
and the number
of half-hour slots
.
We want to know:
- For each facility:
- the number of slots booked for the entire month
- the number of slots booked for the entire year (2012)
- The total number of slots booked across the entire country club for that year
In other words, the result should have the following shape:
Facility | Month | Slots |
---|---|---|
Tennis Court 2 | August | 483 |
September | 588 | |
1071 | ||
Badminton Court | August | 459 |
September | 579 | |
1029 | ||
Total | 2100 |
As you can observe, we are aggregating across multiple hierarchies or ‘zoom levels’.
The most straightforward approach then is to carry out the aggregations separately then union them all together:
select
facility,
month,
sum(slots) as slots
from bookings_2012
group by facility, month
union all
select
facility,
null,
sum(slots) as slots
from bookings_2012
group by facility
union all
select
null,
null,
sum(slots) as slots
from bookings_2012
group by facility
order by facility, month
Note that we use union all
instead of union
since we don’t expect (or mind)
any duplicate rows.
This query could be made less verbose by using a CTE - again, credits to pgexercises:
with t as (
select
facility,
month,
slots
from bookings_2012
)
select facilty, month, sum(slots) from t group by facility, month
union all
select facility, null, sum(slots) from t group by facility
union all
select null, null, sum(slots) from t
order by facility, month;
Rollup #
Better yet, the above query could be simplified by using rollup
:
select
facility,
month,
sum(slots)
from bookings_2012
group by rollup(facilty, month)
order by facility, month
With all three versions, we get the same result:
┌─────────────────┬───────────┬────────┐
│ facility │ month │ slots │
│ varchar │ varchar │ int128 │
├─────────────────┼───────────┼────────┤
│ Tennis Court 2 │ August │ 483 │
│ Tennis Court 2 │ September │ 588 │
│ Tennis Court 2 │ │ 1071 │
│ Badminton Court │ August │ 459 │
│ Badminton Court │ September │ 570 │
│ Badminton Court │ │ 1029 │
│ │ │ 2100 │
└─────────────────┴───────────┴────────┘
Rollup is used to aggregate across levels in the order they are passed.
Therefore, with facility, month
we get the levels in the following order:
(facility, month)
: sum the slots for each facility, month pair(facility)
: sum the slots for each facility()
: sum all slots
If we switch the order i.e. to (month, facility)
, we get a result with the
following shape:
Facility | Month | Slots |
---|---|---|
August | Tennis Court 2 | 483 |
Badminton Court | 459 | |
942 | ||
September | Tennis Court 2 | 588 |
Badminton Court | 570 | |
1158 | ||
Total | 2100 |
Grouping set #
Rollups are in fact syntactic sugar for the more generic grouping sets
clause.
Therefore, group by rollup(facility, month)
expands to:
group by grouping sets(
(facility, month),
(facility),
(),
)
Grouping sets were added to SQL since with the plain group by, you can only specify a single grouping to aggregate across.
However, as the previous example demonstrates you may need to aggregate across
multiple kinds of groups in parallel - hence grouping sets
in SQL.
Grouping sets give as more flexibility. For example, if we don’t need the year’s sum we can discard it:
select
facility,
month,
sum(slots) as slots
from bookings_2012
group by grouping sets ((facility, month), (facility))
The result is:
┌─────────────────┬───────────┬────────┐
│ facility │ month │ slots │
│ varchar │ varchar │ int128 │
├─────────────────┼───────────┼────────┤
│ Tennis Court 2 │ August │ 483 │
│ Tennis Court 2 │ September │ 588 │
│ Tennis Court 2 │ │ 1071 │
│ Badminton Court │ August │ 459 │
│ Badminton Court │ September │ 570 │
│ Badminton Court │ │ 1029 │
└─────────────────┴───────────┴────────┘
Cube #
There’s also cube
which we can use to aggregate across all possible
permutations of the given columns and expressions (i.e. the powerset). For
example, from the Postgres docs, group by cube(a,b,c)
is equivalent to:
grouping sets (
( a, b, c ),
( a, b ),
( a, c ),
( a ),
( b, c ),
( b ),
( c ),
( )
)
Grouping aggregate function #
As an aside, consider the following case: if a column within that grouping set
has a null, it cannot be distinguished from the columns outside the group (since
those are also filled with nulls). Therefore, both Postgres and DuckDB provide
the grouping
function. In duckdb, it also has the alias grouping_id
. This
function takes in a group and returns 0 if that row is in the given group or 1
otherwise.
For example, given the following query:
select
facility,
month,
grouping_id(month) as in_group_month,
sum(slots)
from bookings_2012
group by grouping sets (facility, month)
order by facility, month
We get:
┌─────────────────┬───────────┬────────────────┬────────────┐
│ facility │ month │ in_group_month │ sum(slots) │
│ varchar │ varchar │ int64 │ int128 │
├─────────────────┼───────────┼────────────────┼────────────┤
│ Badminton Court │ │ 1 │ 1209 │
│ Tennis Court 2 │ │ 1 │ 1278 │
│ │ August │ 0 │ 942 │
│ │ July │ 0 │ 387 │
│ │ September │ 0 │ 1158 │
└─────────────────┴───────────┴────────────────┴────────────┘
In summary #
Do check out the Postgres documentation for more details on grouping sets
.
It’s quite a neat addition to SQL that saves us from both verbose queries and
having to merge results client-side for certain kinds of reporting.
References #
- PostgreSQL Exercises - Alisdair Owens: pgexercises
- PostgreSQL Documentation - GROUPING SETS, CUBE and ROLLUP: docs
- DuckDB Documentation - GROUPING SETS: link
- PostgreSQL GROUPING SETS: Postgres Tutorial
- The Art of Postgres - Grouping Sets - Dimitri Fontaine