`GROUPING SETS`, `ROLLUP` and `CUBE` can be used in the `GROUP BY` clause to perform a grouping over multiple dimensions within the same query.
Note that this syntax is not compatible with `GROUP BY ALL`.
Examples
Compute the average income along the provided four different dimensions:
-- the syntax () denotes the empty set (i.e., computing an ungrouped aggregate)
SELECT city, street_name, avg(income)
FROM addresses
GROUP BY GROUPING SETS ((city, street_name), (city), (street_name), ());
Compute the average income along the same dimensions:
SELECT city, street_name, avg(income)
FROM addresses
GROUP BY CUBE (city, street_name);
Compute the average income along the dimensions `(city, street_name)`, `(city)` and `()`:
SELECT city, street_name, avg(income)
FROM addresses
GROUP BY ROLLUP (city, street_name);
Description
`GROUPING SETS` perform the same aggregate across different `GROUP BY` clauses in a single query.
CREATE TABLE students (course VARCHAR, type VARCHAR);
INSERT INTO students (course, type)
VALUES
('CS', 'Bachelor'), ('CS', 'Bachelor'), ('CS', 'PhD'), ('Math', 'Masters'),
('CS', NULL), ('CS', NULL), ('Math', NULL);
SELECT course, type, count(*)
FROM students
GROUP BY GROUPING SETS ((course, type), course, type, ());
| course | type | count_star() |
|---|---|---|
| Math | NULL | 1 |
| NULL | NULL | 7 |
| CS | PhD | 1 |
| CS | Bachelor | 2 |
| Math | Masters | 1 |
| CS | NULL | 2 |
| Math | NULL | 2 |
| CS | NULL | 5 |
| NULL | NULL | 3 |
| NULL | Masters | 1 |
| NULL | Bachelor | 2 |
| NULL | PhD | 1 |
In the above query, we group across four different sets: `(course, type)`, `course`, `type` and `()` (the empty group). The result contains `NULL` for any column that is not part of the grouping set of that row, i.e., the above query is equivalent to the following chain of `UNION ALL` clauses:
-- Group by course, type:
SELECT course, type, count(*)
FROM students
GROUP BY course, type
UNION ALL
-- Group by type:
SELECT NULL AS course, type, count(*)
FROM students
GROUP BY type
UNION ALL
-- Group by course:
SELECT course, NULL AS type, count(*)
FROM students
GROUP BY course
UNION ALL
-- Group by nothing:
SELECT NULL AS course, NULL AS type, count(*)
FROM students;
`CUBE` and `ROLLUP` are syntactic sugar to easily produce commonly used grouping sets.
The `ROLLUP` clause will produce all “sub-groups” of a grouping set, e.g., `ROLLUP (country, city, zip)` produces the grouping sets `(country, city, zip)`, `(country, city)`, `(country)`, `()`. This can be useful for producing different levels of detail of a `GROUP BY` clause. This produces `n + 1` grouping sets where `n` is the number of terms in the `ROLLUP` clause.
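For instance, a `ROLLUP` can always be written out as an explicit `GROUPING SETS` clause. The following sketch assumes a hypothetical `sales` table with `country`, `city` and `zip` columns; the two queries are equivalent:
-- ROLLUP form:
SELECT country, city, zip, count(*)
FROM sales
GROUP BY ROLLUP (country, city, zip);
-- Equivalent explicit GROUPING SETS form (n + 1 = 4 sets):
SELECT country, city, zip, count(*)
FROM sales
GROUP BY GROUPING SETS ((country, city, zip), (country, city), (country), ());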
`CUBE` produces grouping sets for all combinations of the inputs, e.g., `CUBE (country, city, zip)` will produce `(country, city, zip)`, `(country, city)`, `(country, zip)`, `(city, zip)`, `(country)`, `(city)`, `(zip)`, `()`. This produces `2^n` grouping sets.
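Analogously, a `CUBE` expands into a `GROUPING SETS` clause enumerating every subset of its terms. Continuing with the hypothetical `sales` table, the following is equivalent to `GROUP BY CUBE (country, city, zip)` (2^3 = 8 sets):
SELECT country, city, zip, count(*)
FROM sales
GROUP BY GROUPING SETS (
    (country, city, zip),
    (country, city), (country, zip), (city, zip),
    (country), (city), (zip),
    ()
);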
Identifying Grouping Sets with GROUPING_ID()
The super-aggregate rows generated by `GROUPING SETS`, `ROLLUP` and `CUBE` can often be identified by the `NULL` values returned for the respective column in the grouping. But if the columns used in the grouping can themselves contain actual `NULL` values, then it can be challenging to distinguish whether the value in the result set is a “real” `NULL` value coming out of the data itself, or a `NULL` value generated by the grouping construct. The `GROUPING_ID()` or `GROUPING()` function is designed to identify which groups generated the super-aggregate rows in the result.
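This ambiguity is visible in the `students` example above, where the `type` column contains real `NULL` values. A minimal sketch of disambiguating them, reusing that table:
SELECT course, type, GROUPING(type) AS type_rolled_up, count(*)
FROM students
GROUP BY GROUPING SETS ((course, type), (course));
-- Rows with type = NULL and type_rolled_up = 0 carry a real NULL from the data;
-- rows with type_rolled_up = 1 are super-aggregates where type was rolled up.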
`GROUPING_ID()` is an aggregate function that takes the column expressions that make up the grouping(s). It returns a `BIGINT` value. The return value is `0` for rows that are not super-aggregate rows. For super-aggregate rows, it returns an integer value that identifies the combination of expressions that make up the group for which the super-aggregate is generated. At this point, an example might help. Consider the following query:
WITH days AS (
SELECT
year("generate_series") AS y,
quarter("generate_series") AS q,
month("generate_series") AS m
FROM generate_series(DATE '2023-01-01', DATE '2023-12-31', INTERVAL 1 DAY)
)
SELECT y, q, m, GROUPING_ID(y, q, m) AS "grouping_id()"
FROM days
GROUP BY GROUPING SETS (
(y, q, m),
(y, q),
(y),
()
)
ORDER BY y, q, m;
These are the results:
| y | q | m | grouping_id() |
|---|---|---|---|
| 2023 | 1 | 1 | 0 |
| 2023 | 1 | 2 | 0 |
| 2023 | 1 | 3 | 0 |
| 2023 | 1 | NULL | 1 |
| 2023 | 2 | 4 | 0 |
| 2023 | 2 | 5 | 0 |
| 2023 | 2 | 6 | 0 |
| 2023 | 2 | NULL | 1 |
| 2023 | 3 | 7 | 0 |
| 2023 | 3 | 8 | 0 |
| 2023 | 3 | 9 | 0 |
| 2023 | 3 | NULL | 1 |
| 2023 | 4 | 10 | 0 |
| 2023 | 4 | 11 | 0 |
| 2023 | 4 | 12 | 0 |
| 2023 | 4 | NULL | 1 |
| 2023 | NULL | NULL | 3 |
| NULL | NULL | NULL | 7 |
In this example, the lowest level of grouping is at the month level, defined by the grouping set `(y, q, m)`. Result rows corresponding to that level are simply aggregate rows and the `GROUPING_ID(y, q, m)` function returns `0` for those. The grouping set `(y, q)` results in super-aggregate rows over the month level, leaving a `NULL` value for the `m` column, and for which `GROUPING_ID(y, q, m)` returns `1`. The grouping set `(y)` results in super-aggregate rows over the quarter level, leaving `NULL` values for the `m` and `q` columns, for which `GROUPING_ID(y, q, m)` returns `3`. Finally, the `()` grouping set results in one super-aggregate row for the entire result set, leaving `NULL` values for `y`, `q` and `m`, and for which `GROUPING_ID(y, q, m)` returns `7`.
To understand the relationship between the return value and the grouping set, you can think of `GROUPING_ID(y, q, m)` as writing to a bitfield, where the first bit corresponds to the last expression passed to `GROUPING_ID()`, the second bit to the second-to-last expression, and so on. This may become clearer by casting `GROUPING_ID()` to `BIT`:
WITH days AS (
SELECT
year("generate_series") AS y,
quarter("generate_series") AS q,
month("generate_series") AS m
FROM generate_series(DATE '2023-01-01', DATE '2023-12-31', INTERVAL 1 DAY)
)
SELECT
y, q, m,
GROUPING_ID(y, q, m) AS "grouping_id(y, q, m)",
right(GROUPING_ID(y, q, m)::BIT::VARCHAR, 3) AS "y_q_m_bits"
FROM days
GROUP BY GROUPING SETS (
(y, q, m),
(y, q),
(y),
()
)
ORDER BY y, q, m;
This returns the following results:
| y | q | m | grouping_id(y, q, m) | y_q_m_bits |
|---|---|---|---|---|
| 2023 | 1 | 1 | 0 | 000 |
| 2023 | 1 | 2 | 0 | 000 |
| 2023 | 1 | 3 | 0 | 000 |
| 2023 | 1 | NULL | 1 | 001 |
| 2023 | 2 | 4 | 0 | 000 |
| 2023 | 2 | 5 | 0 | 000 |
| 2023 | 2 | 6 | 0 | 000 |
| 2023 | 2 | NULL | 1 | 001 |
| 2023 | 3 | 7 | 0 | 000 |
| 2023 | 3 | 8 | 0 | 000 |
| 2023 | 3 | 9 | 0 | 000 |
| 2023 | 3 | NULL | 1 | 001 |
| 2023 | 4 | 10 | 0 | 000 |
| 2023 | 4 | 11 | 0 | 000 |
| 2023 | 4 | 12 | 0 | 000 |
| 2023 | 4 | NULL | 1 | 001 |
| 2023 | NULL | NULL | 3 | 011 |
| NULL | NULL | NULL | 7 | 111 |
Note that the number of expressions passed to `GROUPING_ID()`, and the order in which they are passed, are independent of the actual group definitions appearing in the `GROUPING SETS` clause (or the groups implied by `ROLLUP` and `CUBE`). As long as the expressions passed to `GROUPING_ID()` appear somewhere in the `GROUPING SETS` clause, `GROUPING_ID()` will set a bit corresponding to the position of the expression whenever that expression is rolled up to a super-aggregate.
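As a quick check of that behavior, reversing the argument order in the query above flips which bit each expression occupies, even though the grouping sets themselves are unchanged. Under the convention just described, `GROUPING_ID(m, q, y)` should return `4` instead of `1` for the `(y, q)` super-aggregates, `6` instead of `3` for `(y)`, and still `7` for `()`:
WITH days AS (
    SELECT
        year("generate_series") AS y,
        quarter("generate_series") AS q,
        month("generate_series") AS m
    FROM generate_series(DATE '2023-01-01', DATE '2023-12-31', INTERVAL 1 DAY)
)
SELECT y, q, m, GROUPING_ID(m, q, y) AS g -- arguments reversed: bits now correspond to m, q, y
FROM days
GROUP BY GROUPING SETS ((y, q, m), (y, q), (y), ())
ORDER BY y, q, m;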