Analyzing a Git Repository
You can use DuckDB to analyze Git logs using the output of the git log command.
Exporting the Git Log
We start by picking a character that doesn't occur in any part of the commit log (author names, messages, etc.). Since v1.2.0, DuckDB's CSV reader supports 4-byte delimiters, making it possible to use emojis!
Despite being featured in The Emoji Movie (IMDb rating: 3.4), we can assume that the Fish Cake with Swirl emoji (🍥) is not a common occurrence in most Git logs.
So, let's clone the duckdb/duckdb repository and export its log as follows:
git log --date=iso-strict --pretty=format:%ad🍥%h🍥%an🍥%s > git-log.csv
The resulting file looks like this:
2025-02-25T18:12:54+01:00🍥d608a31e13🍥Mark🍥MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation (#16400)
2025-02-25T15:05:56+01:00🍥920b39ad96🍥Mark🍥Read support for Parquet Float16 (#16395)
2025-02-25T13:43:52+01:00🍥61f55734b9🍥Carlo Piovesan🍥MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation
2025-02-25T12:35:28+01:00🍥87eff7ebd3🍥Mark🍥Fix issue #16377 (#16391)
2025-02-25T10:33:49+01:00🍥35af26476e🍥Hannes Mühleisen🍥Read support for Parquet Float16
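Before loading the file, it can be worth confirming that the delimiter really is a 4-byte character in UTF-8 and that every exported line splits into exactly four fields. The following is a minimal Python sketch of such a check, not something DuckDB provides; the helper names `utf8_len` and `lines_with_wrong_field_count` are made up for illustration, and the sketch assumes one commit per line.

```python
# Fish Cake with Swirl (U+1F365), written as an escape to avoid encoding issues.
DELIM = "\U0001F365"


def utf8_len(text):
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def lines_with_wrong_field_count(lines, delim=DELIM, expected_fields=4):
    """Return (line_number, line) pairs that do not split into `expected_fields` fields.

    A delimiter occurring inside an author name or commit subject
    would show up here as a line with too many fields.
    """
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if line.count(delim) != expected_fields - 1
    ]


if __name__ == "__main__":
    sample = [
        DELIM.join(["2025-02-25T12:35:28+01:00", "87eff7ebd3", "Mark", "Fix issue #16377 (#16391)"]),
    ]
    print(utf8_len(DELIM))                       # UTF-8 byte length of the delimiter
    print(lines_with_wrong_field_count(sample))  # lines that need a closer look
```

An empty list suggests that the delimiter does not occur inside any author name or commit subject in the checked lines.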
Loading the Git Log into DuckDB
Start DuckDB and read the log as a CSV (or rather, a 🍥SV):
CREATE TABLE commits AS
    FROM read_csv(
            'git-log.csv',
            delim = '🍥',
            header = false,
            column_names = ['timestamp', 'hash', 'author', 'message']
        );
This will result in a nice DuckDB table:
FROM commits
LIMIT 5;
┌─────────────────────┬────────────┬──────────────────┬───────────────────────────────────────────────────────────────────────────────┐
│      timestamp      │    hash    │      author      │                                    message                                    │
│      timestamp      │  varchar   │     varchar      │                                    varchar                                    │
├─────────────────────┼────────────┼──────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ 2025-02-25 17:12:54 │ d608a31e13 │ Mark             │ MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation (#16400) │
│ 2025-02-25 14:05:56 │ 920b39ad96 │ Mark             │ Read support for Parquet Float16 (#16395)                                     │
│ 2025-02-25 12:43:52 │ 61f55734b9 │ Carlo Piovesan   │ MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation          │
│ 2025-02-25 11:35:28 │ 87eff7ebd3 │ Mark             │ Fix issue #16377 (#16391)                                                     │
│ 2025-02-25 09:33:49 │ 35af26476e │ Hannes Mühleisen │ Read support for Parquet Float16                                              │
└─────────────────────┴────────────┴──────────────────┴───────────────────────────────────────────────────────────────────────────────┘
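Note that the exported file stores 18:12:54+01:00 for the first commit while the table shows 17:12:54, so the displayed values appear to be the UTC equivalents of the exported timestamps. The Python sketch below mirrors the load step for a single line under that assumption. `parse_line` is a hypothetical helper for illustration, not a DuckDB API.

```python
from datetime import datetime, timezone

DELIM = "\U0001F365"  # Fish Cake with Swirl (U+1F365)
COLUMNS = ["timestamp", "hash", "author", "message"]


def parse_line(line):
    """Split one exported log line into named fields.

    The ISO 8601 timestamp is converted to a naive UTC datetime,
    which is an assumption about how the table above was rendered.
    """
    fields = line.rstrip("\n").split(DELIM)
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    local_time = datetime.fromisoformat(row["timestamp"])
    row["timestamp"] = local_time.astimezone(timezone.utc).replace(tzinfo=None)
    return row


if __name__ == "__main__":
    line = DELIM.join(
        ["2025-02-25T12:35:28+01:00", "87eff7ebd3", "Mark", "Fix issue #16377 (#16391)"]
    )
    print(parse_line(line))
```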
Analyzing the Log
We can analyze this table like any other table in DuckDB.
Common Topics
Let's start with a simple question: which topic was the most commonly mentioned in the commit messages: CI, CLI, or Python?
SELECT
    message.lower().regexp_extract('\b(ci|cli|python)\b') AS topic,
    count(*) AS num_commits
FROM commits
WHERE topic <> ''
GROUP BY ALL
ORDER BY num_commits DESC;
┌─────────┬─────────────┐
│  topic  │ num_commits │
│ varchar │    int64    │
├─────────┼─────────────┤
│ ci      │         828 │
│ python  │         666 │
│ cli     │          49 │
└─────────┴─────────────┘
Out of these three topics, commits related to continuous integration dominate the log!
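The \b word boundaries in the pattern matter: without them, ci would also match inside words such as "circle", and cli inside "client". Also, because only one match is extracted per message, each commit is counted under at most one topic (the first one that appears). The Python sketch below illustrates both points with the standard `re` module. `topic_counts` is a hypothetical helper, and the sample messages are made up.

```python
import re
from collections import Counter

# Same pattern as in the SQL query: whole words only.
TOPIC_RE = re.compile(r"\b(ci|cli|python)\b")


def topic_counts(messages):
    """Count messages per topic, using the first whole-word match in each message."""
    counts = Counter()
    for message in messages:
        match = TOPIC_RE.search(message.lower())
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Fix CI on Windows",
        "Python: add type hints",
        "circle back to client code",  # contains "ci" and "cli" only inside longer words
        "CLI and Python fixes",        # counted once, under the first match
    ]
    print(topic_counts(sample))
```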
We can also do a more exploratory analysis by looking at all words in the commit messages. To do so, we first tokenize the messages:
CREATE TABLE words AS
    SELECT unnest(
        message
            .lower()
            .regexp_replace('\W', ' ')
            .trim(' ')
            .string_split_regex('\W')
    ) AS word
    FROM commits;
Then, we remove stopwords using a pre-defined list:
CREATE TABLE stopwords AS
SELECT unnest(['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'did', 'do', 'does', 'doing', 'don', 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'itself', 'just', 'me', 'more', 'most', 'my', 'myself', 'no', 'nor', 'not', 'now', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 's', 'same', 'she', 'should', 'so', 'some', 'such', 't', 'than', 'that', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was', 'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'you', 'your', 'yours', 'yourself', 'yourselves']) AS word;
CREATE OR REPLACE TABLE words AS
    FROM words
    NATURAL ANTI JOIN stopwords
    WHERE word != '';
We use the NATURAL ANTI JOIN clause here, which lets us elegantly filter out values that occur in the stopwords table.
Finally, we select the 20 most common words:
SELECT word, count(*) AS count FROM words
GROUP BY ALL
ORDER BY count DESC
LIMIT 20;
┌──────────┬───────┐
│   word   │ count │
│ varchar  │ int64 │
├──────────┼───────┤
│ merge    │ 12550 │
│ fix      │  6402 │
│ branch   │  6005 │
│ pull     │  5950 │
│ request  │  5945 │
│ add      │  5687 │
│ test     │  3801 │
│ master   │  3289 │
│ tests    │  2339 │
│ issue    │  1971 │
│ main     │  1935 │
│ remove   │  1884 │
│ format   │  1819 │
│ duckdb   │  1710 │
│ use      │  1442 │
│ mytherin │  1410 │
│ fixes    │  1333 │
│ hawkfish │  1147 │
│ feature  │  1139 │
│ function │  1088 │
├──────────┴───────┤
│     20 rows      │
└──────────────────┘
As expected, there are many Git terms (merge, branch, pull, etc.), followed by terminology related to development (fix, test/tests, issue, format).
We also see the account names of some developers (mytherin, hawkfish), which are likely there due to commit messages for merging pull requests (e.g., "Merge pull request #13776 from Mytherin/expressiondepth").
Finally, we also see some DuckDB-related terms such as duckdb (shocking!) and function.
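To see what the tokenize, filter, and count steps do to a single message, here is a rough Python equivalent using only the standard library. It is a sketch for cross-checking, not the DuckDB implementation: `tokenize` and `top_words` are hypothetical helpers, the stopword set is a short illustrative subset of the list above, and `re.ASCII` is used on the assumption that \W in DuckDB's regex engine treats only ASCII letters, digits, and underscores as word characters.

```python
import re
from collections import Counter

# Illustrative subset of the stopword list used in the SQL example.
STOPWORDS = {"a", "an", "and", "for", "from", "in", "of", "the", "to", "with"}

# Assumption: non-word characters are everything outside [A-Za-z0-9_].
NON_WORD = re.compile(r"\W", re.ASCII)


def tokenize(message):
    """Lowercase a message and split it on non-word characters, dropping empty tokens."""
    return [token for token in NON_WORD.split(message.lower()) if token]


def top_words(messages, n=20):
    """Return the n most common non-stopword tokens as (word, count) pairs."""
    counts = Counter(
        token
        for message in messages
        for token in tokenize(message)
        if token not in STOPWORDS  # plays the role of the anti join
    )
    return counts.most_common(n)


if __name__ == "__main__":
    print(tokenize("Merge pull request #13776 from Mytherin/expressiondepth"))
    print(top_words(["Fix the test", "Fix tests for CI"], n=3))
```

The tokenized merge message shows why account names such as mytherin end up in the word list: the slash in the branch reference is a non-word character, so the account name becomes a token of its own.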
Visualizing the Number of Commits
Let's visualize the number of commits each year:
SELECT
    year(timestamp) AS year,
    count(*) AS num_commits,
    num_commits.bar(0, 20_000) AS num_commits_viz
FROM commits
GROUP BY ALL
ORDER BY ALL;
┌───────┬─────────────┬──────────────────────────────────────────────────────────────────────────────────┐
│ year  │ num_commits │                                 num_commits_viz                                  │
│ int64 │    int64    │                                     varchar                                      │
├───────┼─────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│  2018 │         870 │ ███▍                                                                             │
│  2019 │        1621 │ ██████▍                                                                          │
│  2020 │        3484 │ █████████████▉                                                                   │
│  2021 │        6488 │ █████████████████████████▉                                                       │
│  2022 │        9817 │ ███████████████████████████████████████▎                                         │
│  2023 │       14585 │ ██████████████████████████████████████████████████████████▎                      │
│  2024 │       15949 │ ███████████████████████████████████████████████████████████████▊                 │
│  2025 │        1788 │ ███████▏                                                                         │
└───────┴─────────────┴──────────────────────────────────────────────────────────────────────────────────┘
We see steady growth over the years, especially considering that many of DuckDB's features and clients, which were originally part of the main repository, are now maintained in separate repositories (e.g., Java, R).
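The length of each bar follows from simple scaling: a value is mapped linearly from the range 0 to 20,000 onto the width of the bar column. The Python sketch below reproduces that arithmetic, assuming a default width of 80 characters and eighth-block characters for the fractional part. The `bar` function here is an illustrative reimplementation, not DuckDB's code.

```python
FULL_BLOCK = "\u2588"
# Index i holds the character covering i/8 of a cell (index 0 means "nothing").
PARTIAL_BLOCKS = ["", "\u258F", "\u258E", "\u258D", "\u258C", "\u258B", "\u258A", "\u2589"]


def bar(value, low, high, width=80):
    """Render `value` as a bar whose length is proportional to (value - low).

    The bar is `width` characters long when value == high; values outside
    [low, high] are clamped. Trailing padding is omitted in this sketch.
    """
    clamped = min(max(value, low), high)
    eighths = int(width * 8 * (clamped - low) / (high - low))
    return FULL_BLOCK * (eighths // 8) + PARTIAL_BLOCKS[eighths % 8]


if __name__ == "__main__":
    for year, num_commits in [(2018, 870), (2024, 15949), (2025, 1788)]:
        print(year, bar(num_commits, 0, 20_000))
```

For example, 15,949 commits out of a maximum of 20,000 cover just under 64 of the 80 available cells, which matches the length of the 2024 bar above.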
Happy hacking!