You can create a DuckDB user-defined function (UDF) from a Python function so that it can be used in SQL queries. Like regular functions, UDFs need a name, a return type, and parameter types.

Here is an example using a Python function that calls a third-party library:
```python
import duckdb
from duckdb.typing import *
from faker import Faker

def generate_random_name():
    fake = Faker()
    return fake.name()

duckdb.create_function("random_name", generate_random_name, [], VARCHAR)
res = duckdb.sql("SELECT random_name()").fetchall()
print(res)
```

```text
[('Gerald Ashley',)]
```
## Creating Functions

To register a Python UDF, use the `create_function` method from a DuckDB connection. Here is the syntax:

```python
import duckdb

con = duckdb.connect()
con.create_function(name, function, parameters, return_type)
```
The `create_function` method takes the following parameters:

- `name`: A string representing the unique name of the UDF within the connection catalog.
- `function`: The Python function you wish to register as a UDF.
- `parameters`: Scalar functions can operate on one or more columns. This parameter takes a list of column types used as input.
- `return_type`: Scalar functions return one element per row. This parameter specifies the return type of the function.
- `type` (optional): DuckDB supports both built-in Python types and PyArrow Tables. By default, built-in types are assumed, but you can specify `type = 'arrow'` to use PyArrow Tables.
- `null_handling` (optional): By default, `NULL` values are automatically handled as `NULL`-in, `NULL`-out. Users can specify a desired behavior for `NULL` values by setting `null_handling = 'special'`.
- `exception_handling` (optional): By default, when an exception is thrown from the Python function, it will be re-thrown in Python. Users can disable this behavior, and instead return `NULL`, by setting this parameter to `'return_null'`.
- `side_effects` (optional): By default, functions are expected to produce the same result for the same input. If the result of a function is impacted by any type of randomness, `side_effects` must be set to `True`.
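For illustration, here is a minimal sketch that spells out these parameters in a single call; the `add_noise` function and its registration are hypothetical examples, not part of the DuckDB API:

```python
import random

import duckdb
from duckdb.typing import BIGINT

# Hypothetical scalar function: its result depends on randomness,
# so it must be registered with side_effects=True.
def add_noise(x):
    return x + random.randint(0, 10)

con = duckdb.connect()
con.create_function(
    "add_noise",        # name: unique within the connection catalog
    add_noise,          # function: the Python callable to register
    [BIGINT],           # parameters: a single BIGINT input column
    BIGINT,             # return_type: one BIGINT per row
    type="native",      # optional: built-in Python values (the default)
    side_effects=True,  # optional: output varies for the same input
)
print(con.sql("SELECT add_noise(10)").fetchall())
```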
To unregister a UDF, you can call the `remove_function` method with the UDF name:

```python
con.remove_function(name)
```
## Using Partial Functions

DuckDB UDFs can also be created from Python partial functions.

In the example below, a custom logger returns the concatenation of the execution datetime in ISO format, the argument passed at UDF creation, and the input parameter provided at invocation:
```python
from datetime import datetime
import functools

import duckdb

def get_datetime_iso_format() -> str:
    return datetime.now().isoformat()

def logger_udf(func, arg1: str, arg2: int) -> str:
    return ' '.join([func(), arg1, str(arg2)])

with duckdb.connect() as con:
    con.sql("SELECT * FROM range(10) tbl(id)").to_table("example_table")

    con.create_function(
        'custom_logger',
        functools.partial(logger_udf, get_datetime_iso_format, 'logging data')
    )
    rel = con.sql("SELECT custom_logger(id) FROM example_table;")
    rel.show()

    con.create_function(
        'another_custom_logger',
        functools.partial(logger_udf, get_datetime_iso_format, ':')
    )
    rel = con.sql("SELECT another_custom_logger(id) FROM example_table;")
    rel.show()
```
```text
┌───────────────────────────────────────────┐
│             custom_logger(id)             │
│                  varchar                  │
├───────────────────────────────────────────┤
│ 2025-03-27T12:07:56.811251 logging data 0 │
│ 2025-03-27T12:07:56.811264 logging data 1 │
│ 2025-03-27T12:07:56.811266 logging data 2 │
│ 2025-03-27T12:07:56.811268 logging data 3 │
│ 2025-03-27T12:07:56.811269 logging data 4 │
│ 2025-03-27T12:07:56.811270 logging data 5 │
│ 2025-03-27T12:07:56.811271 logging data 6 │
│ 2025-03-27T12:07:56.811272 logging data 7 │
│ 2025-03-27T12:07:56.811274 logging data 8 │
│ 2025-03-27T12:07:56.811275 logging data 9 │
├───────────────────────────────────────────┤
│ 10 rows                                   │
└───────────────────────────────────────────┘
```

```text
┌────────────────────────────────┐
│   another_custom_logger(id)    │
│            varchar             │
├────────────────────────────────┤
│ 2025-03-27T12:07:56.812106 : 0 │
│ 2025-03-27T12:07:56.812116 : 1 │
│ 2025-03-27T12:07:56.812118 : 2 │
│ 2025-03-27T12:07:56.812119 : 3 │
│ 2025-03-27T12:07:56.812121 : 4 │
│ 2025-03-27T12:07:56.812122 : 5 │
│ 2025-03-27T12:07:56.812123 : 6 │
│ 2025-03-27T12:07:56.812124 : 7 │
│ 2025-03-27T12:07:56.812126 : 8 │
│ 2025-03-27T12:07:56.812127 : 9 │
├────────────────────────────────┤
│ 10 rows                        │
└────────────────────────────────┘
```
## Type Annotation

When the function has type annotations, it's often possible to leave out all of the optional parameters.
Using `DuckDBPyType`, we can implicitly convert many known types to DuckDB's type system.
For example:
```python
import duckdb

def my_function(x: int) -> str:
    return x

duckdb.create_function("my_func", my_function)
print(duckdb.sql("SELECT my_func(42)"))
```
```text
┌─────────────┐
│ my_func(42) │
│   varchar   │
├─────────────┤
│ 42          │
└─────────────┘
```
If only the parameter list types can be inferred, you'll need to pass in `None` as `parameters`.
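As a minimal sketch of that case (the `stringify` function is a hypothetical example): the parameter type is inferred from the annotation, while the return type is unannotated and therefore supplied explicitly, with `None` taking the place of `parameters`:

```python
import duckdb
from duckdb.typing import VARCHAR

# Hypothetical example: the parameter type comes from the annotation;
# the return type is not annotated and is supplied explicitly.
def stringify(x: int):
    return str(x)

duckdb.create_function("stringify", stringify, None, VARCHAR)
print(duckdb.sql("SELECT stringify(42)").fetchall())
```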
## NULL Handling

By default, when a function receives a `NULL` value, it immediately returns `NULL` as part of the default `NULL` handling.
When this is not desired, you need to explicitly set the `null_handling` parameter to `"special"`:
```python
import duckdb
from duckdb.typing import *

def dont_intercept_null(x):
    return 5

duckdb.create_function("dont_intercept", dont_intercept_null, [BIGINT], BIGINT)
res = duckdb.sql("SELECT dont_intercept(NULL)").fetchall()
print(res)
```

```text
[(None,)]
```
With `null_handling="special"`:

```python
import duckdb
from duckdb.typing import *

def dont_intercept_null(x):
    return 5

duckdb.create_function("dont_intercept", dont_intercept_null, [BIGINT], BIGINT, null_handling="special")
res = duckdb.sql("SELECT dont_intercept(NULL)").fetchall()
print(res)
```

```text
[(5,)]
```
> Always use `null_handling="special"` when the function can return `NULL`:
```python
import duckdb
from duckdb.typing import VARCHAR

def return_str_or_none(x: str) -> str | None:
    if not x:
        return None
    return x

duckdb.create_function(
    "return_str_or_none",
    return_str_or_none,
    [VARCHAR],
    VARCHAR,
    null_handling="special"
)
res = duckdb.sql("SELECT return_str_or_none('')").fetchall()
print(res)
```

```text
[(None,)]
```
## Exception Handling

By default, when an exception is thrown from the Python function, it is forwarded (re-thrown) in Python.
If you want to disable this behavior and instead return `NULL`, set the `exception_handling` parameter to `"return_null"`:
```python
import duckdb
from duckdb.typing import *

def will_throw():
    raise ValueError("ERROR")

duckdb.create_function("throws", will_throw, [], BIGINT)
try:
    res = duckdb.sql("SELECT throws()").fetchall()
except duckdb.InvalidInputException as e:
    print(e)

duckdb.create_function("doesnt_throw", will_throw, [], BIGINT, exception_handling="return_null")
res = duckdb.sql("SELECT doesnt_throw()").fetchall()
print(res)
```

```text
Invalid Input Error: Python exception occurred while executing the UDF: ValueError: ERROR

At:
...(5): will_throw
...(9): <module>

[(None,)]
```
## Side Effects

By default, DuckDB assumes the created function is a pure function, meaning it produces the same output when given the same input.
If your function does not follow that rule, for example when it makes use of randomness, you will need to mark the function as having `side_effects`.

For example, this function produces a new count for every invocation:

```python
def count() -> int:
    old = count.counter
    count.counter += 1
    return old

count.counter = 0
```
If we create this function without marking it as having side effects, the result will be the following:

```python
con = duckdb.connect()
con.create_function("my_counter", count, side_effects=False)
res = con.sql("SELECT my_counter() FROM range(10)").fetchall()
print(res)
```

```text
[(0,), (0,), (0,), (0,), (0,), (0,), (0,), (0,), (0,), (0,)]
```
This is clearly not the desired result. When we add `side_effects=True`, the result is as we would expect:

```python
con.remove_function("my_counter")
count.counter = 0
con.create_function("my_counter", count, side_effects=True)
res = con.sql("SELECT my_counter() FROM range(10)").fetchall()
print(res)
```

```text
[(0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,)]
```
## Python Function Types

Currently, two function types are supported: `native` (the default) and `arrow`.
### Arrow

If the function is expected to receive Arrow arrays, set the `type` parameter to `'arrow'`.
This lets the system know to provide Arrow arrays of up to `STANDARD_VECTOR_SIZE` tuples to the function, and to expect an array with the same number of tuples to be returned from the function.
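As a minimal sketch of an Arrow UDF, assuming `pyarrow` is installed (the `arrow_upper` function and its name are illustrative, not part of the DuckDB API):

```python
import duckdb
import pyarrow.compute as pc
from duckdb.typing import VARCHAR

# Receives a batch of values as an Arrow array and must return an
# Arrow array containing the same number of values.
def arrow_upper(names):
    return pc.utf8_upper(names)

duckdb.create_function("arrow_upper", arrow_upper, [VARCHAR], VARCHAR, type="arrow")
print(duckdb.sql("SELECT arrow_upper('hello')").fetchall())
```

Because the function operates on whole batches rather than one value at a time, vectorized libraries such as `pyarrow.compute` can process all rows in a single call.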
### Native

When the function type is set to `native`, the function is provided with a single tuple at a time and is expected to return only a single value.
This can be useful for interacting with Python libraries that don't operate on Arrow, such as `faker`:
```python
import duckdb
from duckdb.typing import *
from faker import Faker

def random_date():
    fake = Faker()
    return fake.date_between()

duckdb.create_function("random_date", random_date, [], DATE, type="native")
res = duckdb.sql("SELECT random_date()").fetchall()
print(res)
```

```text
[(datetime.date(2019, 5, 15),)]
```