A lightweight data processing framework built on DuckDB and 3FS.
Features
🚀 High-performance data processing powered by DuckDB
🌍 Scalable to handle PB-scale datasets
🛠️ Easy operations with no long-running services
Installation
Python 3.8 to 3.12 is supported.
pip install smallpond
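If you are unsure whether your interpreter falls in the supported range, a small check like the following can be run before installing. This is only a sketch: the helper name `is_supported` and the two constants are made up for illustration and are not part of smallpond.

```python
import sys

# Supported range as stated above: Python 3.8 to 3.12 (inclusive)
SUPPORTED_MIN = (3, 8)
SUPPORTED_MAX = (3, 12)


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version lies within the supported range."""
    major_minor = tuple(version_info[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


current = ".".join(str(part) for part in sys.version_info[:3])
if is_supported():
    print(f"Python {current} is within the supported range")
else:
    print(f"Python {current} is outside the supported range (3.8 to 3.12)")
```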
Quick Start
# Download example data
wget https://duckdb.org/data/prices.parquet
import smallpond
# Initialize session
sp = smallpond.init()
# Load data
df = sp.read_parquet("prices.parquet")
# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)
# Save results
df.write_parquet("output/")
# Show results
print(df.to_pandas())
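The quick start repartitions by `ticker` before running a per-partition `GROUP BY`. The pure-Python sketch below illustrates why that ordering gives correct results: once rows are hash-partitioned on the grouping key, every ticker lives in exactly one partition, so per-partition aggregates can be combined without further merging. It does not use smallpond or DuckDB; the function names, the sample rows, and the choice of `zlib.crc32` as the hash are all assumptions made for illustration, and smallpond's actual partitioning scheme may differ.

```python
import zlib


def hash_partition(rows, num_partitions, key):
    """Assign each row to a partition by hashing its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str
        index = zlib.crc32(row[key].encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def min_max_by_ticker(rows):
    """Mimic: SELECT ticker, min(price), max(price) FROM rows GROUP BY ticker."""
    stats = {}
    for row in rows:
        lo, hi = stats.get(row["ticker"], (row["price"], row["price"]))
        stats[row["ticker"]] = (min(lo, row["price"]), max(hi, row["price"]))
    return stats


# Hypothetical sample data, standing in for prices.parquet
rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 7.5},
    {"ticker": "AAA", "price": 12.5},
    {"ticker": "CCC", "price": 3.0},
    {"ticker": "BBB", "price": 6.0},
    {"ticker": "DDD", "price": 42.0},
    {"ticker": "CCC", "price": 4.5},
]

# Step 1: repartition by ticker (3 partitions, as in the quick start)
partitions = hash_partition(rows, 3, key="ticker")

# Step 2: run the aggregation independently on each partition
per_partition = [min_max_by_ticker(partition) for partition in partitions]

# Step 3: combine; no ticker spans two partitions, so a plain union suffices
merged = {}
for result in per_partition:
    merged.update(result)

for ticker in sorted(merged):
    print(ticker, merged[ticker])
```

If the rows were split arbitrarily instead of by `ticker`, the same ticker could appear in several partitions and step 3 would need a second min/max pass over the partial results.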
Performance
We evaluated smallpond using the GraySort benchmark (script) on a cluster comprising 50 compute nodes and 25 storage nodes running 3FS. The benchmark sorted 110.5 TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66 TiB/min.
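As a rough sanity check, the average throughput can be recomputed from the data size and elapsed time quoted above. The variable names are illustrative. Working from the rounded inputs gives about 3.65 TiB/min, in line with the reported figure; the small gap presumably comes from rounding of the quoted size and time.

```python
# Figures quoted in the benchmark paragraph
data_sorted_tib = 110.5
elapsed_minutes = 30 + 14 / 60  # 30 minutes and 14 seconds

# Average throughput = total data sorted / elapsed time
throughput_tib_per_min = data_sorted_tib / elapsed_minutes

print(f"Average throughput: {throughput_tib_per_min:.2f} TiB/min")
```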
Development
pip install .[dev]
# run unit tests
pytest -v tests/test*.py
# build documentation
pip install .[docs]
cd docs
make html
python -m http.server --directory build/html