Kerchunk is a library that provides a unified way to represent a variety of chunked, compressed data formats (e.g. NetCDF, HDF5, GRIB), allowing efficient access to the data from traditional file systems or cloud object storage. It also provides a flexible way to create virtual datasets from multiple files. It does this by extracting the byte ranges, compression details and other metadata about the data, and storing them in a new, separate object. This means that you can create a virtual aggregate dataset over potentially many source files, for efficient, parallel and cloud-friendly in-situ access, without having to copy or translate the originals. It is a gateway to massive in-cloud data processing even while data providers continue to rely on legacy formats for archival storage.
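For a single file, this extraction step looks roughly like the sketch below, which scans an HDF5/NetCDF4 file and writes its chunk references to JSON. The bucket URL is a hypothetical example; any fsspec-readable location works the same way.

```python
import json

import fsspec
import kerchunk.hdf

# Hypothetical source file; any fsspec-readable URL works the same way.
url = "s3://example-bucket/data/file1.nc"

# Scan the HDF5 file once, recording the byte range and compression of every
# chunk rather than copying any data; tiny chunks below the threshold are
# inlined directly into the reference set.
with fsspec.open(url, "rb", anon=True) as f:
    refs = kerchunk.hdf.SingleHdf5ToZarr(f, url, inline_threshold=300).translate()

# The resulting small JSON file now stands in for the original.
with open("file1.json", "w") as out:
    json.dump(refs, out)
```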
Why Kerchunk:
We provide the following:
- completely serverless architecture
- metadata consolidation, so you can understand a many-file dataset (metadata plus physical storage) in a single read
- reads from all of the storage backends supported by fsspec, including object storage (s3, gcs, abfs, alibaba), http, cloud user storage (dropbox, gdrive) and network protocols (ftp, ssh, hdfs, smb, ...); see the read sketch after this list
- loading of various file types (currently netcdf4/HDF5, grib2, tiff, fits, zarr), potentially heterogeneous within a single dataset, without needing to go via the specific driver (e.g., no need for h5py)
- asynchronous, concurrent fetching of many data chunks in one go, amortizing the cost of latency
- parallel access with a library like zarr, without any locks
- logical datasets spanning many (potentially millions of) data files, with direct access/subselection via coordinate indexing across an arbitrary number of dimensions; see the combining sketch after this list
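The last point is where aggregation comes in: per-file reference sets can be merged into one virtual dataset with kerchunk.combine.MultiZarrToZarr. A minimal sketch, assuming the per-file JSONs were produced as above, that the files share lat/lon coordinates, and that they concatenate along a time dimension:

```python
import json

from kerchunk.combine import MultiZarrToZarr

# Per-file reference sets produced as in the sketch above (paths hypothetical).
mzz = MultiZarrToZarr(
    ["file1.json", "file2.json", "file3.json"],
    remote_protocol="s3",
    remote_options={"anon": True},
    concat_dims=["time"],           # assumed aggregation dimension
    identical_dims=["lat", "lon"],  # assumed coordinates shared by every file
)

# One combined reference set now describes a single virtual dataset.
with open("combined.json", "w") as out:
    json.dump(mzz.translate(), out)
```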
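Reading then goes through fsspec's "reference" filesystem, which presents a reference set as a zarr store and fetches the underlying byte ranges concurrently, without locks. A sketch, reusing the hypothetical S3 options from above:

```python
import fsspec
import xarray as xr

# The "reference" filesystem resolves zarr chunk keys to byte ranges in the
# original remote files.
fs = fsspec.filesystem(
    "reference",
    fo="combined.json",
    remote_protocol="s3",
    remote_options={"anon": True},
)

# Open as a lazy xarray dataset via the zarr engine; only the chunks you
# actually index get fetched.
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
)
print(ds)
```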