Computer simulations of the Earth’s climate and weather generate huge amounts of data.
These data are often persisted on HPC systems or in the cloud across multiple data
assets in a variety of formats (netCDF, Zarr, etc.). Finding, investigating, and
loading these data assets into compute-ready data containers costs time and effort.
The data user needs to know which data sets are available, and the attributes describing
each data set, before loading a specific data set and analyzing it.
Getting these assets into data array containers such as xarray can be a daunting task
because of the large number of files a user may be interested in. Intake-esm aims to
address these issues by providing the necessary functionality for searching, discovering,
and accessing/loading data.
Overview
intake-esm is a data cataloging utility built on top of intake, pandas, polars and xarray, and it's pretty awesome!
Opening an ESM catalog definition file: An Earth System Model (ESM) catalog file is a JSON file that conforms
to the ESM Collection Specification. When provided a link/path to an ESM catalog file, intake-esm establishes
a link to a database (CSV file) that contains data asset locations and associated metadata
(i.e., which experiment and model they come from). The catalog JSON file can be stored on a local filesystem
or hosted on a remote server.
In [1]: import intake
In [2]: import intake_esm
In [3]: cat_url = intake_esm.tutorial.get_url("google_cmip6")
In [4]: cat = intake.open_esm_datastore(cat_url)
In [5]: cat
Out[5]: <GOOGLE-CMIP6 catalog with 4 dataset(s) from 261 asset(s)>
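To make the relationship between the two files more concrete, here is a minimal sketch in plain Python (intake-esm is not needed to run it) of a JSON descriptor that points at a CSV table of assets. The field names are intended to follow the ESM Collection Specification, but treat them as an assumption and check the specification itself; the catalog id, CSV file name, bucket, and asset paths are all made up for illustration, and the descriptor is deliberately simplified.

```python
import csv
import io
import json

# Hypothetical, simplified catalog descriptor (the JSON file).
CATALOG_JSON = """
{
  "esmcat_version": "0.1.0",
  "id": "example-catalog",
  "description": "Illustrative catalog, not a real one",
  "catalog_file": "example-catalog.csv",
  "attributes": [
    {"column_name": "source_id"},
    {"column_name": "experiment_id"},
    {"column_name": "variable_id"}
  ],
  "assets": {"column_name": "path", "format": "zarr"}
}
"""

# Hypothetical contents of the CSV "database" named by catalog_file.
CATALOG_CSV = """source_id,experiment_id,variable_id,path
ModelA,historical,o2,gs://example-bucket/ModelA/historical/o2
ModelA,ssp585,o2,gs://example-bucket/ModelA/ssp585/o2
ModelB,historical,tas,gs://example-bucket/ModelB/historical/tas
"""


def load_catalog(json_text, csv_text):
    """Parse the descriptor and return (descriptor, list of asset rows)."""
    descriptor = json.loads(json_text)
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return descriptor, rows


descriptor, rows = load_catalog(CATALOG_JSON, CATALOG_CSV)

# The descriptor says which CSV column holds asset locations
# and which columns carry the searchable metadata.
path_column = descriptor["assets"]["column_name"]
metadata_columns = [attr["column_name"] for attr in descriptor["attributes"]]

for row in rows:
    metadata = {name: row[name] for name in metadata_columns}
    print(row[path_column], metadata)
```

Each CSV row describes one data asset: one column holds its location and the remaining columns hold the metadata used for searching.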
Search and Discovery: intake-esm provides functionality to execute queries against the catalog:
In [5]: cat_subset = cat.search(
   ...:     experiment_id=["historical", "ssp585"],
   ...:     table_id="Oyr",
   ...:     variable_id="o2",
   ...:     grid_label="gn",
   ...: )
In [6]: cat_subset
Out[6]: <GOOGLE-CMIP6 catalog with 2 dataset(s) from 67 asset(s)>
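The query semantics can be pictured with a small, library-free sketch: values listed for one column are treated as alternatives, while conditions on different columns must all hold. This only illustrates that idea on made-up rows. It is not how intake-esm implements search (which operates on a dataframe), and `search_rows` is a hypothetical helper.

```python
def search_rows(rows, **query):
    """Return the rows that satisfy every column condition in `query`.

    A condition is either a single value or a list of accepted values.
    """
    # Normalize every condition to a set of accepted values.
    normalized = {
        column: set(wanted) if isinstance(wanted, (list, tuple, set)) else {wanted}
        for column, wanted in query.items()
    }
    return [
        row
        for row in rows
        if all(row.get(column) in accepted for column, accepted in normalized.items())
    ]


# Made-up catalog rows for illustration only.
rows = [
    {"experiment_id": "historical", "table_id": "Oyr", "variable_id": "o2", "grid_label": "gn"},
    {"experiment_id": "ssp585", "table_id": "Oyr", "variable_id": "o2", "grid_label": "gn"},
    {"experiment_id": "ssp585", "table_id": "Amon", "variable_id": "tas", "grid_label": "gn"},
    {"experiment_id": "piControl", "table_id": "Oyr", "variable_id": "o2", "grid_label": "gn"},
]

subset = search_rows(
    rows,
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
)
print(len(subset))  # 2
```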
Access: when the user is satisfied with the results of their query, they can load data assets (netCDF and/or Zarr stores) into xarray datasets:
In [7]: dset_dict = cat_subset.to_dataset_dict()
--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
|███████████████████████████████████████████████████████████████| 100.00% [2/2 00:18<00:00]
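As a rough sketch of how such keys could arise, the snippet below groups made-up asset rows by the six attributes named in the message above and joins their values with a dot. It only illustrates the key construction. The actual merging of grouped assets into xarray datasets is not shown, and `group_assets`, the `member_id` column, and all row values are hypothetical.

```python
from collections import defaultdict

# Attribute order taken from the key template printed by to_dataset_dict().
GROUPBY_ATTRS = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]


def group_assets(rows, groupby_attrs=GROUPBY_ATTRS, sep="."):
    """Map each dataset key to the list of asset paths that share it."""
    groups = defaultdict(list)
    for row in rows:
        key = sep.join(row[attr] for attr in groupby_attrs)
        groups[key].append(row["path"])
    return dict(groups)


# Made-up rows: two ensemble members of one experiment, plus one other experiment.
base = {
    "activity_id": "CMIP",
    "institution_id": "EX-INST",
    "source_id": "ModelA",
    "table_id": "Oyr",
    "grid_label": "gn",
}
rows = [
    {**base, "experiment_id": "historical", "member_id": "r1", "path": "assets/hist_r1"},
    {**base, "experiment_id": "historical", "member_id": "r2", "path": "assets/hist_r2"},
    {**base, "experiment_id": "ssp585", "member_id": "r1", "path": "assets/ssp585_r1"},
]

groups = group_assets(rows)
for key, paths in groups.items():
    print(key, paths)
```

In this sketch, rows that differ only in a column outside the grouping attributes (here `member_id`) end up under the same key, which is why a catalog can report fewer datasets than assets.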