Directory Conventions
As described in Conversion process phase: name, csv2rdf4lod organizes third party data according to who provided it (source), what they were talking about (dataset), and when they said it (version). Short identifiers for each of these three aspects are combined to create the URI for the gathered dataset. This organizational scheme allows a data aggregator and curator to bring order to the ad hoc ways that data providers may offer their data.
To be consistent, we organize the physical filesystem directory according to the same logical organization: by source, dataset, and version. The filesystem is constructed during the Conversion process phase: retrieve.
Since the datasets that we gather are organized in the filesystem according to source, dataset, and version, the shell scripts in csv2rdf4lod-automation expect this same structure.
The logical, physical, and operational organization of the aggregated data are consistently oriented around the three essential aspects: source, dataset, and version identifiers of the dataset being collected, retrieved, converted, and published.
See Conversion process phase: retrieve for a walk through on creating a directory structure to retrieve a third party's dataset.
This page describes the filesystem directory structure that csv2rdf4lod uses to organize 1) data retrieved from third parties, 2) any modifications that an aggregator may perform, and 3) the RDF conversion outputs.
Following these conventions allows others to orient with what another developer has previously done, and facilitates collaboration among developers that are curating the same data sources. The helper scripts in csv2rdf4lod-automation also assume this directory structure when they perform their activities.
To illustrate the directory convention for gathering, manipulating, and publishing third parties' data, we'll exercise the `cr-pwd-type.sh` script from the deepest directory back to the conversion root. Running `$CSV2RDF4LOD_HOME/bin/cr-pwd-type.sh` will tell you what type of csv2rdf4lod directory you are in. First, the conversion cockpit is the place where a specific dataset is collected, manipulated, converted, and published. The higher directories simply organize conversion cockpits.
```
/opt/logd/data/source/worldbank-org/world-development-indicators/version/2011-Jul-29$ cr-pwd-type.sh ; cd ..
cr:conversion-cockpit
/opt/logd/data/source/worldbank-org/world-development-indicators/version$ cr-pwd-type.sh ; cd ..
cr:directory-of-versions
/opt/logd/data/source/worldbank-org/world-development-indicators$ cr-pwd-type.sh ; cd ..
cr:dataset
/opt/logd/data/source/worldbank-org$ cr-pwd-type.sh ; cd ..
cr:source
/opt/logd/data/source$ cr-pwd-type.sh ; cd ..
cr:data-root
/opt/logd/data$ cr-pwd-type.sh
Not recognized; see https://github.com/timrdf/csv2rdf4lod-automation/wiki/Directory-Conventions
/opt/logd/data$ cr-pwd-type.sh --types
cr:data-root cr:source cr:directory-of-datasets cr:dataset cr:directory-of-versions cr:conversion-cockpit
```
See Conversion process phase: retrieve for a tutorial to create the directory structure from the ground up, to retrieve a dataset in preparation to convert it to RDF.
This section contains technical notes for how to write automation scripts to work within the directory structure.
Running `$CSV2RDF4LOD_HOME/bin/util/is-pwd-a.sh` will return `yes` if the current directory is of the given type, or `no` if it is not. It also lists the possible types with `--types`.

`$CSV2RDF4LOD_HOME/bin/util/pwd-not-a.sh` prints a consistent error message for each expected directory type given, and also lists the types with `--types`.
Unfortunately, many data providers do not make it straightforward to obtain their data, rendering a direct URL request inadequate.
If our source identifier, dataset identifier, and version identifier were `SSS`, `DDD`, and `VVV`, respectively, the directory structure becomes:

```
what-you-want/source/SSS/DDD/version/VVV/source/their.csv
what-you-want/source/SSS/DDD/version/VVV/source/their.csv.pml.ttl
```
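The skeleton above can be created with ordinary `mkdir -p`. A minimal sketch, where the `/tmp` root and the worldbank identifiers are hypothetical stand-ins for your own:

```shell
# Sketch: create the source/SSS/DDD/version/VVV/source/ skeleton.
# All names below are hypothetical stand-ins for your own identifiers.
SSS="worldbank-org"                    # source identifier
DDD="world-development-indicators"     # dataset identifier
VVV="2011-Jul-29"                      # version identifier

root="/tmp/demo-data-root"             # your "what-you-want" directory
cockpit="$root/source/$SSS/$DDD/version/$VVV"

mkdir -p "$cockpit/source"             # where their.csv will be staged
touch "$cockpit/source/their.csv"      # placeholder for the retrieved file
echo "$cockpit/source/their.csv"
```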
If it took more than a URL request to get `their.csv`, some custom code might be required. In this case, we recommend setting up shop at:

```
what-you-want/source/SSS/DDD/src/
what-you-want/source/SSS/DDD/bin/
```

Then, when `DDD/version/2source.sh` (see Automated creation of a new Versioned Dataset) is used to automatically obtain the source organization's data, it invokes the custom scrapers in `DDD/bin` or `DDD/src` that needed to be cobbled together.
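The division of labor can be sketched end to end. Everything below is hypothetical (a real `2source.sh` is hand-written per dataset, and `scrape.sh` stands in for whatever custom code the source requires):

```shell
# Sketch of the DDD/bin + DDD/version/2source.sh layout.
ddd=/tmp/demo-DDD
mkdir -p "$ddd/bin" "$ddd/version"

# Stand-in "scraper" for the custom code kept in DDD/bin/.
cat > "$ddd/bin/scrape.sh" <<'EOF'
#!/bin/sh
echo "id,value"
echo "1,42"
EOF
chmod +x "$ddd/bin/scrape.sh"

# 2source.sh stages the scraper's output into source/ for conversion.
cat > "$ddd/version/2source.sh" <<'EOF'
#!/bin/sh
mkdir -p source
../bin/scrape.sh > source/their.csv
EOF
chmod +x "$ddd/version/2source.sh"

( cd "$ddd/version" && ./2source.sh )
head -1 "$ddd/version/source/their.csv"
```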
Some automation scripts only make sense to run within certain types of directories. For other scripts, it may make sense to do different things according to the directory type from which it is invoked.
Each directory level below the conversion root has a csv2rdf4lod directory type (from root to deepest):
- cr:data-root (e.g. `source/`)
- cr:source (e.g. `source/hub-healthdata-gov`)
- cr:directory-of-datasets
- cr:dataset
- cr:directory-of-versions
- cr:conversion-cockpit (e.g. `source/hub-healthdata-gov/hospital-compare/version/2012-Jul-17`)

(cr:bone and cr:dev are also valid tests)
`$CSV2RDF4LOD_HOME/bin/cr-dataset-uri.sh` shows a simple boilerplate that can be used to check that a script is running in the expected directory types. This abstracts away the actual location, consolidating the logic into a single place. Calling `$CSV2RDF4LOD_HOME/bin/util/pwd-not-a.sh` will print consistent error messages for the given expected directory types.
```bash
see='https://github.com/timrdf/csv2rdf4lod-automation/wiki/CSV2RDF4LOD-not-set'
CSV2RDF4LOD_HOME=${CSV2RDF4LOD_HOME:?"not set; source csv2rdf4lod/source-me.sh or see $see"}

# cr:data-root cr:source cr:directory-of-datasets cr:dataset cr:directory-of-versions cr:conversion-cockpit
ACCEPTABLE_PWDs="cr:dataset cr:directory-of-versions cr:conversion-cockpit"
if [ `${CSV2RDF4LOD_HOME}/bin/util/is-pwd-a.sh $ACCEPTABLE_PWDs` != "yes" ]; then
   ${CSV2RDF4LOD_HOME}/bin/util/pwd-not-a.sh $ACCEPTABLE_PWDs
   exit 1
fi

TEMP="_"`basename $0``date +%s`_$$.tmp
```
For example, since `$CSV2RDF4LOD_HOME/bin/cr-dataset-uri.sh` only works in directories of type cr:dataset, cr:directory-of-versions, or cr:conversion-cockpit, it uses `$CSV2RDF4LOD_HOME/bin/util/pwd-not-a.sh` to provide the following error output:
```
bash-3.2$ cr-dataset-uri.sh
Working directory does not appear to be a dataset. You can run this from a dataset.
(e.g. $whatever/source/mySOURCE/myDATASET/).
Working directory does not appear to be a directory of versions. You can run this from a directory of versions.
(e.g. $whatever/source/mySOURCE/myDATASET/version/).
Working directory does not appear to be a conversion cockpit. You can run this from a conversion cockpit.
(e.g. $whatever/source/mySOURCE/myDATASET/version/VVV/).
```
$CSV2RDF4LOD_HOME/bin/util/cr-trim-logs.sh is an initial example for how to use the is-pwd-a.sh pattern to recursively process the entire data skeleton.
```bash
sourceID=`is-pwd-a.sh cr:bone --id-of source`
datasetID=`is-pwd-a.sh cr:bone --id-of dataset`
versionID=`is-pwd-a.sh cr:bone --id-of version`
```
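When `is-pwd-a.sh` is not at hand, the same three identifiers can be read straight off a conversion-cockpit path with plain parameter expansion. A simplified stand-in for the three `--id-of` lookups, using the example path from the walkthrough above:

```shell
# Peel the source, dataset, and version identifiers off a
# conversion-cockpit path (source/SSS/DDD/version/VVV).
pwd_example="/opt/logd/data/source/worldbank-org/world-development-indicators/version/2011-Jul-29"

versionID="${pwd_example##*/}"                      # last segment: VVV
rest="${pwd_example%/version/*}"                    # drop /version/VVV
datasetID="${rest##*/}"                             # next segment up: DDD
sourceID="${rest%/*}"; sourceID="${sourceID##*/}"   # segment above: SSS

echo "$sourceID $datasetID $versionID"
```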
Note that the shell scripts provided in csv2rdf4lod-automation are not required in order to use csv2rdf4lod.jar. So if you don't like these directory conventions and want to start from scratch to name, retrieve, tweak, convert, and publish, feel free! Everything ultimately revolves around invoking the jar.
(See this list, too.)
- bin/cr-create-versioned-dataset-dir.sh
- bin/cr-dataset-uri.sh
- bin/cr-publish-cockpit.sh
- bin/cr-publish-params-to-endpoint.sh
- bin/cr-publish-sameas-to-endpoint.sh
- bin/cr-publish-void-to-endpoint.sh
- bin/cr-pull-conversion-triggers.sh
- bin/util/cr-headers.sh
- bin/util/cr-list-enhancement-identifiers.sh
- bin/util/cr-list-sources-datasets.sh
- bin/util/cr-list-versions.sh
- bin/util/cr-load-endpoint-metadata.sh
- bin/util/cr-load-endpoint-metadataset.sh
- bin/util/cr-make-today-version.sh
- bin/util/cr-test-conversion.sh
- bin/util/cr-trim-logs.sh
- bin/util/cr-trim-reproducible-output.sh
- bin/util/google2source.sh
Parts of the pattern:
- pwd-not-a.sh
- is-pwd-a.sh
- cr-pwd.sh
(Note: anything with ANCHOR in it needs to be updated to the new boilerplate)
Recursive automation scripts are directory-type sensitive, but also call themselves when at a particular directory type. (Starred `*` scripts are exemplars for the pattern; fewer (non-zero) pluses `+` means developed more recently.)
(See this list, too.)
- bin/util/cr-sdv.sh
- bin/util/cr-list-versioned-dataset-dumps.sh +* (handles for directories.sh in one condition, prepared for issue 311)
- util/cr-droid.sh ++
- bin/cr-publish-cockpit.sh +++*
- bin/util/cr-trim-reproducible-output.sh +++*
- bin/util/cr-trim-logs.sh +++*
- bin/util/cr-list-versions.sh +++
- bin/cr-pull-conversion-triggers.sh +++
- bin/util/cr-test-conversion.sh +++
- bin/util/cr-conversion-root.sh ++++* (walks down the directories, not up)
- bin/cr-list-conversion-triggers.sh +++++* (uses find instead of enumerating depths)
- bin/cr-retrieve.sh ++++++*
- bin/util/cr-make-today-version.sh +++++++
Any script with `$0 $*` in it is likely to be recursive.
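The `$0 $*` idiom can be sketched in isolation. The script below is a hypothetical toy, not one of the cr-* scripts (which decide when to recurse using `is-pwd-a.sh`): it re-invokes itself in every subdirectory, in a subshell so the caller's working directory is untouched, until it bottoms out.

```shell
# Toy illustration of the recursive `$0 $*` idiom.
cat > /tmp/demo-recurse.sh <<'EOF'
#!/bin/sh
found=no
for d in */; do
  [ -d "$d" ] || continue
  found=yes
  ( cd "$d" && "$0" "$@" )   # subshell: re-invoke self one level down
done
[ "$found" = "no" ] && echo "leaf: $PWD"
exit 0
EOF
chmod +x /tmp/demo-recurse.sh

mkdir -p /tmp/demo-tree/a/x /tmp/demo-tree/b
( cd /tmp/demo-tree && /tmp/demo-recurse.sh )
```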
Using dryrun.sh in a script that doesn't dry run by default:
```bash
dryRun="false"
if [ "$1" == "-n" ]; then
   dryRun="true"
   dryrun.sh $dryRun beginning
   shift
fi
...
dryrun.sh $dryRun ending
```
Using dryrun.sh in a script that dry runs by default:

```bash
write="no"
if [[ "$1" == "-w" || "$1" == "--write" ]]; then
   write="yes"
   shift
else
   dryrun.sh yes beginning
fi
...
if [ "$write" != "yes" ]; then
   dryrun.sh yes ending
fi
```
Scripts that work from where they are:
- bin/pl-situate-classpath.sh uses `$(cd ${0%/*} && echo ${PWD%/*})`
- bin/df-mirror-endpoint.sh uses `HOME=$(cd ${0%/*} && echo ${PWD%/*})` and `` me=$(cd ${0%/*} && echo ${PWD})/`basename $0` ``
bin/dataset/pr-neighborlod.sh handles symlinks, and builds its PATH and CLASSPATH:

```bash
[ -n "`readlink $0`" ] && this=`readlink $0` || this=$0
HOME=$(cd ${this%/*/*/*} && pwd)
export PATH=$PATH`$HOME/bin/install/paths.sh`
export CLASSPATH=$CLASSPATH`$HOME/bin/install/classpaths.sh`

HOME=$(cd ${0%/*} && echo ${PWD%/*})
me=$(cd ${0%/*} && echo ${PWD})/`basename $0`
```
spo-balance.sh:

```bash
VSR_HOME=$(cd ${0%/*} && echo ${PWD%/*})
me=$(cd ${0%/*} && echo ${PWD})/`basename $0`
```
bin/util/install-csv2rdf4lod-dependencies.sh:

```bash
this=$(cd ${0%/*} && echo $PWD/${0##*/})
base=${this%/bin/util/install-csv2rdf4lod-dependencies.sh}
base=${base%/*}
```
Setting classpath and path without depending on `CSV2RDF4LOD_HOME` being set:

```bash
# from e.g. bin/cr-retrieve.sh
HOME=$(cd ${0%/*} && echo ${PWD%/*})
export PATH=$PATH`$HOME/bin/util/cr-situate-paths.sh`
export CLASSPATH=$CLASSPATH`$HOME/bin/util/cr-situate-classpaths.sh`
CSV2RDF4LOD_HOME=${CSV2RDF4LOD_HOME:-$HOME}
```

```bash
# from e.g. bin/util/rdf2nt.sh (one extra /*)
HOME=$(cd ${0%/*/*} && echo ${PWD%/*})
export PATH=$PATH`$HOME/bin/util/cr-situate-paths.sh`
export CLASSPATH=$CLASSPATH`$HOME/bin/util/cr-situate-classpaths.sh`
CSV2RDF4LOD_HOME=${CSV2RDF4LOD_HOME:-$HOME}
```
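All of the snippets above lean on the same parameter-expansion idiom. A self-contained illustration, using a made-up path in place of `$0` (the real scripts additionally `cd` and read `$PWD` so that a relative `$0` still yields an absolute root):

```shell
# How the ${0%/*} idiom resolves a script's install root.
script="/opt/tool/bin/cr-retrieve.sh"   # hypothetical stand-in for $0

bin_dir="${script%/*}"    # strip /cr-retrieve.sh  -> /opt/tool/bin
home="${bin_dir%/*}"      # strip /bin             -> /opt/tool
name="${script##*/}"      # keep only the basename -> cr-retrieve.sh

echo "$bin_dir $home $name"
```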