Recognizing Dataset File Formats Using DROID
This GitHub wiki documents the technology behind the Linked Data aggregation site https://healthdata.tw.rpi.edu.
- Mirroring a Source CKAN Instance to get a DCAT metadata description, which includes the download URL.
- Retrieving CKAN's Dataset Distribution Files acting upon the DCAT access metadata to download the data files.
- DROID is a tool that works with PRONOM - a project from the UK's National Archives.
This page describes how to use DROID to recognize the file formats of the datasets listed in CKAN. We use medlineplus-health-topic-files/dcat.ttl as an example.
We start with the download URL of a dataset, which is given in dcat.ttl as discussed in Mirroring a Source CKAN Instance (the metadata says that the dataset is available as XML):
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix conversion: <http://purl.org/twc/vocab/conversion/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#> .
@prefix : <https://purl.org/twc/health/id/> .
<https://purl.org/twc/health/source/hub-healthdata-gov/dataset/medlineplus-health-topic-files>
a void:Dataset, dcat:Dataset;
conversion:source_identifier "hub-healthdata-gov";
conversion:dataset_identifier "medlineplus-health-topic-files";
prov:wasDerivedFrom :as_a_xml_d3f7e94a20273d1937878d5baf53197a;
.
:as_a_xml_d3f7e94a20273d1937878d5baf53197a
a dcat:Distribution;
dcat:downloadURL <https://www.nlm.nih.gov/medlineplus/xml/mplus_topics_compressed_2012-12-15.zip>;
dcterms:format [ rdfs:label "XML" ];
.
<https://healthdata.tw.rpi.edu/hub/dataset/medlineplus-health-topic-files>
a dcat:Dataset, datafaqs:CKANDataset;
dcat:distribution :as_a_xml_d3f7e94a20273d1937878d5baf53197a;
prov:wasAttributedTo <https://healthdata.tw.rpi.edu>;
.
<https://healthdata.tw.rpi.edu/hub/dataset/medlineplus-health-topic-files>
prov:alternateOf <https://hub.healthdata.gov/dataset/medlineplus-health-topic-files>;
.
<https://hub.healthdata.gov/dataset/medlineplus-health-topic-files>
a dcat:Dataset, datafaqs:CKANDataset;
prov:alternateOf <https://healthdata.tw.rpi.edu/hub/dataset/medlineplus-health-topic-files>;
prov:wasAttributedTo <https://hub.healthdata.gov>;
.
#3> <> prov:wasGeneratedBy [
#3> a prov:Activity;
#3> prov:qualifiedAssociation [
#3> a prov:Association;
#3> prov:hadPlan <https://github.com/timrdf/csv2rdf4lod-automation/blob/master/bin/cr-create-dataset-dirs-from-ckan.py>;
#3> ];
#3> rdfs:seeAlso <https://github.com/jimmccusker/twc-healthdata/wiki/Accessing-CKAN-listings>;
#3> ] .
#3> <https://github.com/timrdf/csv2rdf4lod-automation/blob/master/bin/cr-create-dataset-dirs-from-ckan.py>
#3> a prov:Plan;
#3> dcterms:title "cr-create-dataset-dirs-from-ckan.py" ;
#3> .
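To illustrate how the download URL can be read out of a DCAT description like the one above, here is a minimal Python sketch. It is not part of csv2rdf4lod-automation; the function name `download_urls` is hypothetical, and the sketch assumes the `dcat:` prefix is written exactly as in the listing. A regular expression stands in for a real Turtle parser, so it only handles the simple shape shown here.

```python
import re

# Assumption: the file uses the literal prefix "dcat:" as in the listing above.
# A regular expression is a shortcut here, not a full Turtle parser.
DOWNLOAD_URL = re.compile(r'dcat:downloadURL\s+<([^>]+)>')


def download_urls(turtle_text):
    """Return every dcat:downloadURL IRI found in the text, in document order."""
    return DOWNLOAD_URL.findall(turtle_text)


# Excerpt of the distribution description shown above.
sample = '''
:as_a_xml_d3f7e94a20273d1937878d5baf53197a
   a dcat:Distribution;
   dcat:downloadURL <https://www.nlm.nih.gov/medlineplus/xml/mplus_topics_compressed_2012-12-15.zip>;
   dcterms:format [ rdfs:label "XML" ];
.
'''

for url in download_urls(sample):
    print(url)
```

For anything beyond a quick check, a real RDF library would be the safer choice, since Turtle allows full IRIs and other prefix names for the same predicate.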
cr-retrieve.sh downloads the file according to the DCAT above. The following commands will download the file and unpack any zip files (while capturing provenance):
>cd medlineplus-health-topic-files
>cr-retrieve.sh
...
>cd version/2012-Dec-15/source
>ls
mplus_topics_2012-12-15.xml
mplus_topics_2012-12-15.xml.pml.ttl
mplus_topics_compressed_2012-12-15.zip
mplus_topics_compressed_2012-12-15.zip.pml.ttl
cr-droid.sh accepts a list of directories or files whose formats should be identified. It prints a Turtle description to stdout. File references in the output are relative to the directory where the script was run.
>cr-droid.sh --help
usage: cr-droid.sh [--help] (--conversion-cockpit-sources | (<dir> | <file)+)
--conversion-cockpit-sources: identify all files in every conversion cockpit's source/ directory.
<dir> and <file> arguments have no affect with this option.
<dir>: a directory whose files should be format identified.
<file>: a file that should be format identified.
>cr-droid.sh . > droid.ttl; cat droid.ttl
...
@prefix dcterms: <http://purl.org/dc/terms/> .
<mplus_topics_compressed_2012-12-15.zip> dcterms:format <https://provenanceweb.org/formats/pronom/x-fmt/263> .
<mplus_topics_compressed_2012-12-15.zip>
dcterms:hasPart <mplus_topics_compressed_2012-12-15.zip/mplus_topics_2012-12-15.xml>;
.
<mplus_topics_compressed_2012-12-15.zip/mplus_topics_2012-12-15.xml>
dcterms:isPartOf <mplus_topics_compressed_2012-12-15.zip>;
dcterms:format <https://provenanceweb.org/formats/pronom/fmt/101>;
.
<mplus_topics_2012-12-15.xml> dcterms:format <https://provenanceweb.org/formats/pronom/fmt/101> .
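The Turtle output above can be summarized as a mapping from each file to its PRONOM identifier. The following Python sketch shows one way to do that; it is an illustration only, not something cr-droid.sh provides. The function name `formats_by_file` is hypothetical, and the line-based parsing assumes the output keeps the shape shown above (a subject in angle brackets at the start of a line, `dcterms:format` statements on that line or on following lines).

```python
import re

# A line that begins with <...> starts a new subject (a file reference).
SUBJECT = re.compile(r'^\s*<([^>]*)>')
# A dcterms:format statement and the IRI it points to.
FORMAT = re.compile(r'dcterms:format\s+<([^>]+)>')
# Base IRI of the format identifiers, copied from the output above.
PRONOM_BASE = 'https://provenanceweb.org/formats/pronom/'


def formats_by_file(turtle_text):
    """Map each file reference to its PRONOM identifier (e.g. 'fmt/101')."""
    formats = {}
    subject = None
    for line in turtle_text.splitlines():
        subject_match = SUBJECT.match(line)
        if subject_match:
            subject = subject_match.group(1)
        format_match = FORMAT.search(line)
        if format_match and subject is not None:
            iri = format_match.group(1)
            if iri.startswith(PRONOM_BASE):
                iri = iri[len(PRONOM_BASE):]  # keep only the short identifier
            formats[subject] = iri
    return formats


# The droid.ttl content shown above.
droid_ttl = '''
@prefix dcterms: <https://purl.org/dc/terms/> .
<mplus_topics_compressed_2012-12-15.zip> dcterms:format <https://provenanceweb.org/formats/pronom/x-fmt/263> .
<mplus_topics_compressed_2012-12-15.zip>
   dcterms:hasPart <mplus_topics_compressed_2012-12-15.zip/mplus_topics_2012-12-15.xml>;
.
<mplus_topics_compressed_2012-12-15.zip/mplus_topics_2012-12-15.xml>
   dcterms:isPartOf <mplus_topics_compressed_2012-12-15.zip>;
   dcterms:format <https://provenanceweb.org/formats/pronom/fmt/101>;
.
<mplus_topics_2012-12-15.xml> dcterms:format <https://provenanceweb.org/formats/pronom/fmt/101> .
'''

for name, puid in formats_by_file(droid_ttl).items():
    print(name, puid)
```

Note that the zip file and the XML file inside it are reported separately, so the mapping has three entries for the two files on disk.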
The argument --conversion-cockpit-sources will write a file cr-droid.ttl in all conversion cockpits' source/ directories.
$ cr-pwd.sh
source/
$ cr-droid.sh --conversion-cockpit-sources
[INFO] source/hub-healthdata-gov/eldercare-locator-database/version/2012-Dec-19 already has source/cr-droid.ttl; skipping.
[INFO] source/hub-healthdata-gov/drugs-lactation-database-lactmed/version/2012-Dec-19 already has source/cr-droid.ttl; skipping.
...
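To see which conversion cockpits have not yet been processed, one could look for source/ directories without a cr-droid.ttl. The Python sketch below does this under an assumed layout, inferred from the log messages above: `<source-id>/<dataset-id>/version/<version-id>/source/`, relative to the directory where the command was run. The function name `cockpits_missing_droid` is hypothetical and is not part of the csv2rdf4lod-automation scripts.

```python
from pathlib import Path


def cockpits_missing_droid(root):
    """List conversion cockpits under root whose source/ lacks cr-droid.ttl.

    Assumed layout (inferred from the log output, not confirmed):
    <root>/<source-id>/<dataset-id>/version/<version-id>/source/
    """
    root = Path(root)
    missing = []
    for cockpit in sorted(root.glob('*/*/version/*')):
        source_dir = cockpit / 'source'
        if source_dir.is_dir() and not (source_dir / 'cr-droid.ttl').exists():
            missing.append(cockpit.relative_to(root).as_posix())
    return missing


if __name__ == '__main__':
    # Run from the directory that holds the source identifiers.
    for cockpit in cockpits_missing_droid('.'):
        print(cockpit)
```

Cockpits that already contain source/cr-droid.ttl are left out of the list, which mirrors the "skipping" messages in the log.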