Script: pcurl.sh
- pcurl.sh can and should be used in Conversion process phase: retrieve
pcurl.sh is one of the most essential scripts for providing transparency in csv2rdf4lod.
While the rest of csv2rdf4lod-automation is dedicated to converting and publishing well-structured, highly connected RDF from tabular data, pcurl.sh captures the implicit connection from all of our local processing results to the original data provided by a more authoritative source organization. Associating our local results with the original data source enables accountability, repeatability, and attribution, which are essential for establishing trust in our third-party enhancements. pcurl.sh helps us fulfill our Design Objective: Capturing and Exposing Provenance.
Conversion process phase: retrieve also shows an example of how pcurl.sh is used.
Script location: $CSV2RDF4LOD_HOME/bin/pcurl.sh
bash-3.2$ pcurl.sh
usage: pcurl.sh [-I] url [-n name] [-e extension] [url [-n name] [-e extension]]*
-I : do not download file; just obtain HTTP header information (c.f. curl -I)
url : the URL to retrieve
-n : use 'name' as the local file name.
-e : use 'extension' as the extension to the local file name.
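The usage line above describes a repeating group: each URL may be followed by its own -n and -e options. As a rough illustration of that grammar only (not the script's actual bash implementation), the argument list could be grouped as in the Python sketch below. The function name `parse_requests` and the dictionary layout are hypothetical, and the sketch covers only the options listed in the usage message.

```python
def parse_requests(argv):
    """Group pcurl.sh-style arguments into one record per URL.

    Illustrative only: mirrors the documented usage
    [-I] url [-n name] [-e extension] [url [-n name] [-e extension]]*
    and accepts -I anywhere, which is an assumption.
    """
    header_only = False
    requests = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "-I":
            header_only = True
        elif arg in ("-n", "-e"):
            # -n and -e apply to the most recently seen URL.
            if not requests or i + 1 >= len(argv):
                raise ValueError(f"{arg} needs a preceding url and a value")
            key = "name" if arg == "-n" else "extension"
            requests[-1][key] = argv[i + 1]
            i += 1  # skip the option's value
        else:
            requests.append({"url": arg, "name": None, "extension": None})
        i += 1
    return header_only, requests


header_only, requests = parse_requests(
    ["https://www.data.gov/download/1554/csv", "-n", "1554", "-e", "csv"]
)
print(header_only, requests)
```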
pcurl.sh https://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip
creates WhiteHouse-WAVES-Released-0111.zip and WhiteHouse-WAVES-Released-0111.zip.pml.ttl in the current working directory.
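The naming pattern in this example (the data file plus a sibling file with .pml.ttl appended) can be sketched in Python as follows. The helper `local_names` is hypothetical. Its handling of -n and -e as `name.extension` is an assumption inferred from the NY0261343 example later on this page, where `-n NY0261343 -e csv` stands in for redirecting curl output to NY0261343.csv. It is not taken from the script's source.

```python
import posixpath
from urllib.parse import urlsplit


def local_names(url, name=None, extension=None):
    """Return (data_file, provenance_file) under the assumed naming rule."""
    # Default to the last path segment of the URL, as in the example above.
    base = name if name else posixpath.basename(urlsplit(url).path)
    if extension:
        base = f"{base}.{extension}"
    # The provenance file sits next to the data file with .pml.ttl appended.
    return base, f"{base}.pml.ttl"


print(local_names(
    "https://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip"
))
# ('WhiteHouse-WAVES-Released-0111.zip', 'WhiteHouse-WAVES-Released-0111.zip.pml.ttl')
```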
The following blocks of PML, encoded in Turtle, form one continuous provenance capture of the URL retrieval shown above. The comment before each block describes the Turtle that follows it.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix pmlp: <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix pmlj: <http://inference-web.org/2.0/pml-justification.owl#> .
@prefix irw: <http://www.ontologydesignpatterns.org/ont/web/irw.owl#> .
@prefix nfo: <http://www.semanticdesktop.org/ontologies/nfo/#> .
@prefix conv: <http://purl.org/twc/vocab/conversion/> .
@prefix httphead: <http://inference-web.org/registry/MPR/HTTP_1_1_HEAD.owl#> .
@prefix httpget: <http://inference-web.org/registry/MPR/HTTP_1_1_GET.owl#> .
The URL from which we retrieved our file, with a modification date reported by the HTTP server:
<https://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip>
a pmlp:Source;
.
<https://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip>
a pmlp:Source;
pmlp:hasModificationDateTime "2011-01-28T23:19:12"^^xsd:dateTime;
.
The file on our local disk, which we md5 hashed:
<WhiteHouse-WAVES-Released-0111.zip>
a pmlp:Information;
pmlp:hasReferenceSourceUsage <sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_content>;
nfo:hasHash <md5_b76602e45b2e9a76869200b877d01f1c>;
.
<md5_b76602e45b2e9a76869200b877d01f1c>
a nfo:FileHash;
nfo:hashAlgorithm "md5";
nfo:hasHash "b76602e45b2e9a76869200b877d01f1c";
.
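Because the hash is recorded alongside the file, anyone holding a copy can check that it still matches what was retrieved. The sketch below shows one way to do that check in Python. The function names are hypothetical, and it assumes the recorded nfo:hasHash value is the hex MD5 digest of the file's bytes, as the "md5" algorithm label suggests.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_hex):
    """Compare a local file against the hash stored in its provenance."""
    return md5_of_file(path) == recorded_hex.strip().lower()
```

For the retrieval above, the check would be `matches_recorded_hash("WhiteHouse-WAVES-Released-0111.zip", "b76602e45b2e9a76869200b877d01f1c")`, run in the directory that holds the downloaded file.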
Justifying the existence of our file on disk as the result of curl requesting the file over HTTP:
<nodeSet_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_content>
a pmlj:NodeSet;
pmlj:hasConclusion <WhiteHouse-WAVES-Released-0111.zip>;
pmlj:isConsequentOf [
a pmlj:InferenceStep;
pmlj:hasIndex 0;
pmlj:hasAntecedentList ();
pmlj:hasSourceUsage <sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_content>;
pmlj:hasInferenceEngine conv:curl_md5_5670dffdc5533a4c57243fc97b19a654;
pmlj:hasInferenceRule httpget:HTTP_1_1_GET;
];
.
The time that we retrieved the file:
<sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_content>
a pmlp:SourceUsage;
pmlp:hasSource <https://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip>;
pmlp:hasUsageDateTime "2011-02-22T22:30:42-05:00"^^xsd:dateTime;
.
The header information of the HTTP retrieval:
<info_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>
a pmlp:Information, conv:HTTPHeader;
pmlp:hasRawString """HTTP/1.1 200 OK
ETag: "b76602e45b2e9a76869200b877d01f1c:1296256752"
Last-Modified: Fri, 28 Jan 2011 23:19:12 GMT
Accept-Ranges: bytes
Content-Length: 1247427
Content-Type: application/zip
Date: Wed, 23 Feb 2011 03:30:41 GMT
Connection: keep-alive
Server: White House
P3P: CP="NON DSP COR ADM DEV IVA OTPi OUR LEG"
""";
pmlp:hasReferenceSourceUsage <sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>;
.
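The raw header string can be related back to the other assertions in this capture. In this example the Last-Modified header carries the same instant as the pmlp:hasModificationDateTime value on the source URL. The Python sketch below parses an excerpt of the raw string and reformats the HTTP dates as xsd:dateTime lexical forms. That pcurl.sh derives the modification date in this way is an assumption based on the matching values, and the function names are hypothetical.

```python
from datetime import timedelta, timezone
from email.utils import parsedate_to_datetime

# Excerpt of the pmlp:hasRawString value shown above.
RAW_HEADER = """HTTP/1.1 200 OK
ETag: "b76602e45b2e9a76869200b877d01f1c:1296256752"
Last-Modified: Fri, 28 Jan 2011 23:19:12 GMT
Content-Length: 1247427
Date: Wed, 23 Feb 2011 03:30:41 GMT
"""


def header_fields(raw):
    """Map lower-cased header names to values, skipping the status line."""
    fields = {}
    for line in raw.splitlines()[1:]:
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower()] = value.strip()
    return fields


def http_date_to_xsd(value):
    """Format an HTTP date as an xsd:dateTime without a zone suffix."""
    return parsedate_to_datetime(value).replace(tzinfo=None).isoformat()


def http_date_to_xsd_at_offset(value, offset_hours):
    """Format an HTTP date as an xsd:dateTime at a fixed UTC offset."""
    zone = timezone(timedelta(hours=offset_hours))
    return parsedate_to_datetime(value).astimezone(zone).isoformat()


fields = header_fields(RAW_HEADER)
print(http_date_to_xsd(fields["last-modified"]))          # 2011-01-28T23:19:12
print(http_date_to_xsd_at_offset(fields["date"], -5))     # 2011-02-22T22:30:41-05:00
```

The second value is the server's Date header expressed at a -05:00 offset. It falls one second before the pmlp:hasUsageDateTime recorded for this retrieval.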
Justifying the HTTP header as the result of curl requesting an HTTP HEAD:
<nodeSet_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>
a pmlj:NodeSet;
pmlj:hasConclusion <info_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>;
pmlj:isConsequentOf [
a pmlj:InferenceStep;
pmlj:hasIndex 0;
pmlj:hasAntecedentList ();
pmlj:hasSourceUsage <sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>;
pmlj:hasInferenceEngine conv:curl_md5_5670dffdc5533a4c57243fc97b19a654;
pmlj:hasInferenceRule httphead:HTTP_1_1_HEAD;
];
.
The time that we retrieved the HTTP HEAD:
<sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>
a pmlp:SourceUsage;
pmlp:hasSource <https://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip>;
pmlp:hasUsageDateTime "2011-02-22T22:30:42-05:00"^^xsd:dateTime;
.
Identifying the curl implementation that performed the retrievals for us:
conv:curl_md5_5670dffdc5533a4c57243fc97b19a654
a pmlp:InferenceEngine, conv:Curl;
dcterms:identifier "md5_5670dffdc5533a4c57243fc97b19a654";
dcterms:description """curl 7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8l zlib/1.2.3
Protocols: tftp ftp telnet dict ldap http file https ftps
Features: GSS-Negotiate IPv6 Largefile NTLM SSL libz """;
.
conv:Curl rdfs:subClassOf pmlp:InferenceEngine .
Invoking
pcurl.sh https://www.data.gov/download/1554/csv -n 1554 -e csv
results in this PML, which is illustrated in this PDF diagram (NOTE: pre-attribution illustration).
Note the provenir:, hartigprov:, and oprov: attributes associating the InferenceStep with a user account, which in turn is related to a Person.
(Note: using VSR, which isn't published yet)
pcurl.sh `vsr-query-redirect-beginning-sources.sh ../../2010-10-31/source/sparql-service-description.ttl.pml.ttl` -e ttl
The curl command:
curl https://www.epa-echo.gov/cgi-bin/effluentdata.cgi \
-F "permit=NY0261343" -F "hits=1" > NY0261343.csv
can be performed using pcurl.sh:
pcurl.sh https://www.epa-echo.gov/cgi-bin/effluentdata.cgi \
-F "permit=NY0261343" -F "hits=1" -n NY0261343 -e csv
which captures the POST fields and values as pmlj:hasVariableMapping assertions:
<inferenceStep_d159b3fc-7c38-44e1-b68d-b1d591385d68_content>
a pmlj:InferenceStep;
pmlj:hasIndex 0;
pmlj:hasAntecedentList ();
pmlj:hasSourceUsage <sourceUsage_d159b3fc-7c38-44e1-b68d-b1d591385d68_content>;
pmlj:hasInferenceEngine conv:curl_md5_5670dffdc5533a4c57243fc97b19a654;
pmlj:hasInferenceRule httppost:HTTP_1_1_POST;
oboro:has_agent <https://tw.rpi.edu/web/inside/machine/lebot_macbook#lebot>;
hartigprov:involvedActor <https://tw.rpi.edu/web/inside/machine/lebot_macbook#lebot>;
pmlj:hasVariableMapping [ pmlj:mapFrom "permit"; pmlj:mapTo "NY0261343"; ];
pmlj:hasVariableMapping [ pmlj:mapFrom "hits"; pmlj:mapTo "1"; ];
.
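The correspondence between the -F arguments and the pmlj:hasVariableMapping lines above can be sketched as follows. This Python helper is hypothetical and only illustrates the shape of the mapping. It assumes each form field is a simple `key=value` pair and escapes only backslashes and double quotes.

```python
def turtle_string(value):
    """Quote a value as a short Turtle string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def variable_mappings(form_fields):
    """Turn curl-style 'key=value' form fields into PML mapping lines."""
    lines = []
    for field in form_fields:
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = field.partition("=")
        lines.append(
            "pmlj:hasVariableMapping [ "
            f"pmlj:mapFrom {turtle_string(key)}; "
            f"pmlj:mapTo {turtle_string(value)}; ];"
        )
    return lines


for line in variable_mappings(["permit=NY0261343", "hits=1"]):
    print(line)
```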
- bin/util/punzip.sh will extract files from a compressed file and include provenance for each file extracted.
- Script: pvload.sh uses pcurl.sh when it retrieves a URL to load into a SPARQL endpoint.
- Conversion process phase: csv-ify describes how to use justify.sh to add provenance assertions for manual edits of a file.