frbr:lebo2011twed2
This page has materials for my second TWed talk, which continues the introduction to csv2rdf4lod given during last month's TWed talk.
For this talk (slides), I promised to cover the converter's "advanced features". Before we do (some of) that, let's review the features we covered last time:
- Integration stages: name, retrieve, (adjust), convert, (enhance), publish, (enhance).
- The "Big Three" identifiers used to name a dataset: Source, Dataset, and Version.
- Smart, naive bootstrap to contextualize the names of entities, predicates, and classes.
- incrementally "peel away" context to increase interoperability.
- Raw versus Enhanced: two Layers describing the same entities.
- incremental and backward compatible (i.e., monotonic).
- Enhancement parameters are declarative, RDF, lightweight, domain-independent, and re-applicable. We covered seven of them:
- Structural enhancements: conversion:HeaderRow because the header was on the fifth row.
- symbol/interpretation conversion:null (doc) to prevent empty cells from producing empty triples.
- conversion:domain_name to rdf:type the rows.
- conversion:comment to add an rdfs:comment to the resulting predicate.
- conversion:range rdfs:Resource to promote a cell value to a URI.
- conversion:range_name to rdf:type the resources created from table cells.
- conversion:links_via to assert owl:sameAs to DBPedia, GovTrack, and GeoNames.
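Two of the enhancements listed above, the conversion:null interpretation and the promotion of a cell value to a URI, can be illustrated with a small Python sketch. This is not how csv2rdf4lod is implemented; the function names, the example.org base URI, and the rule of replacing spaces with underscores are assumptions made for illustration only.

```python
from urllib.parse import quote

# Cell values treated as null, mirroring `conversion:symbol ""` combined with
# `conversion:interpretation conversion:null`.
NULL_SYMBOLS = {""}

# Hypothetical namespace for promoted cell values (not a real csv2rdf4lod URI).
TYPED_BASE = "https://example.org/typed/state/"


def cell_to_object(value, promote_to_uri=False):
    """Return None for a null cell, a URI for a promoted cell, else the literal."""
    if value in NULL_SYMBOLS:
        return None
    if promote_to_uri:
        # Assumption: spaces become underscores when minting the local name.
        return TYPED_BASE + quote(value.strip().replace(" ", "_"))
    return value


def row_to_triples(subject, cells):
    """cells is a list of (predicate, cell value, promote_to_uri) tuples."""
    triples = []
    for predicate, value, promote in cells:
        obj = cell_to_object(value, promote)
        if obj is not None:  # null cells produce no triple at all
            triples.append((subject, predicate, obj))
    return triples


triples = row_to_triples(
    "farmersMarket_1367",
    [
        ("locaddstate", "Hawaii", True),
        ("locaddcity", "Kailua-Kona", False),
        ("column_8", "", False),
    ],
)
for triple in triples:
    print(triple)
```

The empty eighth column yields no triple, and the state name becomes a resource that can later carry its own type, label, and owl:sameAs links.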
By the end of the tutorial that night, we got the enhancement parameters from the default:
#conversion:interpret [
# conversion:symbol "";
# conversion:interpretation conversion:null;
#];
#conversion:enhance [
# conversion:domain_template "tool_[r]";
# conversion:domain_name "Tool";
#];
#conversion:enhance [
# conversion:class_name "Tool";
# conversion:subclass_of <https://purl.org/...>;
#];
conversion:enhance [
ov:csvCol 1;
ov:csvHeader "Geographic Coordinates for U.S. Farmers Markets";
#conversion:label "Geographic Coordinates for U.S. Farmers Markets";
conversion:comment "";
conversion:range todo:Literal;
];
to enhancement parameters that produce slightly better RDF:
conversion:enhance [
ov:csvRow 5;
a conversion:HeaderRow;
];
conversion:interpret [
conversion:symbol "";
conversion:interpretation conversion:null;
];
conversion:enhance [
# conversion:domain_template "tool_[r]";
conversion:domain_name "FarmersMarket";
];
#conversion:enhance [
# conversion:class_name "Tool";
# conversion:subclass_of <https://purl.org/...>;
#];
conversion:enhance [
ov:csvCol 1;
ov:csvHeader "locaddstate";
conversion:comment "State that the farmers' market is in.";
conversion:range rdfs:Resource; # was rdfs:Literal
conversion:range_name "State"; # was rdfs:Literal
# Lod-linking:(owl:sameAs)
conversion:links_via <https://www.rpi.edu/~lebot/lod-links/state-fips-dbpedia.ttl>,
<https://www.rpi.edu/~lebot/lod-links/state-fips-geonames.ttl>,
<https://www.rpi.edu/~lebot/lod-links/state-fips-govtrack.ttl>;
conversion:subject_of dcterms:identifier;
];
...
conversion:enhance [
ov:csvCol 6;
conversion:equivalent_property wgs:long;
conversion:range xsd:decimal;
];
conversion:enhance [
ov:csvCol 7;
conversion:equivalent_property wgs:lat;
conversion:range xsd:decimal;
];
The enhancement parameters above got us from raw RDF that looked like:
@prefix ds4383: <https://localhost/source/data-gov/dataset/4383/version/2011-Sep-28/> .
@prefix raw: <https://localhost/source/data-gov/dataset/4383/vocab/raw/> .
ds4383:thing_1367
dcterms:isReferencedBy <https://localhost/source/data-gov/dataset/4383/version/2011-Sep-28> ;
void:inDataset <https://localhost/source/data-gov/dataset/4383/version/2011-Sep-28> ;
raw:column_1 "Hawaii" ;
raw:column_2 "Alii Garden Market Place" ;
raw:column_3 "75-6129 Alii Drive" ;
raw:column_4 "Kailua-Kona" ;
raw:column_5 "96740" ;
raw:column_6 "-155.9819183" ;
raw:column_7 "19.61436844" ;
raw:column_8 "" ;
ov:csvRow "1367"^^xsd:integer .
to enhanced RDF that looks like:
@prefix ds4383: <https://localhost/source/data-gov/dataset/4383/version/2011-Sep-28/> .
@prefix ds4383_vocab: <https://localhost/source/data-gov/dataset/4383/vocab/> .
@prefix e1: <https://localhost/source/data-gov/dataset/4383/vocab/enhancement/1/> .
ds4383:farmersMarket_1367
dcterms:isReferencedBy <https://localhost/source/data-gov/dataset/4383/version/2011-Sep-28> ;
void:inDataset <https://localhost/source/data-gov/dataset/4383/version/2011-Sep-28> ;
a ds4383_vocab:FarmersMarket ;
e1:locaddstate typed_state:Hawaii ;
e1:mktname "Alii Garden Market Place" ;
e1:locaddst "75-6129 Alii Drive" ;
e1:locaddcity "Kailua-Kona" ;
e1:locaddzip "96740" ;
wgs:long "-155.9819183"^^xsd:decimal ;
wgs:lat "19.61436844"^^xsd:decimal ;
ov:csvRow "1367"^^xsd:integer .
@prefix govtrackusgov: <https://www.rdfabout.com/rdf/usgov/geo/us/> .
@prefix dbpedia: <https://dbpedia.org/resource/> .
typed_state:Hawaii
dcterms:identifier "Hawaii" ;
a ds4383_vocab:State ;
rdfs:label "Hawaii" ;
owl:sameAs <https://sws.geonames.org/5855797/> , govtrackusgov:HI , dbpedia:Hawaii .
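The prefixes in the raw and enhanced examples above follow a pattern built from the "Big Three" identifiers (source, dataset, version). A minimal sketch of that pattern, inferred only from the URIs shown on this page; the function names are hypothetical and are not part of csv2rdf4lod:

```python
def dataset_uris(base, source, dataset, version):
    """Compose the URIs implied by the source, dataset, and version identifiers."""
    dataset_uri = f"{base}/source/{source}/dataset/{dataset}"
    return {
        "dataset": dataset_uri,
        "version": f"{dataset_uri}/version/{version}",
        "vocab": f"{dataset_uri}/vocab/",
        "raw_vocab": f"{dataset_uri}/vocab/raw/",
    }


def enhancement_vocab(base, source, dataset, layer):
    """Namespace for predicates minted by a given enhancement layer."""
    return f"{base}/source/{source}/dataset/{dataset}/vocab/enhancement/{layer}/"


uris = dataset_uris("https://localhost", "data-gov", "4383", "2011-Sep-28")
print(uris["version"])
print(enhancement_vocab("https://localhost", "data-gov", "4383", 1))
```

Note that the vocabulary namespaces hang off the dataset rather than the version, which is consistent with the raw: and e1: prefixes above not mentioning 2011-Sep-28.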
Although the above enhanced RDF that we got during the tutorial is better than the raw, the following is even better because it reuses existing vocabulary that is recognized by existing systems:
@prefix con: <https://www.w3.org/2000/10/swap/pim/contact#> .
@prefix implicit_address:
<https://localhost/source/data-gov/dataset/4383/version/2011-Sep-27/http_www_w3_org_2000_10_swap_pim_contact_address/>
.
ds4383:farmersMarket_1367
dcterms:isReferencedBy <https://localhost/source/data-gov/dataset/4383/version/2011-Sep-27> ;
void:inDataset <https://localhost/source/data-gov/dataset/4383/version/2011-Sep-27> ;
a ds4383_vocab:FarmersMarket ;
con:address implicit_address:address_1367 ;
dcterms:title "Alii Garden Market Place" ;
wgs:lat "19.61436844"^^xsd:decimal ;
wgs:long "-155.9819183"^^xsd:decimal ;
ov:csvRow "1367"^^xsd:integer .
implicit_address:address_1367
a con:Address ;
con:stateOrProvince typed_state:Hawaii ,
<https://sws.geonames.org/5855797/> ,
govtrackusgov:HI , dbpedia:Hawaii ;
con:street "75-6129 Alii Drive" ;
con:city "Kailua-Kona" ;
con:zip "96740" .
typed_state:Hawaii
dcterms:identifier "Hawaii" ;
a ds4383_vocab:State ;
rdfs:label "Hawaii" ;
owl:sameAs <https://sws.geonames.org/5855797/> , govtrackusgov:HI , dbpedia:Hawaii .
- Finish up Farmers Market example
- Reconstruct (and verify) the RDF
- Start and Finish RPI Research Center example
- Reconstruct (and verify) the RDF
- Layer 1 versus Layer 2 (cell-based qb)
- Subclass enhancement
- Templates to consolidate People
- LOD-Linking from SPARQL query
- SPARQL Named graph organization
- Publish/bin/*
Since someone (me) already went through the effort to specify the enhancement parameters, others (us) can reapply them to reconstruct the same RDF. We can reconstruct it using the following commands.
mkdir ~/Desktop/reproduce; cd ~/Desktop/reproduce
svn checkout https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov/4383 source/data-gov/4383
cd source/data-gov/4383/version/
./retrieve.sh
cd 2011-Nov-02
The above commands will retrieve and enhance the original tabular data (skipping the raw layer, because that is useless). But how do we know we "got it right"? We can use cr-test-conversion.sh to run SPARQL query unit tests that are version controlled in the data skeleton, along with retrieve.sh and the enhancement parameters. Testing is done within the conversion cockpit, just like conversion.
cd 2011-Nov-02
cr-test-conversion.sh --setup -v
will populate a TDB triple store in a local directory and run the unit tests against it:
../../rq/test/ask/present/alabama-lod-linked-directly-referenced.rq (Ask => Yes)
?address
con:stateOrProvince typed_state:Alabama, dbpedia:Alabama , govtrackus:AL ,
<https://sws.geonames.org/4829764/> ;
con:zip "36420";
.
typed_state:Alabama
dcterms:identifier "Alabama";
rdfs:label "Alabama";
owl:sameAs dbpedia:Alabama , govtrackus:AL , <https://sws.geonames.org/4829764/>
.
................................................................................
../../rq/test/ask/present/alabama-lod-linked-indirectly-referenced.rq (Ask => Yes)
?address
con:stateOrProvince typed_state:Alabama;
con:zip "36420";
.
typed_state:Alabama
dcterms:identifier "Alabama";
rdfs:label "Alabama";
owl:sameAs dbpedia:Alabama , govtrackus:AL , <https://sws.geonames.org/4829764/>
.
--------------------------------------------------------------------------------
2 of 2 passed
This enhanced layer is used in Alvaro's Farmers Markets demo:

New dataset: RPI administration handed us a spreadsheet a couple of weeks ago and said, "show it to us!"
- Google Spreadsheet: research_centers_web_v3_with_3yr_ave
- Version controlled data root: research-centers
- From the data root above, you can see it has unit tests, a retrieve.sh, and TWO sets of enhancement parameters.
- read more at Version control strategies: only the essential minimum is needed
Following Automated creation of a new Versioned Dataset again, we can follow the retrieve, convert, test cycle:
mkdir ~/Desktop/reproduce; cd ~/Desktop/reproduce
svn checkout https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-rpi-edu/research-centers/ source/data-rpi-edu/research-centers
cd source/data-rpi-edu/research-centers/version/
./retrieve.sh
cd 2011-Nov-02
cr-test-conversion.sh --setup --verbose
will end up with:
................................................................................
../../rq/test/ask/present/people-lod-link.rq (Ask => Yes)
<https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/typed/person/James_Hendler>
a research-centers_vocab:Faculty , foaf:Person ;
dcterms:identifier "James Hendler" ;
owl:sameAs <https://dbpedia.org/resource/James_Hendler> .
-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/ \ \ \ FAIL / / /
../../rq/test/ask/present/person-not-a-person.rq (Ask => No)
<https://logd.tw.rpi.edu/demo/rpidemo/typed/person/Lucy_T_Zhang> a foaf:Person .
--------------------------------------------------------------------------------
5 of 7 passed
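The reports above can be read as follows: each query filed under rq/test/ask/present/ is an ASK query, and it counts as passed when the triple store answers yes. A minimal sketch of that tally, assuming only the directory convention visible in the output; the function names are hypothetical and this is not the code of cr-test-conversion.sh:

```python
def expected_answer(query_path):
    """Queries filed under ask/present/ are expected to answer yes (True)."""
    if "/ask/present/" in query_path:
        return True
    # Other directories are outside what this page shows, so refuse to guess.
    raise ValueError(f"no known expectation for {query_path}")


def tally(results):
    """results is a list of (query path, ASK answer) pairs; returns (passed, total)."""
    passed = sum(1 for path, answer in results if answer == expected_answer(path))
    return passed, len(results)


# Two of the seven results reported above.
results = [
    ("../../rq/test/ask/present/people-lod-link.rq", True),
    ("../../rq/test/ask/present/person-not-a-person.rq", False),
]
passed, total = tally(results)
print(f"{passed} of {total} passed")  # prints "1 of 2 passed"
```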
Enhancement Layer 1 looks like:
@prefix research-centers:
<https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02/> .
@prefix typed_person:
<https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/typed/person/> .
research-centers:researchCenter_5
dcterms:isReferencedBy <https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
void:inDataset <https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
a foaf:Organization , research-centers_vocab:ResearchCenter ;
e1:fund_home_organization_description
value_of_fund_home_organization_description:Center_for_Advanced_Interconnect_Systems_Technologies ;
e1:fund_home_portfolio_description "Vice President of Research" ;
e1:expenditures "113670"^^xsd:integer ;
foaf:member typed_person:Toh-Ming_Lu ,
typed_person:James_Lu ;
e1:core_facilities typed_facility:Clean_Room ;
e1:signature_thrust typed_thrust:Nanotech ,
typed_thrust:Energy_Envt ;
e1:school typed_school:School_of_Science_School_of_Engineering ;
foaf:member typed_person:David_Duquette ,
typed_person:Daniel_Gall ;
e1:funding_source_distribution "Corp 16.4%, Other 2.72%, State 80.88% " ;
e1:corporation "16.4"^^xsd:decimal ;
e1:state "80.88"^^xsd:decimal ;
e1:other "2.72"^^xsd:decimal ;
e1:average_expenditures "113,670" ;
e1:average_oh "8,500" ;
e1:acronym "CAIST" ;
ov:csvRow "5"^^xsd:integer .
Running ./convert-research-centers.sh -e 2
will produce Enhancement Layer 2, which converts with cell-based subjects (see "Converting with cell based subjects") to create RDF Data Cube-friendly RDF:
research-centers:expenditureProportion_5_14
dcterms:isReferencedBy <https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
void:inDataset <https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
a research-centers_vocab:ExpenditureProportion ;
e2:research_center typed_researchcenter:Center_for_Advanced_Interconnect_Systems_Technologies ;
e2:funding_type <https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/funding-type/Corporation> ;
rdf:value "0.16399999999999998"^^xsd:decimal ;
ov:csvRow "5"^^xsd:integer ;
ov:csvCol "14"^^xsd:integer ;
e2:acronym "CAIST" .
research-centers:expenditureProportion_5_16
dcterms:isReferencedBy <https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
void:inDataset <https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
a research-centers_vocab:ExpenditureProportion ;
e2:research_center typed_researchcenter:Center_for_Advanced_Interconnect_Systems_Technologies ;
e2:funding_type <https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/funding-type/State> ;
rdf:value "0.8088"^^xsd:decimal ;
ov:csvRow "5"^^xsd:integer ;
ov:csvCol "16"^^xsd:integer ;
e2:acronym "CAIST" .
research-centers:expenditureProportion_5_18
dcterms:isReferencedBy <https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
void:inDataset <https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/version/2011-Nov-02>;
a research-centers_vocab:ExpenditureProportion ;
e2:research_center typed_researchcenter:Center_for_Advanced_Interconnect_Systems_Technologies ;
e2:funding_type <https://logd.tw.rpi.edu/source/data-rpi-edu/dataset/research-centers/funding-type/Other> ;
rdf:value "0.027200000000000002"^^xsd:decimal ;
ov:csvRow "5"^^xsd:integer ;
ov:csvCol "18"^^xsd:integer ;
e2:acronym "CAIST" .
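The rdf:value literals above, such as 0.16399999999999998, look like binary floating-point results of dividing the percentages from e1:funding_source_distribution by 100. The following sketch reshapes one row into one record per cell, in the spirit of cell-based subjects, and uses exact decimal arithmetic instead. The function names are hypothetical, and the column-to-funding-type mapping is read off the three subjects shown above rather than taken from the enhancement parameters.

```python
from decimal import Decimal


def proportion(percent_text):
    """Convert a percentage such as "16.4" to an exact decimal fraction."""
    return Decimal(percent_text) / Decimal(100)


def cell_based_subjects(row, cells):
    """Create one record per (row, column) cell instead of one per row.

    cells maps a column number to a (funding type, percentage text) pair.
    """
    return [
        {
            "subject": f"expenditureProportion_{row}_{col}",
            "funding_type": funding_type,
            "value": proportion(percent_text),
        }
        for col, (funding_type, percent_text) in sorted(cells.items())
    ]


# Row 5 of the research centers table, as shown in Layer 1 and Layer 2.
records = cell_based_subjects(
    5,
    {14: ("Corporation", "16.4"), 16: ("State", "80.88"), 18: ("Other", "2.72")},
)
for record in records:
    print(record["subject"], record["funding_type"], record["value"])
```

With exact decimals the three proportions sum to exactly 1, which makes a convenient sanity check on a row.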
- conversion:subclass_of is an enhancement that connects local vocabulary to popular vocabulary, making our data more interoperable.
conversion:enhance [
conversion:domain_name "ResearchCenter";
];
conversion:enhance [
conversion:class_name "ResearchCenter";
conversion:subclass_of foaf:Organization;
];
results in:
typed_researchcenter:Center_for_Advanced_Interconnect_Systems_Technologies
a research-centers_vocab:ResearchCenter, # <- A local class just created.
foaf:Organization; # <- A class that "everyone" recognizes.
dcterms:identifier "Center for Advanced Interconnect Systems Technologies";
rdfs:label "Center for Advanced Interconnect Systems Technologies";
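The effect shown above, where an instance of the local class is also typed with the class it specializes, can be sketched as a walk up a subclass map. This is an illustration only: the function name is hypothetical, and whether csv2rdf4lod asserts just the directly declared superclass or follows longer chains is not stated here, so the chain-following behavior is an assumption.

```python
def all_types(class_name, superclasses):
    """Return class_name followed by every superclass reachable from it."""
    types, queue = [], [class_name]
    while queue:
        current = queue.pop(0)
        if current not in types:  # also guards against cycles in the map
            types.append(current)
            queue.extend(superclasses.get(current, []))
    return types


# Mirrors `conversion:class_name "ResearchCenter"; conversion:subclass_of foaf:Organization`.
superclasses = {"research-centers_vocab:ResearchCenter": ["foaf:Organization"]}
print(all_types("research-centers_vocab:ResearchCenter", superclasses))
```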
- More LOD-Linking with conversion:links_via, this time with a SPARQL query against our Abstract Person Instance Hub named graph.
conversion:enhance [
ov:csvCol 4, 9, 10, 11;
conversion:equivalent_property foaf:member;
conversion:range_template "[/sd]typed/person/[.]";
rdfs:comment "lod-links from <https://logd.tw.rpi.edu/source/twc-rpi-edu/dataset/instance-hub-people>";
a conversion:CaseInsensitiveLODLink;
conversion:links_via # Sesame doesn't like redirects?: <https://purl.org/twc/query/instance-hub/intranet/people>;
<https://logd.tw.rpi.edu:8890/sparql?default-graph-uri=&query=PREFIX+foaf%3A++++%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0APREFIX+dcterms%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0APREFIX+owl%3A+++++%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E%0D%0ACONSTRUCT+{+%3Fperson+dcterms%3Aidentifier+%3Fid+}%0D%0AWHERE+{%0D%0A++GRAPH+%3Chttp%3A%2F%2Flogd.tw.rpi.edu%2Fsource%2Ftwc-rpi-edu%2Fdataset%2Finstance-hub-people%3E++{%0D%0A++++%3Fp+a+foaf%3APerson%3B+owl%3AsameAs+%3Fperson+%3B+dcterms%3Aidentifier+%3Fid%0D%0A++}%0D%0A}&debug=on&timeout=&format=application%2Frdf%2Bxml>;
conversion:subject_of dcterms:identifier;
conversion:range rdfs:Resource;
];
which uses this SPARQL query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?person dcterms:identifier ?id }
WHERE {
GRAPH <http://logd.tw.rpi.edu/source/twc-rpi-edu/dataset/instance-hub-people> {
?p a foaf:Person; owl:sameAs ?person ; dcterms:identifier ?id
}
}
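The long conversion:links_via URL above is this query percent-encoded into the endpoint's query string. A minimal sketch of building such a URL with Python's standard library, using the namespace IRIs as they appear in the percent-encoded URL; the function name is hypothetical, and only the query and format parameters are included (the original URL also carries default-graph-uri, debug, and timeout).

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "https://logd.tw.rpi.edu:8890/sparql"

QUERY = """PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?person dcterms:identifier ?id }
WHERE {
  GRAPH <http://logd.tw.rpi.edu/source/twc-rpi-edu/dataset/instance-hub-people> {
    ?p a foaf:Person; owl:sameAs ?person ; dcterms:identifier ?id
  }
}"""


def links_via_url(endpoint, query, result_format="application/rdf+xml"):
    """Encode a SPARQL query into a GET URL that returns RDF."""
    return endpoint + "?" + urlencode({"query": query, "format": result_format})


url = links_via_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0] == QUERY)
```

Because the result is an ordinary URL that dereferences to RDF, it can be listed next to the static .ttl link files used in the Farmers Market example.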
- cr-vars.sh and CSV2RDF4LOD environment variables
- publishing
- Named graph organization and Aggregating subsets of converted datasets

Questions:
- For each research center, how many foaf:members does it have?
- Number of members versus funding amount for each research center?
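The first question amounts to grouping foaf:member triples by their subject; against the published data this would normally be a SPARQL aggregate query. As a hedged illustration in plain Python, the sketch below counts members in the Layer 1 excerpt for researchCenter_5 shown earlier. The tuple representation and the function name are assumptions for illustration.

```python
from collections import Counter

# (subject, predicate, object) tuples copied from the Layer 1 excerpt.
triples = [
    ("research-centers:researchCenter_5", "foaf:member", "typed_person:Toh-Ming_Lu"),
    ("research-centers:researchCenter_5", "foaf:member", "typed_person:James_Lu"),
    ("research-centers:researchCenter_5", "foaf:member", "typed_person:David_Duquette"),
    ("research-centers:researchCenter_5", "foaf:member", "typed_person:Daniel_Gall"),
    ("research-centers:researchCenter_5", "e1:acronym", "CAIST"),
]


def member_counts(triples):
    """Count foaf:member triples per subject."""
    return Counter(s for s, p, _ in triples if p == "foaf:member")


for center, count in member_counts(triples).items():
    print(center, count)
```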
Using the enhanced RDF created above, two web applications were created and connected by their common topic:
- https://logd.tw.rpi.edu/demo/rpidemo/start.html was created first to allow traditional navigation
- https://logd.tw.rpi.edu/demo/rpidemo/static/research.html was created to provide an overview
- this overview provided links according to their URIs, which resolved to the pages at https://logd.tw.rpi.edu/demo/rpidemo/start.html
- Currently, this second demo is not "live" (in terms of being driven by live queries rather than static JSON). Alvaro's center detail pages are, however.
Clicking on "Center for Flow Physics and Control" will resolve its URI and redirect to HTML:
csv2rdf4lod is designed to allow third parties to incrementally improve the RDF representation of tabular data. The layering design that csv2rdf4lod embodies permits backward compatibility by ensuring monotonic assertions with subsequent layers. Since the demonstrations above already use layers 1 and 2, we don't want to prematurely change that structure.
But since they were created, we have realized there is a better way to model the structure. We can press on and start a layer 3:
cd source/data-rpi-edu/research-centers/version/2011-Nov-02
./convert-research-centers.sh -e 3