This document reports on evidence and implementations of the Data on the Web Best Practices Candidate Recommendation. In particular, it demonstrates that the DWBP are already in use and are also implementable.
Introduction
One of the main goals of the Data on the Web Best Practices (DWBP) is to facilitate interaction between publishers and consumers of data on the Web. A set of 35 Best Practices was created to cover different challenges related to data publishing and consumption, such as Metadata, Data licenses, Data provenance, Data quality, Data versioning, Data identification, Data formats, Data vocabularies, Data access and APIs, Data preservation, Feedback, Data enrichment and Data republication.
To show that the DWBP are implementable as well as broadly adopted and referenced by well-known organizations, we collected evidence in the form of datasets, data portals, documents, references and guidelines (Section 2). We used two forms to collect this evidence: the DWBP evidence form and the DWBP template form. The results are summarized in this report.
Besides the results collected from the surveys, and in order to strengthen the evidence of DWBP adoption, we also present our evaluation of how the DWBP are currently being adopted by the major data catalog solutions, including CKAN, Socrata, DKAN, JUNAR, ArcGIS Open Data and OPENDATASOFT (Section 3). Finally, we also present some examples to illustrate that each one of the DWBP is implementable (Section 4).
Methodology
We collected evidence for the DWBP as described below. To achieve broader coverage of DWBP adoption, we considered different types of evidence: datasets, data portals and vocabularies; documents and references; and guidelines.
Meeting the exit criteria
As described in the DWBP charter, to move on to Proposed Recommendation, evidence must be adduced to demonstrate that each of the best practices has been recommended or adopted in at least two environments, such as data portals and formal policies. Evidence of implementation was gathered from existing datasets and data portals that already implement the proposed best practices, as well as from national or sector-specific guidelines that reference the DWBP and from documents available on the Web.
DWBP Evidence
The table below shows the evidence collected for each one of the DWBP.
BP | Evidence | Total |
---|---|---|
Datasets, Data portals and Vocabularies
The following table shows organizations and implementers that contributed DWBP evidence in the form of Datasets, Data Portals and Vocabularies.
ID | Organization Name | Evidence URI | Category | Domain | Data Catalog?* |
---|---|---|---|---|---|
* This column indicates whether a data catalog solution is used to provide the data. The data catalog can be based on an existing solution like CKAN or can be a proprietary one.
Documents and References
The following table shows organizations and implementers that contributed DWBP evidence in the form of Documents and References.
ID | Organization | Evidence URI | Category |
---|---|---|---|
Guidelines
The following table shows organizations and implementers that contributed DWBP evidence in the form of Guidelines.
ID | Guide | Creator | Country | Year |
---|---|---|---|---|
General analysis
One of our main concerns when we started to collect evidence for each one of the DWBP was to obtain implementations from well-known organizations as well as high-profile datasets and data portals worldwide, like DBpedia, Data.gov.uk, Data.gov and the World Bank. Analyzing the tables presented in the previous section, we can say that we accomplished this goal. The DWBP evidence was collected from well-known organizations and projects, including the ones mentioned before as well as the BBC, Twitter, Europeana, Pacific Northwest National Laboratory and OpenStreetMap. Considering the geographical coverage, we collected implementations from several countries, including Brazil, France, Ireland, Italy, New Zealand, Spain, the UK and the USA. It is also important to notice that the evidence in the form of guidelines concerns several governmental organizations from Europe. Another important characteristic of the DWBP implementations is their broad domain coverage: they refer to different domains, like Government, Environment and Healthcare, as described in the graphic below.
As we can observe in the graphic below, there is broad adoption of the DWBP related to Metadata (BP1 and BP2), Data Licenses (BP4), Data Identification (BP9 and BP10), Data Formats (BP12 and BP14), Vocabularies (BP15 and BP16), Data Access (BP23, BP24, BP25 and BP26) and Feedback (BP29). On the other hand, for others, such as Preserve identifiers (BP27), Assess dataset coverage (BP28), Provide real-time access (BP20) and Provide an explanation for data that is not available (BP22), collecting evidence was more difficult, especially for datasets and data portals. This is explained by comments received during the evidence-gathering process, which are also available in the DWBP evidence form. Bill Roberts from Swirrl, for example, made the following comment about one of the Data Preservation best practices: "Too difficult to test in a meaningful way. In this system, no datasets have yet been taken offline, so the archiving process has not been developed." Similarly, he commented on the Best Practice Provide real-time access: "The system does not currently hold dataset collected in 'real time'. Generally the data is statistical in nature and goes through a slower collection and processing cycle."
DWBP and Data Catalogs
In this section we present further evidence of DWBP adoption. Rather than specific datasets or data portals, we use the following data catalog solutions as evidence: CKAN, Socrata, DKAN, JUNAR, ArcGIS Open Data and OPENDATASOFT. For each one of the DWBP, we show the list of data catalog solutions that implement it.
BP | Data Catalogs | Total |
---|---|---|
As we may notice, there is no evidence for some of the DWBP. This happens because these Best Practices do not concern the solution used for making the data available on the Web, i.e. the data catalog solution, as explained below.
- BP10, BP16, BP22, BP28: these BP apply to the data itself rather than the data catalog solution used to publish the data.
- BP33, BP34, BP35: these BP apply to situations of data republication, i.e. they depend on the consumer rather than on the data catalog solution used to publish the data.
- BP31: this BP concerns processes that can be used to enhance, refine or otherwise improve raw or previously processed data, which are not part of the basic data catalog functions.
Concerning BP27 (Preserve identifiers), none of the data catalog solutions implements it: in general, when a dataset is no longer available, just a generic 404 error message is returned.
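To make that gap concrete, the following is a minimal sketch, in Python with Flask, of how a publisher could preserve identifiers as BP27 recommends. The route, dataset identifiers and registry are hypothetical and not taken from any of the catalog solutions evaluated here; the point is only that a deliberately withdrawn dataset answers 410 Gone with a pointer to its replacement, while 404 is reserved for identifiers that never existed.

```python
# Minimal sketch of BP27 (Preserve identifiers); all names are hypothetical.
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical record of dataset identifiers that were deliberately withdrawn.
WITHDRAWN = {
    "census-2001": "Superseded by http://example.org/dataset/census-2011",
}

@app.route("/dataset/<dataset_id>")
def get_dataset(dataset_id):
    if dataset_id in WITHDRAWN:
        # 410 Gone: the identifier keeps resolving, and clients learn the
        # resource existed but was intentionally removed.
        return jsonify(status="gone", note=WITHDRAWN[dataset_id]), 410
    # 404 only for identifiers that never existed.
    return jsonify(status="not found"), 404
```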
Some Best Practices related to metadata are only partially implemented by the data catalog solutions. Note that almost all data catalog solutions are compatible with DCAT, which means that metadata covered by DCAT may be completely or partially available in both human-readable and machine-readable formats. In practice, this often means that only a human-readable or only a machine-readable version of the metadata is available, as detailed below.
- BP3 is partially implemented by CKAN and JUNAR because they do not offer an explicit way to present human-readable structural metadata.
- BP4 and BP5 are partially implemented by ARCGIS OPEN DATA and OPENDATASOFT because they do not offer a way to represent machine-readable license metadata and machine-readable provenance metadata.
- BP8 is partially implemented by SOCRATA because it does not offer a way to represent machine-readable version history metadata.
- BP13 is partially implemented by OPENDATASOFT because it does not offer a way to represent machine-readable language metadata.
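To illustrate the distinction between human-readable and machine-readable metadata discussed above, the following is a minimal sketch in Python with rdflib. The dataset URI and all property values are hypothetical; it simply builds DCAT-based metadata covering some of the properties mentioned in the list, such as license and language, and serializes it in a machine-readable format.

```python
# Minimal sketch of machine-readable DCAT metadata; values are hypothetical.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCTERMS)

dataset = URIRef("http://example.org/dataset/bus-stops")  # hypothetical dataset URI

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Bus stops", lang="en")))  # descriptive metadata (BP2)
g.add((dataset, DCTERMS.license,                                  # machine-readable license (BP4)
       URIRef("https://creativecommons.org/licenses/by/4.0/")))
g.add((dataset, DCTERMS.language,                                 # language metadata (BP13)
       URIRef("http://id.loc.gov/vocabulary/iso639-1/en")))

# The Turtle serialization is the machine-readable form; a catalog's web page
# would render the human-readable counterpart of the same properties.
print(g.serialize(format="turtle"))
```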
As a general analysis with regard to the Data on the Web challenges, we can say that the Metadata, Data Licenses and Data Formats challenges are a main concern of the data catalog solutions. The Data Access challenge has also been recognized as an important one, except where it concerns real-time data; the use of Data Access APIs is a consensus. The major data catalog solutions also deal with the Data Identification challenge, although only part of the problem has been solved. The Data Vocabularies challenge has also been considered important, since data catalog solutions reuse existing vocabularies, e.g. DCAT, when publishing metadata about the data catalogs. Other challenges, like Data Provenance, Data Versioning and Feedback, have been dealt with only superficially by the data catalog solutions. In general, Data Quality, Data Preservation, Data Enrichment and Data Republication are challenges still not explored by the major data catalog solutions.
Set of Best Practices
The following list shows the set of best practices linked to the DWBP document:
- Best Practice 1: Provide metadata
- Best Practice 2: Provide descriptive metadata
- Best Practice 3: Provide structural metadata
- Best Practice 4: Provide data license information
- Best Practice 5: Provide data provenance information
- Best Practice 6: Provide data quality information
- Best Practice 7: Provide a version indicator
- Best Practice 8: Provide version history
- Best Practice 9: Use persistent URIs as identifiers of datasets
- Best Practice 10: Use persistent URIs as identifiers within datasets
- Best Practice 11: Assign URIs to dataset versions and series
- Best Practice 12: Use machine-readable standardized data formats
- Best Practice 13: Use locale-neutral data representations
- Best Practice 14: Provide data in multiple formats
- Best Practice 15: Reuse vocabularies, preferably standardized ones
- Best Practice 16: Choose the right formalization level
- Best Practice 17: Provide bulk download
- Best Practice 18: Provide Subsets for Large Datasets
- Best Practice 19: Use content negotiation for serving data available in multiple formats (a sketch illustrating this follows the list)
- Best Practice 20: Provide real-time access
- Best Practice 21: Provide data up to date
- Best Practice 22: Provide an explanation for data that is not available
- Best Practice 23: Make data available through an API
- Best Practice 24: Use Web Standards as the foundation of APIs
- Best Practice 25: Provide complete documentation for your API
- Best Practice 26: Avoid Breaking Changes to Your API
- Best Practice 27: Preserve identifiers
- Best Practice 28: Assess dataset coverage
- Best Practice 29: Gather feedback from data consumers
- Best Practice 30: Make feedback available
- Best Practice 31: Enrich data by generating new data
- Best Practice 32: Provide Complementary Presentations
- Best Practice 33: Provide Feedback to the Original Publisher
- Best Practice 34: Follow Licensing Terms
- Best Practice 35: Cite the Original Publication
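As a hypothetical illustration of Best Practice 19 from the list above, the following Python sketch requests the same made-up dataset URI with different Accept headers. A server that implements content negotiation would answer each request with the corresponding representation; nothing here is specific to any of the portals or catalog solutions surveyed in this report.

```python
# Minimal client-side sketch of content negotiation (BP19); URI is hypothetical.
import requests

URI = "http://example.org/dataset/bus-stops"

for accept in ("text/csv", "application/json", "text/turtle"):
    response = requests.get(URI, headers={"Accept": accept})
    # A negotiating server echoes the chosen representation in Content-Type.
    print(accept, "->", response.status_code, response.headers.get("Content-Type"))
```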
Acknowledgements
The editors gratefully acknowledge the contributions made to gathering evidence for the DWBP by all members of the working group, especially Annette Greiner, Antoine Isaac, Carlos Laufer, Christophe Guéret, Deirdre Lee, Eric Stephan, Makx Dekkers, Martin Alvarez-Espinar, Peter Winstanley, Phil Archer and Riccardo Albertoni.
The editors would also like to thank those who submitted evidence: Bill Roberts, Christophe Guéret, Diogo Cortiz, Fábio Rodrigues, Eduardo Rodrigues Vasconcelos, Gregor Boyd, Herbert Van de Sompel, Jefferson Rafael Silva, João Victor Pacheco Dias, José Marcio Martins Junior, Laura Manley, Markus Freudenberg, Milos Jovanovik, Rafael Sá Anselmo, Reinaldo Ferraz and Williams Alcântara.