Converting with cell based subjects

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

diagram comparing row and cell based interpretations

Introduction

csv2rdf4lod lets you specify parameters for how a tabular structure should be interpreted, so that anyone can produce well-formed RDF that closely corresponds to the domain being modeled. Beyond giving the resulting RDF a more natural structure, it provides sensible defaults for the URI naming scheme for entities mentioned within datasets, along with the flexibility to change the naming conventions used. This allows a bottom-up, incremental, and backward-compatible approach to integrating datasets from multiple sources.

Background: binary vs. n-ary

The default interpretation is row-based: a URI is minted for each row in the table, predicates are minted from the column headers, and each cell produces a triple from the row URI to the cell value using the (binary) predicate (a.k.a. property or attribute) derived from the cell's column. The special sauce that csv2rdf4lod provides is a declarative way to express how this relatively trivial interpretation should be tweaked to make more natural representations (e.g., datatyping the string in the cell, promoting it to a (good) URI, restructuring the triple so it describes a different subject, or drawing out the implicit entities being described, i.e., "normalization"). All of that happens with the enhancement parameters, which loosely correspond to the axioms in RDFS (and OWL) where appropriate. The distinction is that RDFS and OWL assume RDF data, whereas csv2rdf4lod handles arbitrary literals to get them to the RDF level.

The default interpretation just described creates binary relations from the row to the cell value, but some tabular structures are used to express n-ary relations where n is more than two. To interpret these correctly, we can switch from a row-based interpretation to a cell-based interpretation. To summarize:

Row-based interpretation: the table expresses binary relations
Cell-based interpretation: the table expresses n-ary relations (n greater than two)

For an authoritative discussion of modeling n-ary relationships in RDF, see http://www.w3.org/TR/swbp-n-aryRelations/.

A simple example illustrating the difference between binary (row-based) and n-ary (cell-based) interpretations

To illustrate the difference between tables that express binary relationships and tables that express n-ary relationships, we can consider a high school enrollment dataset.

In the binary example, we are listing the students and some of their properties (their grade level, their homeroom, and the classes they are enrolled in). Each column represents a different property and each cell represents a relationship from the student to the value in the cell.
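The binary reading can be sketched in a few lines of Python. This is only an illustration of the idea, not csv2rdf4lod's own code: the `BASE` namespace, the `thing_N` and `vocab/` URI patterns, the helper names, and the student data are all made up for the example.

```python
# Illustrative sketch only; not csv2rdf4lod's implementation.
BASE = "http://example.org/enrollment/"  # hypothetical namespace


def slug(text):
    """Turn a header such as 'Grade Level' into a URI-safe local name."""
    return text.strip().lower().replace(" ", "_")


def row_based_triples(header, rows, first_row_number=2):
    """Yield one (subject, predicate, object) triple per non-empty cell."""
    for offset, row in enumerate(rows):
        # One subject URI per row (assumed naming pattern).
        subject = f"{BASE}thing_{first_row_number + offset}"
        for column_name, cell in zip(header, row):
            if cell == "":
                continue  # an empty cell asserts nothing
            # The predicate is derived from the column header.
            predicate = f"{BASE}vocab/{slug(column_name)}"
            # The cell value is kept as a plain literal.
            yield (subject, predicate, cell)


# Made-up data: row 1 of the CSV is the header, data starts at row 2.
header = ["Name", "Grade Level", "Homeroom"]
rows = [["Alice", "10", "B12"],
        ["Bob", "11", ""]]

triples = list(row_based_triples(header, rows))
for triple in triples:
    print(triple)
```

Every triple shares its subject with the other cells in the same row, which is what makes the relation binary: row to value.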

In the n-ary example, we are listing the students and the courses they are enrolled in. Each column represents a course and each cell represents a 3-ary relationship between the student, the course, and whether or not the student is attending.
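A rough Python sketch of the n-ary reading follows. It is an illustration under assumptions rather than the converter's actual logic: the `BASE` namespace, the `enrollment_ROW_COL` naming pattern, the `vocab/student` and `vocab/course` predicates, and the data are hypothetical.

```python
# Illustrative sketch only; not csv2rdf4lod's implementation.
BASE = "http://example.org/enrollment/"  # hypothetical namespace
RDF_VALUE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"


def cell_based_triples(header, rows, key_column=0, first_row_number=2):
    """Yield triples in which every non-key cell is its own subject."""
    for r, row in enumerate(rows, start=first_row_number):
        student = f"{BASE}student/{row[key_column]}"
        for c, (course, cell) in enumerate(zip(header, row), start=1):
            if c - 1 == key_column or cell == "":
                continue  # skip the key column and empty cells
            # One subject URI per cell, named by row and column number.
            enrollment = f"{BASE}enrollment_{r}_{c}"
            # Who the cell is about (taken from the row).
            yield (enrollment, f"{BASE}vocab/student", student)
            # Which course the cell is about (taken from the column header).
            yield (enrollment, f"{BASE}vocab/course", f"{BASE}course/{course}")
            # The value found in the cell itself.
            yield (enrollment, RDF_VALUE, cell)


# Made-up data: one column per course, cells say whether the student attends.
header = ["Student", "Algebra", "Biology"]
rows = [["alice", "yes", "no"],
        ["bob", "no", ""]]

triples = list(cell_based_triples(header, rows))
for triple in triples:
    print(triple)
```

Each cell becomes a small node that ties together three things (student, course, attendance), which a single row-to-value triple cannot express.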

To summarize

For binary (row-based) tables:

the subject of each triple is the "row", and you generate some URI for it
the predicate comes from the column header
the object referred to by the predicate (i.e., the column header) is the item in the cell (row, column)

Writing the enhancement parameters

Type the conversion:Enhancement to qb:Observation (or scovo:Item)

Controlling the "up triple":

The conversion:label will now be used to name the predicate of the triple from the cell to the up value (instead of using it to name the predicate of the triple from the row to the cell value).
conversion:equivalent_property with a URI or template will override the ("up") property that would have been created with conversion:label.
If the conversion:object predicate is omitted, the object of the up triple will be a Resource named using the original column header (e.g., hhs chsi).
- Note that this does not automatically keep the header as a literal, but one shouldn't be going out of their way to keep something a literal, especially something important enough to be listed in the header.
A conversion:object value of "[/sd]/value-of/[@]/[.]" will omit the subject discriminator when naming the Resource.
The conversion:object can be a template; e.g., conversion:object "[/sd]typed/council/[H]"; will type-promote the header outside of the subjectDiscriminator.

Controlling the "out triple":

The predicate of the out triple is rdf:value by default. This can be overridden with an independent triple rdf:value conversion:equivalent_property foo:bar within the enhancements file.
conversion:range controls how the cell value is interpreted in the usual way (promoting to a resource, data typing, etc).
conversion:range_template controls how the cell value's URI is constructed in the usual way when the value is promoted to a resource.
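As a toy model of the rules in the two lists above, the following Python sketch builds the "up" and "out" triples for a single cell. It is a hedged illustration, not the converter's logic: the `VOCAB` and `VALUE_OF` namespaces, the function name, and its keyword arguments are hypothetical stand-ins for conversion:label, conversion:equivalent_property, and conversion:object.

```python
# Illustrative sketch only; not csv2rdf4lod's implementation.
RDF_VALUE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"
VOCAB = "http://example.org/vocab/"        # hypothetical namespace
VALUE_OF = "http://example.org/value-of/"  # hypothetical namespace


def cell_triples(cell_uri, header, cell_value, label,
                 equivalent_property=None, obj=None, out_property=RDF_VALUE):
    """Return the [up triple, out triple] pair for one cell."""
    # Up predicate: named from the label unless an equivalent property is given.
    up_predicate = equivalent_property or VOCAB + label.lower().replace(" ", "_")
    # Up object: the explicit object if given, else a resource named from the header.
    up_object = obj if obj is not None else VALUE_OF + header
    return [
        (cell_uri, up_predicate, up_object),   # "up": cell -> column header
        (cell_uri, out_property, cell_value),  # "out": cell -> its own value
    ]


CELL = "http://example.org/cell_3_3"  # hypothetical cell URI

# No explicit object: the header "2000" becomes a resource.
default = cell_triples(CELL, "2000", "16.5", label="Year")

# Explicit object, written here as a typed-literal string for readability.
explicit = cell_triples(CELL, "2000", "16.5", label="Year",
                        obj='"2000"^^xsd:gYear')

# An equivalent property replaces the predicate derived from the label.
overridden = cell_triples(CELL, "2000", "16.5", label="Year",
                          equivalent_property="http://purl.org/dc/terms/date")

for pair in (default, explicit, overridden):
    print(pair)
```

The out triple is unaffected by any of the up-triple settings; only its predicate and the interpretation of the cell value vary.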

When facing many cell-based subjects, the script described at Script: cell ify params.awk can help automate modifying the enhancement parameters.

Examples

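@prefix rdfs:       <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:        <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms:    <http://purl.org/dc/terms/> .
@prefix void:       <http://rdfs.org/ns/void#> .
@prefix ov:         <http://open.vocab.org/terms/> .
@prefix scovo:      <http://purl.org/NET/scovo#> .
@prefix conversion: <http://purl.org/twc/vocab/conversion/> .
@prefix :  <http://logd.tw.rpi.edu/source/nci-nih-gov/dataset/state-tobacco-tax/version/2001-Jan-01/params/enhancement/1/> .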

:dataset a void:Dataset;
   conversion:base_uri           "http://logd.tw.rpi.edu"^^xsd:anyURI;
   conversion:source_identifier  "nci-nih-gov";
   conversion:dataset_identifier "state-tobacco-tax";
   conversion:dataset_version    "2010-Mar-29";
   conversion:conversion_process [
      a conversion:RawConversionProcess;
      conversion:enhancement_identifier "1";
      conversion:enhance [
         ov:csvRow 2;
         a conversion:HeaderRow;
      ];
      conversion:enhance [
         ov:csvRow 53;
         a conversion:DataEndRow;
      ];
      conversion:enhance [
         ov:csvCol         1;
         ov:csvHeader     "";
         conversion:label "State Order";
         conversion:range  xsd:integer;
         conversion:bundled_by [ ov:csvCol 2 ];
      ];
      conversion:enhance [
         ov:csvCol         2;
         ov:csvHeader     "";
         conversion:label "State"; 

         conversion:range  rdfs:Resource;

         conversion:range_name "State";

         conversion:links_via <http://www.rpi.edu/~lebot/lod-links/state-fips-dbpedia.ttl>,
                              <http://www.rpi.edu/~lebot/lod-links/state-fips-geonames.ttl>,
                              <http://www.rpi.edu/~lebot/lod-links/state-fips-govtrack.ttl>;
         conversion:subject_of dcterms:identifier;

         conversion:domain_name "Annual tax average";
      ];
      conversion:enhance [
         ov:csvCol         3;
         ov:csvHeader     "2000"; 

         a scovo:Item;
         conversion:label "Year";            # Property from cell URI to "2000"^^xsd:gYear
         conversion:object "2000"^^xsd:gYear;

         conversion:range  xsd:decimal; # Range of property "out of page"
      ];
   ] .

@prefix state-tobacco-tax_vocab: <http://logd.tw.rpi.edu/source/nci-nih-gov/dataset/state-tobacco-tax/vocab/> .
@prefix raw:                     <http://logd.tw.rpi.edu/source/nci-nih-gov/dataset/state-tobacco-tax/vocab/raw/> .

state-tobacco-tax:thing_3 
   raw:column_1 "1" ;
   raw:column_2 "Alabama" ;
   <http://logd.tw.rpi.edu/source/nci-nih-gov/dataset/state-tobacco-tax/vocab/raw/2000> "16.5¢" ;
   ov:csvRow 3 .

becomes

@prefix e1: <http://logd.tw.rpi.edu/source/nci-nih-gov/dataset/state-tobacco-tax/vocab/enhancement/1/> .

state-tobacco-tax:annual_tax_average_3_3 
  a state-tobacco-tax_vocab:Annual_tax_average ;
  e1:state typed_state:Alabama ;
  e1:year   "2000"^^xsd:gYear ;
  rdf:value "16.5"^^xsd:decimal; # TODO: should be in e1.
  ov:csvRow "3"^^xsd:integer ;
  ov:csvCol "3"^^xsd:integer .

Candidates for cell-based conversion: Dataset 1612, Dataset 10030, Dataset 1554, Dataset 401, Dataset 402

(SEC company financial reports - http://viewerprototype1.com/viewer choose a company, and "export to Excel")