Aggregating subsets of converted datasets
- Conversion process phase: publish shows the different aggregations that are performed when preparing to publish dump files.
This page describes how to select different subsets of all of the RDF produced during conversion, so that smaller portions can be made widely available without imposing the overhead of loading every dataset's data triples.
- Jump to Development notes
Several types of aggregations are placed in different named graphs for easy access. All queries shown on this page can be executed at http://logd.tw.rpi.edu/sparql. See also Querying datasets created by csv2rdf4lod.
- Aggregating DCAT metadata - descriptions for how to access the data files.
- Aggregating DROID file metadata - Filetypes of any retrieved files.
- Aggregating Datasets' Conversion Metadata - Provenance and metadata created from retrieval, tweaking, conversion, and aggregation.
- Sitemap - robots.txt
- Aggregating owl:sameAs links - All owl:sameAs triples.
- Aggregating MetaDatasets - Datasets that describe datasets.
- Aggregating rdfs:isDefinedBy - Associating every property and class with its vocabulary namespace.
- Aggregating Turtle-in-comments - metadata embedded in comments of other files.
- Aggregating full dump - everything.
- (todo) graphics that have been created from the data.
- (Deriving datasets from existing datasets is discussed at Secondary Derivative Datasets)
All aggregations described on this page follow the convention that their results produce a new version of a new dataset. All members of the aggregation are collected into a new directory:
CSV2RDF4LOD_BASE_URI/source/CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID/dataset/cr-publish-XXX-to-endpoint/version/TODAY/source
where XXX is expanded for the script name and TODAY is the current date in the form YYYY-Mon-DD. A sibling publish/ directory is created to hold the aggregation of the files collected in source/, and scripts in publish/bin/ are created and used to load the aggregation into the triple store. This pattern follows that of the conventions for the conversion cockpit.
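For example, with CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID set to tw-rpi-edu (the value used in the transcripts below), a dcat aggregation run on 17 Sep 2012 lands in directories along the lines of:

tw-rpi-edu/cr-publish-dcat-to-endpoint/version/2012-Sep-17/source/        (aggregation members, hard-linked)
tw-rpi-edu/cr-publish-dcat-to-endpoint/version/2012-Sep-17/publish/       (aggregated dump files)
tw-rpi-edu/cr-publish-dcat-to-endpoint/version/2012-Sep-17/publish/bin/   (load/delete scripts)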
Aggregating DCAT metadata
cr-publish-dcat-to-endpoint.sh creates a new version of the abstract dataset:
CSV2RDF4LOD_BASE_URI/source/CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID/dataset/cr-publish-dcat-to-endpoint
(where the [variables](CSV2RDF4LOD environment variables) in capitals are expanded). The script must be run from a [cr:data-root](directory conventions), e.g. /srv/twc-healthdata/data/source. It aggregates all *dcat.ttl files in the [cr:dataset](directory conventions) and [cr:directory-of-versions](directory conventions) directories. These dcat files reference the data file download URLs for the dataset that the current directory represents. They can be created by cr-create-dataset-dirs-from-ckan.py and are recognized and acted upon by cr-retrieve.sh.
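A minimal sketch of what one of these dcat.ttl files might contain (standard DCAT vocabulary; the exact shape of the generated files may differ):

@prefix dcat: <http://www.w3.org/ns/dcat#> .

<> dcat:distribution [                                    # the dataset this directory represents
      dcat:downloadURL <http://example.org/data/file.csv> # retrieved by cr-retrieve.sh
   ] .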
% pwd
/srv/twc-healthdata/data/source
% cr-pwd-type.sh
cr:data-root
% cr-vars.sh | grep OUR_SOURCE_ID
CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID tw-rpi-edu
% cr-publish-dcat-to-endpoint.sh -n
8 . hub-healthdata-gov/third-national-survey-older/dcat.ttl
8 . hub-healthdata-gov/skilled-nursing-facility-medicare/dcat.ttl
8 . hub-healthdata-gov/skilled-nursing-facility-medicare-cost/dcat.ttl
8 . hub-healthdata-gov/skilled-nursing-facility-medicare-cost-report/dcat.ttl
8 . hub-healthdata-gov/skilled-nursing-facility-medicare-cost-report-data/dcat.ttl
8 . hub-healthdata-gov/skilled-nursing-facility-medicare-cost-report-data-fy2011/dcat.ttl
...
publish/tw-rpi-edu-cr-publish-dcat-to-endpoint-2012-Sep-17.nt
publish/tw-rpi-edu-cr-publish-dcat-to-endpoint-2012-Sep-17.sd_name
publish/tw-rpi-edu-cr-publish-dcat-to-endpoint-2012-Sep-17.void.ttl
% cd tw-rpi-edu/cr-publish-dcat-to-endpoint/version/2012-Sep-17
% ls source/*dcat.ttl | wc -l
136
% ls -lt publish/
total 432
drwxr-xr-x 5 lebot staff 170 Sep 17 23:54 bin
-rw-r--r-- 1 lebot staff 1325 Sep 17 23:54 tw-rpi-edu-cr-publish-dcat-to-endpoint-2012-Sep-17.void.ttl
-rw-r--r-- 1 lebot staff 209888 Sep 17 23:54 tw-rpi-edu-cr-publish-dcat-to-endpoint-2012-Sep-17.nt
-rw-r--r-- 1 lebot staff 97 Sep 17 23:54 tw-rpi-edu-cr-publish-dcat-to-endpoint-2012-Sep-17.sd_name
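Once the publish/bin/ scripts load the aggregation, the download URLs can be pulled back out with a query along these lines (a sketch; in practice the named graph is the value recorded in the .sd_name file above):

PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?dataset ?url
WHERE {
  GRAPH ?g {
    ?dataset dcat:distribution [ dcat:downloadURL ?url ]
  }
} ORDER BY ?dataset LIMIT 10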
Aggregating Datasets' Conversion Metadata
cr-publish-void-to-endpoint.sh creates a new version of the abstract dataset:
CSV2RDF4LOD_BASE_URI/source/CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID/dataset/cr-publish-void-to-endpoint
(where the [variables](CSV2RDF4LOD environment variables) in capitals are expanded). The script must be run from a [cr:data-root](directory conventions), e.g. /srv/twc-healthdata/data/source. It aggregates all conversion cockpits' source/*.void.ttl files, which contain provenance of the retrieval, tweaking, conversion (including enhancement parameters), and aggregation process, as well as the VoID and DC Terms metadata produced by the converter.
Note that $CSV2RDF4LOD_PUBLISH_SUBSET_VOID_NAMED_GRAPH was used to determine the graph before the "Create a new version of the abstract dataset" convention was established; this environment variable is now deprecated. The naming of this script (void) is inaccurate because it provides much more metadata than just VoID -- including provenance.
For example, the following query lists the properties used to describe a versioned dataset in the http://logd.tw.rpi.edu/vocab/Dataset graph:
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?p
WHERE {
GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
<http://logd.tw.rpi.edu/source/nitrd-gov/dataset/nsf_awards/version/2011-Jan-27> ?p ?o
}
} order by ?p
The data is loaded into a named graph named after the dataset:
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?p
WHERE {
GRAPH <http://logd.tw.rpi.edu/source/nitrd-gov/dataset/nsf_awards/version/2011-Jan-27> {
?s ?p ?o
}
} order by ?p
Note that cr-publish-params-to-endpoint.sh used to load into $CSV2RDF4LOD_PUBLISH_CONVERSION_PARAMS_NAMED_GRAPH until the "Create a new version of the abstract dataset" convention was established.
Datasets that promote their properties to geonames:parentFeature:

prefix geonames: <http://www.geonames.org/ontology#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select ?dataset count(*) as ?count
where {
graph <http://purl.org/twc/vocab/conversion/ConversionProcess> {
?dataset conversion:conversion_process [
conversion:enhancement_identifier ?e;
conversion:enhance [
conversion:subproperty_of geonames:parentFeature
]
]
}
}
group by ?dataset ?e
order by ?count
Superproperties referenced:

prefix geonames: <http://www.geonames.org/ontology#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select distinct ?superproperty
where {
graph <http://purl.org/twc/vocab/conversion/ConversionProcess> {
?dataset conversion:conversion_process [
conversion:enhancement_identifier ?e;
conversion:enhance [
conversion:subproperty_of ?superproperty
]
]
}
}
order by ?superproperty
TODO: demonstrate a query that accesses a dataset's data based on its enhancement params. Requires a conversion using the latest build b/c the enhancement params URI was changed to the actual dataset URI for easier connection.
Aggregating owl:sameAs links
All owl:sameAs triples are loaded into $CSV2RDF4LOD_PUBLISH_SUBSET_SAMEAS_NAMED_GRAPH (e.g. http://purl.org/twc/vocab/conversion/SameAsDataset) by $CSV2RDF4LOD_HOME/bin/cr-publish-sameas-to-endpoint.sh.
http://logd.tw.rpi.edu/query/logd-stat-num-outlinks.sparql:

prefix owl: <http://www.w3.org/2002/07/owl#>
SELECT count(*) as ?count
WHERE {
graph <http://purl.org/twc/vocab/conversion/SameAsDataset> {
?s owl:sameAs ?o
}
filter( ! ( regex(str(?s),"^http://logd.tw.rpi.edu*")
&& regex(str(?o),"^http://logd.tw.rpi.edu*") )
)
}
http://logd.tw.rpi.edu/query/logd-stat-num-outlinks-govtrack.sparql:

prefix owl: <http://www.w3.org/2002/07/owl#>
SELECT count(*) as ?count
WHERE {
graph <http://purl.org/twc/vocab/conversion/SameAsDataset> {
?s owl:sameAs ?o
}
filter(regex(str(?o),"^http://www.rdfabout.com/rdf/usgov*"))
}
Queries about links to other bubbles in the LOD cloud can be done using the query above and changing the regex:
http://logd.tw.rpi.edu/query/logd-stat-num-outlinks-geonames.sparql: filter(regex(str(?o),"^http://sws.geonames.org*"))
http://logd.tw.rpi.edu/query/logd-stat-num-outlinks-dbpedia.sparql: filter(regex(str(?o),"^http://dbpedia.org/resource*"))
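A single query can tally the outlinks per bubble, assuming the endpoint supports SPARQL 1.1 bind/if as Virtuoso does (the regexes and the count syntax are those used in the queries above):

prefix owl: <http://www.w3.org/2002/07/owl#>
SELECT ?bubble count(*) as ?count
WHERE {
  graph <http://purl.org/twc/vocab/conversion/SameAsDataset> {
    ?s owl:sameAs ?o
  }
  bind( if( regex(str(?o),"^http://sws.geonames.org"),          "geonames",
        if( regex(str(?o),"^http://dbpedia.org/resource"),      "dbpedia",
        if( regex(str(?o),"^http://www.rdfabout.com/rdf/usgov"),"govtrack",
                                                                "other"))) as ?bubble )
}
group by ?bubble
order by ?count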
TODO: links among VersionedDataset URIs vs. links between VersionedDataset URIs (b/c of owl:sameAs between layers).
TODO: this should return stuff:
prefix void: <http://rdfs.org/ns/void#>
prefix conversion: <http://purl.org/twc/vocab/conversion/>
select distinct ?dataset ?dump
where {
graph ?g {
?dataset a conversion:SameAsDataset; void:dataDump ?dump .
}
} order by ?dataset
Aggregating MetaDatasets
Some datasets actually describe other datasets. For example, data.gov's dataset 92 describes all of data.gov's other "raw" datasets. All MetaDatasets are loaded (by $CSV2RDF4LOD_HOME/bin/util/cr-virtuoso-load-metadataset.sh) into a special named graph so they can be accessed to augment dataset descriptions.
source/data-gov/92/version/data_gov_catalog.csv.e1.params.ttl adds the types conversion:DatasetCatalog and conversion:MetaDataset in its global enhancement parameters:
<http://logd.tw.rpi.edu/source/data-gov/dataset/92/version/2011-Jul-11/conversion/enhancement/1>
a conversion:LayerDataset, void:Dataset;
conversion:base_uri "http://logd.tw.rpi.edu"^^xsd:anyURI;
conversion:source_identifier "data-gov";
conversion:dataset_identifier "92";
conversion:version_identifier "2011-Jul-11";
a conversion:DatasetCatalog, conversion:MetaDataset;
The following query lists datasets that have been converted but have not been described by another dataset (converted via the csv2rdf4lod converter, which asserts ov:csvRow). Thanks to Greg for tweaking the query for efficiency.
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX ov: <http://open.vocab.org/terms/>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?source_id ?dataset_id ?version_id
WHERE {
GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
# Datasets that have been converted
?converted a conversion:VersionedDataset;
conversion:source_identifier ?source_id ;
conversion:dataset_identifier ?dataset_id ;
conversion:version_identifier ?version_id .
}
OPTIONAL {
GRAPH <http://purl.org/twc/vocab/conversion/MetaDataset> {
# But we have no metadata for it.
?converted ov:csvRow ?row
}
}
filter(!bound(?row))
} order by ?source_id ?dataset_id ?version_id
TODO: replace directory processing with a query in logd-load-metadata-graph.sh:
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT distinct ?metadata
WHERE {
GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
?metadata a conversion:MetaDataset;
conversion:conversion_process [] .
}
}
Aug 2011 query for DatasetCatalogs (which are MetaDatasets). They are typed at the abstract and atomic levels. (results):
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?sourceID ?datasetID
WHERE {
GRAPH ?g {
?abstract a conversion:DatasetCatalog;
conversion:source_identifier ?sourceID;
conversion:dataset_identifier ?datasetID .
}
} ORDER BY ?sourceID
(results)
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT distinct ?sourceID ?datasetID
WHERE {
GRAPH ?g {
{?abstract
conversion:source_identifier ?sourceID;
conversion:dataset_identifier ?datasetID;
void:subset [ a conversion:VersionedDataset;
void:subset [
# Datasets from one source data file
a conversion:DatasetCatalog
];
] .
}
UNION {
?abstract
conversion:source_identifier ?sourceID;
conversion:dataset_identifier ?datasetID;
void:subset [ a conversion:VersionedDataset;
void:subset [
void:subset [
# Datasets from multiple source data files
a conversion:DatasetCatalog;
];
];
] .
}
}
} ORDER BY ?sourceID
Aggregating rdfs:isDefinedBy
See cr-isdefinedby.
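Once published, a query of this general shape lists the vocabulary namespaces that converted properties and classes are associated with (a sketch; the named graph is left unconstrained):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?vocab
WHERE {
  GRAPH ?g {
    ?property_or_class rdfs:isDefinedBy ?vocab
  }
} ORDER BY ?vocab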
Aggregating Turtle-in-comments
> cr-pwd-type.sh
cr:data-root
> cr-publish-tic-to-endpoint.sh cr:auto
healthdata-tw-rpi-edu/catalog/version/2012-Sep-19/publish/bin/virtuoso-delete-healthdata-tw-rpi-edu-catalog-2012-Sep-19.sh
healthdata-tw-rpi-edu/catalog/version/2012-Sep-19/publish/bin/virtuoso-load-healthdata-tw-rpi-edu-catalog-2012-Sep-19.sh
healthdata-tw-rpi-edu/catalog/version/retrieve.sh
hub-healthdata-gov/2008-basic-stand-alone-carrier/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-durable/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-home/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-hospice/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-inpatient/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-outpatient/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-prescription/dcat.ttl
hub-healthdata-gov/2008-basic-stand-alone-skilled/dcat.ttl
...
> find tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/10.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/100.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/101.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/102.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/103.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/104.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/105.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/106.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/107.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/108.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/source/109.ttl
...
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/10.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/100.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/101.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/102.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/103.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/104.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/105.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/106.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/107.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/108.ttl.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/automatic/109.ttl.ttl
...
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/bin/ln-to-www-root-tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.sh
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/bin/virtuoso-delete-tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.sh
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/bin/virtuoso-load-tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.sh
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.nt
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.sd_name
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/tw-rpi-edu-cr-publish-tic-to-endpoint-2012-Sep-27.void.ttl
tw-rpi-edu/cr-publish-tic-to-endpoint/version/2012-Sep-27/publish/tw-rpi-edu-cr-publish-tic-to-endpoint-latest.ttl
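The source/NN.ttl files above hold the Turtle that was embedded in comments of other files (e.g. retrieve.sh scripts), and automatic/NN.ttl.ttl holds the processed result. A sketch of such an embedded comment block, using the #3> prefix mentioned in the development notes below; the script gathers these blocks (minus the leading markers) for aggregation:

#3> <> a conversion:RetrievalTrigger;
#3>    rdfs:seeAlso <http://purl.org/twc/vocab/conversion/> .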
The LOGD SPARQL endpoint has three special named graphs:
- http://logd.tw.rpi.edu/vocab/Dataset contains information about the LOGD datasets that was asserted during conversion to RDF. This includes the VoID subset hierarchy and dataDumps, SCOVO triple counts, references to (and definitions of) the predicates and classes used, and some PML justifications tracing the provenance of the tabular conversions to RDF.
- http://purl.org/twc/vocab/conversion/MetaDataset contains information about datasets obtained from other sources. For example, it includes data.gov's Dataset 92 because it describes the rest of data.gov's offerings. A second dataset is TWC's own data catalog that describes similar aspects for datasets from other sources.
- http://purl.org/twc/vocab/conversion/SameAsDataset contains owl:sameAs links among entities within the LOGD datasets as well as into DBPedia, Geonames, and GovTrack. All of the links are co-located in a single graph to help explore the interconnectivity of the LOGD datasets.
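A quick way to compare the sizes of these three graphs (a sketch, in the same dialect as the queries above; filter ... in requires SPARQL 1.1, which Virtuoso supports):

SELECT ?g count(*) as ?count
WHERE {
  graph ?g { ?s ?p ?o }
  filter( ?g in (<http://logd.tw.rpi.edu/vocab/Dataset>,
                 <http://purl.org/twc/vocab/conversion/MetaDataset>,
                 <http://purl.org/twc/vocab/conversion/SameAsDataset>) )
}
group by ?g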
Development notes
Starred (*) scripts are exemplars for the pattern; fewer (non-zero) pluses (+) means developed more recently. (When a new entry appears, add a + to every other existing entry.)
(See this list, too)
- pr-whois-domain.sh
  - Adapted from pr-neighborlod.sh
  - Set URL explicitly to the URI node dump.
  - Hid SPARQL querying rq and rq2
  - Hid DROIDing
  - Never $worthwhile
  - Hid $worthwhile cleanup
  - FORGETs what version it should create when it's placed in a conversion cockpit.
  - Adds destruction of version before marching on to do it again.
- pr-aggregate-pingbacks.sh +
  - Adopted Aggregation exemplar, but need to add verification functionality.
  - Had to switch from PATH=$PATH'$HOME/bin/util/cr-situate-paths.sh' to PATH=$PATH'$HOME/bin/install/paths.sh' when moving from csv2rdf4lod to Prizms.
- opendap-svn-file-hierarchy ++
  - Adapted from pr-neighborlod.sh; took out $rq2 handling, soft link PATH handling.
- bin/dataset/cr-sparql-sd.sh +++
  - Adopted trimmed version of Aggregation exemplar.
  - Removed idempotency; we only want it to be run once.
- data-carved-graphs-btes ++++
- bin/dataset/cr-aggregate-dcat.sh (Aggregation exemplar) ++++
  - reused softlink-safe $this logic
  - reused dryrun conditional
  - cleans out version (as opposed to new dataset exemplar, which increments)
- bin/dataset/pr-neighborlod.sh (New dataset exemplar) ++++
  - updated $this and $HOME logic for when it is a soft link; augments PATH and CLASSPATH in-line.
  - updated the "retrieve from local endpoint" pattern to a variable for the query file.
  - removed check to prevent attempt to make worthwhile version (removes itself if not worthwhile)
  - modifies SPARQL template before execution
  - swaps SPARQL query from subject-based to object-based when the former runs dry.
  - increments version (as opposed to aggregate exemplar, which cleans out the version)
- bin/secondary/cr-aggregate-eparams.sh +++++
  - pushd conversion root
  - more complete "SDV" naming logic
  - removed all graph naming/clearing clutter
  - handles "$0" when it is a soft link
  - includes the #3> <> a conversion:RetrievalTrigger; that pr-enable-dataset.sh needs to list.
- pr-spobal-ng.sh ++++++
  - removes retrieval attempt if it did not become worthwhile.
  - should be extended to accept the sd:name to process.
- WCL's asset-alchemyapi/retrieve.sh +++++++
  - accepts the URI to analyze, or uses cache-queries.sh to SPARQL-query for those that need to be analyzed.
- WCL's property-chains/retrieve.sh ++++++++
  - retrieves with cache-queries.sh
  - recursively generates versions until no triples returned
  - cheats on loading (vload) - needs to be cleaned up.
- bin/cr-pingback.sh ++++++*+++
  - does not depend on CSV2RDF4LOD_HOME;
  - can run from source directory;
  - runs without "cr:auto" argument;
  - CUT a lot of the file aggregation stuff;
  - bails if run within last week -- see cr-publish-droid-to-endpoint.sh or older
- bin/util/cr-full-dump.sh ++++++++++
  - avoids using aggregate-source-rdf.sh
- bin/cr-publish-droid-to-endpoint.sh +++++++++++
  - uses # - - - - to delineate the source/* linking;
  - links into source/ via "sdv".ttl
- bin/cr-publish-isdefinedby-to-endpoint.py +++++++++++++
- bin/cr-publish-isdefinedby-to-endpoint.sh +++++++++++++
  - checks for $CSV2RDF4LOD_PUBLISH_SPARQL_ENDPOINT;
  - wraps python;
  - uses aggregate-source-rdf.sh --link-as-latest;
  - works from cr:data-root cr:source cr:dataset cr:directory-of-versions not just cr:data-root cr:source
- bin/cr-publish-cockpit.sh +++++++++++++
  - just hops into cockpit and runs convert-aggregate.sh
- bin/cr-publish-params-to-endpoint.sh +++++++++++
  - uses aggregate-source-rdf.sh --link-as-latest
- bin/cr-publish-tic-to-endpoint.sh ++++++++++++
  - links into source/ by for loop tally;
  - processes source into automatic
- bin/cr-publish-void-to-endpoint.sh ++++++++++++
  - uses --link-as-latest
- bin/cr-publish-dcat-to-endpoint.sh ++++++++++++
  - uses publish/bin to load/delete graph;
  - uses dryrun.sh $dryrun ending;
  - uses aggregate-source-rdf.sh (not link latest);
  - links into source/"$sdv".ttl with a for loop
- bin/cr-publish-sameas-to-endpoint.sh +++++++++++++
  - still did its own graph loading with vload;
  - should use aggregate-source-rdf.sh
The pattern for all of these scripts is:
- dryrun.sh $dryrun beginning
- Invoke from the conversion data root (i.e., cr-pwd-type.sh reports cr:data-root).
- Use [env var](CSV2RDF4LOD environment variables) $CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID to know the source (organization) directory in which to create the aggregate dataset.
- Make a cockpit for a new versioned dataset, e.g. source/tw-rpi-edu/dataset/cr-publish-dcat-to-endpoint/version/2012-Sep-07
- Hard link files into the new cockpit's source/ directory.
- Aggregate source/* into publish/* with aggregate-source-rdf.sh --link-as-latest source/*
- Use the publish/bin/* scripts to publish like a normal dataset.
- dryrun.sh $dryrun ending
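A minimal bash sketch of that skeleton, using XXX=dcat (illustrative only; the real scripts add checks and options not shown here):

#!/bin/bash
# Sketch of the cr-publish-XXX-to-endpoint.sh pattern; not any actual script.

dryrun.sh $dryrun beginning

# Must be invoked from the conversion data root.
if [ "`cr-pwd-type.sh`" != "cr:data-root" ]; then exit 1; fi

# The aggregate dataset is created under our own source (organization).
sourceID="$CSV2RDF4LOD_PUBLISH_OUR_SOURCE_ID"
today=`date +%Y-%b-%d`

# Make a cockpit for a new versioned dataset.
cockpit="$sourceID/dataset/cr-publish-dcat-to-endpoint/version/$today"
mkdir -p "$cockpit/source"

# Hard link the members of the aggregation into the new cockpit's source/.
for dcat in */*/dcat.ttl; do
   sdv=`dirname "$dcat" | tr '/' '-'`   # e.g. hub-healthdata-gov-third-national-survey-older
   ln "$dcat" "$cockpit/source/$sdv-dcat.ttl"
done

# Aggregate source/* into publish/*, then publish like a normal dataset.
pushd "$cockpit" > /dev/null
   aggregate-source-rdf.sh --link-as-latest source/*
   publish/bin/virtuoso-load-*.sh
popd > /dev/null

dryrun.sh $dryrun ending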
Each of the aggregation types described above is performed by $CSV2RDF4LOD_HOME/bin/util/cr-virtuoso-load-metadata.sh, whose behavior is controlled by certain CSV2RDF4LOD environment variables:
- CSV2RDF4LOD_CONVERT_DATA_ROOT - the [data root](csv2rdf4lod data root) from which to aggregate and publish. See [this](Publishing conversion results with a Virtuoso triplestore), too.
- CSV2RDF4LOD_PUBLISH_LOD_MATERIALIZATION_WWW_ROOT - the /var/www directory that publishes files to the web.
- CSV2RDF4LOD_PUBLISH_VIRTUOSO_SPARQL_ENDPOINT - the endpoint that ${CSV2RDF4LOD_HOME}/bin/util/virtuoso/vload populates.
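For example (hypothetical values; these are typically set in a source-me.sh before invoking the script, and localhost:8890/sparql is Virtuoso's default endpoint):

export CSV2RDF4LOD_CONVERT_DATA_ROOT="/srv/twc-healthdata/data/source"
export CSV2RDF4LOD_PUBLISH_LOD_MATERIALIZATION_WWW_ROOT="/var/www"
export CSV2RDF4LOD_PUBLISH_VIRTUOSO_SPARQL_ENDPOINT="http://localhost:8890/sparql"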
- Queries using non-subsets of data are at Querying datasets created by csv2rdf4lod.
- Namespace prefix handling
- Secondary Derivative Datasets is a more informative sibling of aggregating subsets of converted datasets.