Triggers
- Automated creation of a new Versioned Dataset is a predecessor to the generalized trigger pattern described here.
- Triggers are a central part of producing Secondary Derivative Datasets.
This page describes how to use and create triggers in csv2rdf4lod-automation. Triggers are used to encapsulate the replication/reproduction of a stage of conversion:
Retrieval triggers generate files in a conversion cockpit's source/ directory.
#!/bin/bash
#
#3> @prefix doap: <http://usefulinc.com/ns/doap#> .
#3> @prefix dcterms: <http://purl.org/dc/terms/> .
#3> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
#3>
#3> <> a conversion:RetrievalTrigger, doap:Project; # Could also be conversion:Idempotent;
#3> dcterms:description
#3> "Script to retrieve and convert a new version of the dataset.";
#3> rdfs:seeAlso
#3> <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Automated-creation-of-a-new-Versioned-Dataset>,
#3> <https://github.com/timrdf/csv2rdf4lod-automation/wiki/tic-turtle-in-comments>;
#3> .
A retrieval trigger may be global, i.e. cross-version. In this case, the retrieval trigger also creates the conversion cockpit directory and includes the logic to determine whether or not a new version should be created. Because of this check, calling the retrieval trigger multiple times has no additional effect.
A retrieval trigger may instead apply only to a specific dataset version, which can make sense when the source organization handles versioning themselves (e.g. sociam). Global retrieval triggers are much more common than these local, version-specific ones.
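For concreteness, here is a minimal sketch of a global retrieval trigger, assumed to live in the cr:directory-of-versions directory; the download URL and the date-based version identifier are illustrative assumptions, not requirements of csv2rdf4lod-automation:

```bash
#!/bin/bash
#
# Minimal sketch of a *global* (cross-version) retrieval trigger.
# The URL and the date-based version identifier are illustrative assumptions.

url='http://example.org/data/current.csv'   # hypothetical source URL
version=`date +%Y-%b-%d`                     # assumed version identifier

if [[ -d "$version" ]]; then
   # Calling the trigger again has no effect once the version exists.
   echo "$version already retrieved; nothing to do."
   exit 0
fi

mkdir -p "$version/source"                   # create the conversion cockpit
curl -s -L "$url" > "$version/source/current.csv"
```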
cr-retrieve.sh and cr-create-versioned-dataset-dir.sh can both be used as a template for writing your own retrieval trigger. cr-retrieve.sh is newer than cr-create-versioned-dataset-dir.sh, but they perform different functions.
- cr-create-versioned-dataset-dir.sh can only be run from `cr:directory-of-versions` or `cr:conversion-cockpit` directories, while cr-retrieve.sh can be run from anywhere within the data root (`cr:data-root`, `cr:source`, `cr:dataset`, `cr:directory-of-versions`, `cr:conversion-cockpit`).
- cr-retrieve.sh accepts the argument `--skip-if-exists` to avoid retrieving when a version already exists (here; a sample invocation follows this list), while cr-create-versioned-dataset-dir.sh determines what the version identifier should be and quits if it is the same (here).
- cr-retrieve.sh will leverage DCAT access metadata if it is present (here); cr-create-versioned-dataset-dir.sh requires the download URL as a command-line argument (here).
- cr-retrieve.sh handles Google Spreadsheet URLs specially (here); cr-create-versioned-dataset-dir.sh does not (that handling would require a whole new script, google2source.sh).
- cr-retrieve.sh relies upon cr-create-versioned-dataset-dir.sh (here) to retrieve non-Google-Spreadsheet URLs in DCAT access metadata.
- cr-retrieve.sh will defer to a custom retrieve.sh trigger if it is present (here).
- cr-create-versioned-dataset-dir.sh performs file-specific handling on anything that is retrieved (e.g. unzipping ZIPs, CSV-ifying XLS, tidying HTML to valid XML, DROIDing for file formats, etc.); cr-retrieve.sh does not.
- cr-create-versioned-dataset-dir.sh will defer to a custom preparation trigger (here) if it exists.
- cr-create-versioned-dataset-dir.sh will pull the general conversion trigger `cr-convert.sh` after retrieving and preparing (here).
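For example, using the `--skip-if-exists` behavior noted above (the prompt location reuses the dataset path from the prepare.sh example later on this page):

```
data/source/datahub-io/corpwatch$ cr-retrieve.sh --skip-if-exists
```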
If the retrieval involves querying other SPARQL endpoints, consider using cache-queries.sh so that you can capture the provenance of the query.
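If cache-queries.sh is not at hand, the same idea can be approximated by hand. A minimal sketch, assuming an illustrative endpoint and reusing the `source/reverts.rq` naming that appears in the clean example later on this page:

```bash
#!/bin/bash
# Hand-rolled approximation of a cached SPARQL retrieval; cache-queries.sh
# does this properly and also captures provenance. The endpoint URL is an
# illustrative assumption; the file names mirror the clean example below.
endpoint='http://dbpedia.org/sparql'
query='source/reverts.rq'                      # the query text stays in source/
curl -s -G "$endpoint" \
     --data-urlencode "query@$query" \
     -H 'Accept: text/csv' > "$query.sparql.csv"
```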
(This paragraph has been superseded by Secondary Derived Datasets' enabling mechanism.) Should the system-defined derived secondary datasets be enabled by creating their corresponding dataset directory in the data root (or, since git won't commit empty directories, by an `enabled.txt` existing in that directory)? This would be much simpler than editing the cron job: the cron job could attempt everything, and each dataset would simply be skipped if its directory does not exist. Then an `--enable` (or `--force`?) flag could be used to create the dataset even if the directory doesn't exist. If we adopt this design, we'll need to revisit many of the existing derived secondary datasets.
Remember to be polite when requesting:
#!/bin/bash
# Throttle and stagger requests so as not to hammer the source site.
# $url and $here are assumed to be set earlier in the retrieval trigger.
bps=$(($RANDOM%2000))                      # random rate limit, up to ~2 MB/s
echo "bps $bps"
curl --limit-rate ${bps}K -L "$url" > "$here"
sec=$(($RANDOM%15))                        # random pause, up to 14 seconds
echo "bps $bps; zzz $sec..."
sleep $sec
Trigger crib sheets are here.
I often use this wget idiom to create a `wget` version, which can be considered an alias for "latest"; a tarball of that version can then be made for any particular archive date as another version.
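A rough sketch of that idiom, assuming the script runs in the dataset directory (one level above its `version/` directory-of-versions); the URL, tarball name, and date format are illustrative assumptions:

```bash
#!/bin/bash
# Sketch of the "wget version as latest" idiom. The URL, tarball name,
# and date format are illustrative assumptions.
url='http://example.org/data/'

# 1) Mirror into a version literally named "wget"; it plays the role of "latest".
mkdir -p version/wget/source
(cd version/wget/source && wget --mirror --no-parent "$url")

# 2) Freeze "latest" into its own dated version whenever an archive is wanted.
today=`date +%Y-%b-%d`
mkdir -p "version/$today/source"
tar czf "version/$today/source/wget-snapshot.tar.gz" -C version/wget source
```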
A global `prepare.sh` (or, historically, `2manual.sh`) can encapsulate intermediate tweaks that may be required before conversion. Preparation triggers use files in source/ and generate files in manual/.
#!/bin/bash
#
#3> <> a conversion:PreparationTrigger; # Could also be conversion:Idempotent;
#3> foaf:name "prepare.sh";
#3> rdfs:seeAlso
#3> <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Automated-creation-of-a-new-Versioned-Dataset>,
#3> <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Triggers>,
#3> <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Conversion-trigger>,
#3> <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Conversion-cockpit>;
#3> .
#
# This script is responsible for processing files in source/ and storing their modified forms
# as files in the manual/ directory. These modified files should be ready for conversion.
#
# This script is also responsible for constructing the conversion trigger
# (e.g., with cr-create-conversion-trigger.sh -w manual/*.csv)
#
# When this script resides in a cr:directory-of-versions directory,
# (e.g. source/datahub-io/corpwatch/version)
# it is invoked by retrieve.sh (or cr-retrieve.sh).
# (see https://github.com/timrdf/csv2rdf4lod-automation/wiki/Directory-Conventions)
#
# When this script is invoked, the conversion cockpit is the current working directory.
#
- If `source/*.xsl` exist, then they will be converted to `manual/*.csv` because of this.
- `../../src/html2csv.xsl` will convert a `source/*.html.tidy` into `manual/*.csv` because of this (a hand-run equivalent is sketched after this list).
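A hand-run equivalent of the html2csv.xsl bullet above; xsltproc, the loop, and the output naming are assumptions for illustration rather than the automation's exact behavior:

```bash
# Apply the repository's html2csv.xsl stylesheet by hand; xsltproc and the
# output naming are assumptions for illustration.
mkdir -p manual
for tidy in source/*.html.tidy; do
   csv="manual/`basename "${tidy%.html.tidy}"`.csv"
   xsltproc ../../src/html2csv.xsl "$tidy" > "$csv"
done
```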
Mapping source paths to automatic paths:
for owl in `find source -name "*.owl"`; do
   # e.g. source/a/b.owl -> automatic/a/b.ttl
   turtle="automatic/${owl#source/}" && turtle="${turtle%.owl}.ttl"
   echo "$owl -> $turtle"
done
#!/bin/bash
#
#3> <> a conversion:ConversionTrigger; # Could also be conversion:Idempotent;
#3> foaf:name "convert.sh";
#3> rdfs:seeAlso <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Triggers#wiki-3-computation-triggers>;
#3> .
#
Types of computation (a minimal convert.sh sketch follows this list):
- Transformation via csv2rdf4lod
- Invoking a local SADI service (e.g. from DataFAQs)
- Scraping HTML
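A minimal sketch of a convert.sh body; cr-convert.sh is the general conversion trigger named on this page, and the rest is an illustrative placeholder for dataset-specific computation:

```bash
#!/bin/bash
# Minimal convert.sh sketch: reads manual/ (or source/), writes automatic/.
# cr-convert.sh is the general conversion trigger named on this page;
# everything else here is an illustrative placeholder.
cr-convert.sh            # transformation via csv2rdf4lod

# Placeholder for other computation, e.g. invoking a local SADI service
# or scraping HTML into automatic/.
```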
#!/bin/bash
#
#3> <> a conversion:PublicationTrigger; # Could also be conversion:Idempotent;
#3> foaf:name "publish.sh";
#3> rdfs:seeAlso <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Triggers#wiki-4-publication-triggers>;
#3> .
#
cr-publish.sh can be used to find and pull all publication triggers. It can be dry-run with the argument `-n`, and it can be told to pull only triggers that promise to be idempotent with the argument `--idempotent`.
data/source$ cr-publish.sh -n --idempotent
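Triggers conventionally accept a `clean` argument that removes what they produced (see the list after this snippet); for example, a trigger that had cached SPARQL results might begin with: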
if [[ "$1" == 'clean' ]]; then
echo rm source/reverts.rq.sparql*
rm source/reverts.rq.sparql*
if [[ "$2" == 'all' ]]; then
echo rm automatic/*
rm automatic/*
fi
exit
fi
- `./retrieve.sh clean` removes source/, automatic/
- `./prepare.sh clean` does not remove source/, but removes automatic/
- `./convert.sh clean`
- `./publish.sh clean`
Some useful xargs idioms:
- `-L1`
- `find . -type f -name "[^.]*" | xargs -P 4 -I {} rapper -g -c {} 2>&1 | grep returned > rapper.out`
- `find ../source -name "btc--chunk.gz" | xargs -n 1 p-and-c.sh -u | gzip > v.gz`
- `cr-convert.sh --xargs | xargs --max-args 1 --max-procs 2 bash`
- `ls -lt | awk '{print $8}' | grep 2014 | head -1 | xargs -I {} cat {}/source/opendap-provenance.ttl`
- `find tmp -maxdepth 1 -name '*.mp3' -print0 | xargs -0 rm`
- http://offbytwo.com/2011/06/26/things-you-didnt-know-about-xargs.html
Pattern to parallelize processing by calling ourselves recursively:
if [[ -e "$1" ]]; then
while [[ $# -gt 0 ]]; do
json="$1" && shift
ttl="$json.ttl"
if [[ ! -e $ttl ]]; then
mkdir -p `dirname $ttl`
fi
done
exit
else
# https://github.com/timrdf/csv2rdf4lod-automation/wiki/Triggers#parallelize-with-recursive-calls-via-xargs
find -L automatic -name *.json -print0 | \
xargs -0 -n 1 -P ${CSV2RDF4LOD_CONCURRENCY:-1} -I json $0 json
fi
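With this pattern, calling the trigger with no arguments walks `automatic/` and re-invokes itself once per JSON file, with `CSV2RDF4LOD_CONCURRENCY` controlling how many copies run at once (defaulting to 1).

Finally, a trigger can recognize a `-n` dry-run argument of its own, mirroring cr-publish.sh's `-n`: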
[[ "$1" == '-n' ]] && dryrun='yes' && shift || dryrun=''
if [[ -n "$dryrun" ]]; then
echo "it's a not a dryrun!"
fi