Script: pcurl.py
$CSV2RDF4LOD_HOME/bin/util/pcurl.py is Jim McCusker's reimplementation of pcurl.sh to include FRBR stacks and HTTP-in-RDF. He has included it as part of csv2rdf4lod-automation. Applications of this utility are described in the following publications:
bash-3.2$ pcurl.py --help
usage: pcurl.py [--help|-h] [--format|-f xml|turtle|n3|nt] [url ...]
Download a URL and compute Functional Requirements for Bibliographic Resources
(FRBR) stacks using cryptograhic digests for the resulting content.
Refer to http://purl.org/twc/pub/mccusker2012parallel
for more information and examples.
optional arguments:
url url to compute a FRBR stack for.
-h, --help Show this help message and exit,
-f, --format File format for FRBR stacks. One of xml, turtle, n3, or nt.
fstack.py is closely associated with pcurl.py. While pcurl.py retrieves a URL and records its FRBR stack, fstack.py can be used to create a FRBR stack for an existing local file.
bash-3.2$ fstack.py --help
usage: fstack.py [--help|-h] [--stdout|-c] [--format|-f xml|turtle|n3|nt] [--print-item] [--print-manifesation] [--print-expression] [--print-work] [-] [file ...]
Compute Functional Requirements for Bibliographic Resources (FRBR)
stacks using cryptograhic digests.
Refer to http://purl.org/twc/pub/mccusker2012parallel
for more information and examples.
optional arguments:
file File to compute a FRBR stack for.
- Read content from stdin and print FRBR stack to stdout.
-h, --help Show this help message and exit,
-c, --stdout Print frbr stacks to stdout.
--no-paths Only output path hashes, not actual paths.
-f, --format File format for FRBR stacks. xml, turtle, n3, or nt.
--print-item Print URI of the Item and quit.
--print-manifestation Print URI of the Manifestation and quit.
--print-expression Print URI of the Expression and quit.
--print-work Print URI of the Work and quit.
The following command retrieves the latest pcurl.py script and stores it in a file in your current directory. The script also writes a second file describing the provenance of the one retrieved.
bash-3.2$ pcurl.py https://raw.github.com/timrdf/csv2rdf4lod-automation/master/bin/util/pcurl.py
bash-3.2$ ls
pcurl.py.prov.ttl pcurl.py
If something happens to the file you retrieved (e.g., it is copied or renamed), $CSV2RDF4LOD_HOME/bin/util/fstack.py can be used to recognize the association between the downloaded file and the one we see now:
bash-3.2$ cp pcurl.py mypcurl.py
bash-3.2$ fstack.py mypcurl.py
bash-3.2$ ls
pcurl.py.prov.ttl pcurl.py mypcurl.py mypcurl.py.prov.ttl
To see that the different files pcurl.py and mypcurl.py have the same bitstream, we can look at the snippets of the FRBR stacks shown below and compare the frbr:Manifestation referenced by the frbr:exemplarOf predicate. pcurl.py and mypcurl.py are different frbr:Items with the same frbr:Manifestation.
# from pcurl.py.prov.ttl:
<tag:tw.rpi.edu,2011:filed:SVbQMPyfteayT_XeWKRnygrxhqoAMncsgdRwexQtugw=/sha-256-gvr2NDAF7C0HOGuGFEoYwIbs7mQit_TABy8hQJHIlhU=/pcurl.py>
a frbr:Item;
nfo:fileUrl <file:////Users/lebot/pcurl.py>,
<pcurl.py>;
dcterms:modified "2012-01-03T11:05:33"^^xsd:dateTime;
frbr:exemplarOf <tag:tw.rpi.edu,2011:Manifestation:sha-256-81X-JdHSWIdGwDaFk8Mlv8iW_TqlUpG2UCZh1ue04HU=>;
...
# from mypcurl.py.prov.ttl:
<tag:tw.rpi.edu,2011:filed:SVbQMPyfteayT_XeWKRnygrxhqoAMncsgdRwexQtugw=/sha-256-gvr2NDAF7C0HOGuGFEoYwIbs7mQit_TABy8hQJHIlhU=/mypcurl.py>
a frbr:Item;
nfo:fileUrl <file:////Users/lebot/mypcurl.py>,
<mypcurl.py>;
dcterms:modified "2012-01-03T11:05:33"^^xsd:dateTime;
frbr:exemplarOf <tag:tw.rpi.edu,2011:Manifestation:sha-256-81X-JdHSWIdGwDaFk8Mlv8iW_TqlUpG2UCZh1ue04HU=>;
...
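The comparison described above can be automated. The following is a minimal sketch, not part of fstack.py, that pulls the objects of frbr:exemplarOf out of two Turtle snippets with a regular expression and checks whether they overlap. The function names and the short `urn:example:` URIs are hypothetical stand-ins; a real script would more likely parse the .prov.ttl files with an RDF library.

```python
import re

# Matches the object URI of a frbr:exemplarOf statement in a Turtle snippet.
EXEMPLAR_OF = re.compile(r"frbr:exemplarOf\s+<([^>]+)>")


def manifestations(turtle_text):
    """Return the set of URIs that appear as objects of frbr:exemplarOf."""
    return set(EXEMPLAR_OF.findall(turtle_text))


def same_bitstream(turtle_a, turtle_b):
    """True if the two descriptions share at least one frbr:Manifestation."""
    return bool(manifestations(turtle_a) & manifestations(turtle_b))


# Hypothetical, abbreviated stand-ins for the contents of two .prov.ttl files.
original = "<urn:example:item-a> a frbr:Item; frbr:exemplarOf <urn:example:manifestation-1> ."
copied = "<urn:example:item-b> a frbr:Item; frbr:exemplarOf <urn:example:manifestation-1> ."
edited = "<urn:example:item-c> a frbr:Item; frbr:exemplarOf <urn:example:manifestation-2> ."

print(same_bitstream(original, copied))
print(same_bitstream(original, edited))
```

To apply this to real output, read pcurl.py.prov.ttl and mypcurl.py.prov.ttl into strings and pass them to `same_bitstream`.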
A file's absolute directory path and modification date are used to name the frbr:Item. If either change, a new name is given. The file's directory path includes the machine that is hosting the directory.
The name for the frbr:Item tag:tw.rpi.edu,2011:filed:SVbQMPyfteayT_XeWKRnygrxhqoAMncsgdRwexQtugw=/sha-256-gvr2NDAF7C0HOGuGFEoYwIbs7mQit_TABy8hQJHIlhU=/pcurl.py is constructed by concatenating:
- tag:tw.rpi.edu,2011:
- filed:
- SVbQMPyfteayT_XeWKRnygrxhqoAMncsgdRwexQtugw= (a hash of the machine hosting the directory) /
- sha-256-gvr2NDAF7C0HOGuGFEoYwIbs7mQit_TABy8hQJHIlhU= (a hash of the directory and the modification date of the file) /
- pcurl.py (the file name)
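The concatenation above can be sketched in a few lines of Python. This is an illustration only: the exact strings that fstack.py hashes are not documented here, so the inputs to `digest` (a host name, and the directory joined with the modification date) and the URL-safe base64 encoding of a SHA-256 digest are assumptions based on the shape of the hashes shown. The function names and the host name are hypothetical.

```python
import base64
import hashlib

TAG_PREFIX = "tag:tw.rpi.edu,2011:"


def digest(text):
    """URL-safe base64 of the SHA-256 digest of a string (assumed encoding)."""
    raw = hashlib.sha256(text.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(raw).decode("ascii")


def item_name(machine, directory, modified, file_name):
    """Concatenate the listed parts into an frbr:Item name."""
    machine_hash = digest(machine)  # assumed input: a host identifier
    # Prefix as it appears in the Turtle snippets; hashed input is assumed.
    path_hash = "sha-256-" + digest(directory + modified)
    return TAG_PREFIX + "filed:" + machine_hash + "/" + path_hash + "/" + file_name


# Hypothetical host name; directory and date taken from the snippets above.
name = item_name("example-host", "/Users/lebot", "2012-01-03T11:05:33", "pcurl.py")
touched = item_name("example-host", "/Users/lebot", "2012-01-04T09:00:00", "pcurl.py")

print(name)
print(name != touched)
```

Under these assumptions, changing either the directory or the modification date changes the second hash and therefore the name, which matches the behavior described for the frbr:Item.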
(todo)
<tag:tw.rpi.edu,2011:Manifestation:sha-256-81X-JdHSWIdGwDaFk8Mlv8iW_TqlUpG2UCZh1ue04HU=>
If any character of mypcurl.py changes, the derived frbr:Item will have a different frbr:Manifestation and frbr:Expression from that of pcurl.py because we cannot automatically identify these more abstract notions for the procedural Python instructions.
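A short sketch shows why a byte-level digest behaves this way. It assumes, based on the sha-256 labels in the URIs above, that the digest is a URL-safe base64 encoding of SHA-256 over the file's bytes; `content_digest` is a hypothetical helper, not a function from fstack.py.

```python
import base64
import hashlib


def content_digest(content):
    """Label a byte string with its SHA-256 digest (assumed URL-safe base64)."""
    raw = hashlib.sha256(content).digest()
    return "sha-256-" + base64.urlsafe_b64encode(raw).decode("ascii")


script = b"print('hello')\n"
copy_of_script = bytes(script)       # a plain copy keeps every byte
edited_script = b"print('hello!')\n"  # a single added character

print(content_digest(script) == content_digest(copy_of_script))
print(content_digest(script) == content_digest(edited_script))
```

A copy or rename leaves the bytes untouched, so the digest is stable, while any edit produces an unrelated digest even if the program's meaning is unchanged.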
However, this shortcoming can be overcome when your files encode RDF instead of procedural code. To demonstrate this, we use $CSV2RDF4LOD_HOME/bin/util/tic.sh to obtain a partial RDF description of the Python script, such as its author.
bash-3.2$ tic.sh mypcurl.py > mypcurl.py.ttl
bash-3.2$ cat mypcurl.py.ttl | grep "doap:developer"
doap:developer twi:JamesMcCusker ;
Although changing the serialization of the Turtle describing mypcurl.py results in a new frbr:Manifestation, the new frbr:Item is associated with the same frbr:Expression as the first.
bash-3.2$ rapper -q -g -o rdfxml-abbrev mypcurl.py.ttl > mypcurl.py.ttl.rdf
bash-3.2$ fstack.py --no-paths mypcurl.py.ttl
bash-3.2$ fstack.py --no-paths mypcurl.py.ttl.rdf
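The idea behind a serialization-independent frbr:Expression can be illustrated with a toy example. The sketch below is not the graph digest that fstack.py computes: it assumes a graph with no blank nodes, represents it as plain tuples, and uses hypothetical function names and triples. It only shows that hashing the bytes depends on how the triples are written out, while hashing a sorted, canonical form does not.

```python
import hashlib


def byte_digest(serialization):
    """Digest of one particular serialization (analogous to a Manifestation)."""
    return hashlib.sha256(serialization.encode("utf-8")).hexdigest()


def graph_digest(triples):
    """Order-independent digest of a set of triples (analogous to an Expression).

    Toy canonicalization: assumes no blank nodes and already-normalized terms.
    """
    lines = sorted(" ".join(triple) for triple in set(triples))
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Hypothetical triples describing the script.
triples = [
    ("<urn:example:mypcurl.py>", "doap:developer", "twi:JamesMcCusker"),
    ("<urn:example:mypcurl.py>", "dcterms:title", '"mypcurl.py"'),
]
reordered = list(reversed(triples))

# Two serializations of the same graph that differ only in statement order.
first = "\n".join(" ".join(t) + " ." for t in triples)
second = "\n".join(" ".join(t) + " ." for t in reordered)

print(byte_digest(first) == byte_digest(second))
print(graph_digest(triples) == graph_digest(reordered))
```

Real RDF graphs need a proper canonicalization step, particularly for blank nodes, which this sketch deliberately leaves out.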
Some endeavors in the FRBR stack are named in the tag scheme. This uses a reserved namespace into which anyone can compute hashes, allowing for "serendipitous" URI collision. Since tag URIs are not dereferenceable, there is no chance of the namespace being taken over by someone who would put misleading information into it.
Using the tag scheme is a pure approach, but it hinders discoverability.
We may also want HTTP so the RDF around the "pure, serendipitous" tag URIs can be discoverable. It's cute that the URIs of your file and my file align conceptually (and even physically on each of our computers). Now, let's get to actually finding the connection so we can learn something!
Perhaps we mix both in?
We automatically generate the tag URI AND an HTTP URI within our own namespace, then relate the two?
generated on Jim's machine:
tag:THE_HASH prov:alternateOf http://jimbo.org/id/frir/THE_HASH .
generated on Nick's machine:
tag:THE_HASH prov:alternateOf http://nick-o-roonie.org/id/frir/THE_HASH .
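As a rough sketch of this proposal, each machine could emit the linking statement from the same hash and its own namespace. The helper below is hypothetical and keeps the placeholder forms used above (tag:THE_HASH and an /id/frir/ path); it is not something pcurl.py or fstack.py currently does.

```python
def alternate_of(hash_value, namespace):
    """Build one statement relating the tag URI to a local HTTP URI."""
    tag_uri = "tag:" + hash_value  # placeholder form, as written in the notes
    http_uri = namespace.rstrip("/") + "/id/frir/" + hash_value
    return "<{}> prov:alternateOf <{}> .".format(tag_uri, http_uri)


# The same hash yields the same tag URI on both machines.
print(alternate_of("THE_HASH", "http://jimbo.org"))
print(alternate_of("THE_HASH", "http://nick-o-roonie.org"))
```

Because the tag URI is identical on both machines, the two statements share a subject, which is what would let a consumer connect the two HTTP URIs.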
owl:hasKey the hashes (and use tag: for them) and have the entities land where they will?