-- version 3.0
- adds text2naf and naf2rdf
- updates to latest components
- changes pipeline filtering logic for excluded components
- customizes components-library location
- improves exception handling and dependency check
- refactoring
Ubuntu User authored and sarnoult committed Apr 4, 2019
1 parent aafb9c0 commit 9d93032
Showing 53 changed files with 808 additions and 820 deletions.
12 changes: 6 additions & 6 deletions cfg/component_versions
@@ -1,28 +1,28 @@
#!/bin/sh
#
# Component versions for the VU-RM-Pipeline
# version: 1.0.0
#
# Author: Sophie Arnoult
# Date: 20/03/19
# Date: 03/04/19
#----------------------------------------------------

# GitHub commit numbers

v_text2naf=fa4178d
v_morphosyntactic_parser_nl=85b7603
v_ixa_pipe_ned=062a983
v_vua_resources=e730ce6
v_vua_resources=ef75f30
v_svm_wsd=8bb5319
v_vuheideltimewrapper=484ed80
v_vuheideltimewrapper=3762c0e
v_vua_srl_nl=23a18eb
v_vua_srl_dutch_nominal_events=6115b31
v_multilingual_factuality=cbad484
v_opinion_miner_deluxePP=3d99e85
v_ontotagger=c3796c5
v_eventcoreference=3d4b32c

# Other sources

v_alpino=Alpino-x86_64-Linux-glibc-2.23-21514-sicstus
v_ixa_pipes=1.1.1
v_dbpedia_spotlight=0.7.1
v_ontotagger=v3.1.1
v_eventcoreference=v3.1.1
93 changes: 0 additions & 93 deletions cfg/pipeline-no-nominal-events.yml

This file was deleted.

15 changes: 9 additions & 6 deletions cfg/pipeline.yml
@@ -1,5 +1,11 @@
- name: text2naf
input:
output:
- raw
cmd: text2naf.sh
- name: ixa-pipe-tok
input:
- raw
output:
- text
cmd: ixa-pipe-tok.sh
@@ -56,17 +62,16 @@
- vua-wsd
- name: vua-nominal-event-detection
input:
- text
- srl
- terms
output:
- srl
cmd: vua-nominal-event-detection.sh
after:
- vua-ontotagging
- name: vua-srl-dutch-nominal-events
input:
- terms
- deps
- srl
output:
- srl
cmd: vua-srl-dutch-nominal-events.sh
@@ -84,12 +89,10 @@
- name: vua-eventcoreference
input:
- srl
- terms
output:
- coreferences
cmd: vua-eventcoreference.sh
after:
- vua-srl
- vua-srl-dutch-nominal-events
- name: opinion-miner
input:
- text
15 changes: 9 additions & 6 deletions docs/Home.md
@@ -33,19 +33,22 @@ The script `run-pipeline.sh` allows to run the pipeline on a raw text document t

./scripts/run-pipeline.sh < input.txt > output.naf

See [advanced usage](https://github.com/cltl/vu-rm-pip3/blob/master/docs/installation.md) for more options.
See [advanced usage](https://github.com/cltl/vu-rm-pip3/blob/master/docs/usage.md) for more options.

### Docker
Alternatively, you can get and run a [Docker image of the pipeline](https://github.com/cltl/vu-rm-pip3/blob/master/docs/docker.md).

### RDF
The script `scripts/bin/naf2sem-grasp.sh` extracts an RDF file from the NAF files output by the pipeline. See [advanced usage](https://github.com/cltl/vu-rm-pip3/blob/master/docs/installation.md) for more information.

## Further reading

- [the pipeline wrapper](https://github.com/cltl/vu-rm-pip3/blob/master/docs/operation.md): information on the pipeline-wrapper operation, detailing configuration, filtering, execution and error handling;
- [the Dutch pipeline](https://github.com/cltl/vu-rm-pip3/blob/master/docs/newsreader.md): lists the pipeline components used by the pipeline, as well as the dependencies between them;
- [installation and requirements](https://github.com/cltl/vu-rm-pip3/blob/master/docs/installation.md): requirements for installing the pipeline and instructions for installing on Windows;
- [the pipeline wrapper](https://github.com/cltl/vu-rm-pip3/blob/master/docs/operation.md): information on the pipeline-wrapper operation, detailing configuration, filtering, execution and error handling.
- [the Dutch pipeline](https://github.com/cltl/vu-rm-pip3/blob/master/docs/newsreader.md): lists the pipeline components used by the pipeline, as well as the dependencies between them.
- [installation and requirements](https://github.com/cltl/vu-rm-pip3/blob/master/docs/installation.md): requirements and installation instructions for Linux and Windows.
- [pipeline configuration](https://github.com/cltl/vu-rm-pip3/blob/master/docs/configuration.md): pipeline configuration, input/output files and instructions to modify the pipeline or its components.
- [advanced usage](https://github.com/cltl/vu-rm-pip3/blob/master/docs/usage.md): pipeline argument and advanced usage examples;
- [Docker image](https://github.com/cltl/vu-rm-pip3/blob/master/docs/docker.md): getting and running the docker image;
- [advanced usage](https://github.com/cltl/vu-rm-pip3/blob/master/docs/usage.md): pipeline arguments and advanced usage examples.
- [Docker image](https://github.com/cltl/vu-rm-pip3/blob/master/docs/docker.md): getting and running the docker image.

## Contact

6 changes: 3 additions & 3 deletions docs/configuration.md
@@ -51,12 +51,12 @@ The installation script currently hard-codes the sources of the pipeline compone
To add or replace pipeline components involves three steps:

- add installation instructions to the installation script `./scripts/install.sh`, and specify the component version in `./cfg/component_versions`. The latter is loaded by the installation script, and provides a quick overview of the versions of the components used by the pipeline.
- add the component to the pipeline yaml configuration file. You should specify the NAF input/output layers for that component, the name of its execution script, and possible dependencies with regard to other pipeline components.
- add an executable script for the component
- add the component to the pipeline yaml configuration file `./cfg/pipeline.yml`. You should specify the NAF input/output layers for that component, the name of its execution script, and possible dependencies with regard to other pipeline components.
- add an executable script for the component in `./scripts/bin`.

### Running the pipeline with alternative components

To have two pipelines differ by a single component, the simplest method is to specify two pipeline configuration files, one for each alternative component.
To have two pipelines differ by a single component, one can specify two pipeline configuration files (`./cfg/pipeline1.yml` and `./cfg/pipeline2.yml`) that differ by that component.

### Modifying the settings or arguments for a given component

13 changes: 12 additions & 1 deletion docs/docker.md
@@ -13,21 +13,27 @@ docker build -t vu-rm-pip3 .
```

## Usage
### Running the pipeline
The image takes a raw text file (UTF-8) as argument, and returns a processed NAF file (to `stdout`) and a log file (through `stderr`). The image accepts the following optional arguments:

- operation mode `-m`: can be set to

- `all` -- default, runs the full pipeline;
- `opinions` -- runs the pipeline up to the *opinion* layer;
- `srl` -- runs the pipeline up to the *srl* layer;
- `entities` -- runs the pipeline up to the *entities* layer, including named entity linking;

- nominal events switch `-n`: by default, all the components for the `srl` layer are run; with the nominal-events switch on, nominal-event detection and labelling components are *excluded*, and only the SRL components related to verbal predicates are run;
- alpino time out `-t`: defines the maximal per-sentence time budget for the Alpino parser (default is None);
- opinion-miner model data `-d`: defines the model data used by the opinion miner (default is 'news').

Alternatively, the flag `-w` lets you directly specify a string of wrapper arguments, as documented [here](https://github.com/cltl/vu-rm-pip3/blob/master/docs/usage.md).

### RDF conversion
The image can also be used to extract an RDF representation of a NAF file, using the flag `-r`.

### Example
To run the image on an example file `./example/test.txt` with the `opinions` mode, and write the output and log files back into `./example/`, run:
To run the image `vucltl/vu-rm-pip3` on an example file `./example/test.txt` with the `opinions` mode, and write the output and log files back into `./example/`, run:
```
docker run -v $(pwd)/example/:/wrk/ vucltl/vu-rm-pip3 -m opinions /wrk/test.txt > example/test.out 2> example/test.log
```
@@ -36,3 +42,8 @@ Alternatively, you can call the pipeline with its wrapper arguments
docker run -v $(pwd)/example/:/wrk/ vucltl/vu-rm-pip3 -w "-o opinions" /wrk/test.txt > example/test.out 2> example/test.log
```

To run the full pipeline, and extract RDF from its output, run
```
docker run -v $(pwd)/example/:/wrk/ vucltl/vu-rm-pip3 /wrk/test.txt > example/test.naf 2> example/test.log
docker run -v $(pwd)/example/:/wrk/ vucltl/vu-rm-pip3 -r /wrk/test.naf > example/test.rdf
```
30 changes: 24 additions & 6 deletions docs/installation.md
@@ -6,15 +6,24 @@
The VU Reading Machine can be run on Linux and Windows using WSL; for Mac, you will need to adapt the installation of some component dependencies, notably the Alpino parser.

### Component dependencies
Not all component dependencies are installed by `install.sh`. Dependencies include: java (jdk8), maven3, python3, pip3, lib2to3, timbl, libtcl, libtk, libxslt, libxss1, libxft2, unzip, gawk, gcc, git, bash and lsof.
Not all component dependencies are installed by `install.sh`.
For Ubuntu 18.04, run:

sudo apt-get install g++ gawk git libxslt-dev make maven tcl timbl tk unzip python3-pip python3-venv

### Python components and environment
The pipeline is written for python 3, and was tested with python 3.5 and 3.6. The wrapper is not compatible with python 2. Required python packages for the pipeline are recorded under `./cfg/requirements.txt`.

Within the python environment of your choice, do:
```
pip install -r ./cfg/requirements.txt
```
You can define python 3 as your default python through

echo "alias python='python3'" >> ~/.bash_aliases
source ~/.bash_aliases

Next, create a virtual environment, and install `wheel` and `python-pytest`

python -m venv venv3
source venv3/bin/activate
pip install wheel python-pytest

### Java components
The pipeline also depends on a number of java components, most of which must be compiled with Maven. We tested the compilation of these components with Maven 3.5.4 and Java 1.8.
@@ -30,12 +39,21 @@ export PATH=${MAVEN_HOME}/bin:${JAVA_HOME}/bin:${PATH}
The VU-RM-PIP3 pipeline repository contains the python 3 wrapper as well as code to install and run components of the Dutch NewsReader pipeline. To clone the VU-RM-PIP3 repository:

git clone https://github.com/cltl/vu-rm-pip3.git
cd vu-rm-pip3

Install Python dependencies from within the Python environment of your choice:

pip install -r cfg/requirements.txt

Run the script `install.sh` to install the components of the Dutch NewsReader pipeline:

./scripts/install.sh

The installation script loads a file `./cfg/component_versions` that records the versions of the pipeline components (either GitHub commit numbers or version tags). Installed components, models and resources are stored in `./lib/`.
The installation script loads a file `./cfg/component_versions` that records the versions of the pipeline components (either GitHub commit numbers or version tags). Installed components, models and resources are stored in `./lib/` by default.
The installation script accepts two arguments:

- `-c`: clean install; removes the components library
- `-l`: sets a different path for the components library

The script `run-pipeline.sh` allows to run the pipeline on a raw text document to produce a fully annotated NAF document:

13 changes: 8 additions & 5 deletions docs/newsreader.md
@@ -3,6 +3,7 @@
## NAF layers
NAF annotations in the Dutch pipeline consist of the following layers:

- raw: raw text
- text: tokenized words
- terms: word senses combined with morphosyntactic information
- deps: dependency parses
@@ -18,14 +19,15 @@ NAF annotations in the Dutch pipeline consist of the following layers:
## Components
Our version of the Dutch NewsReader pipeline uses the following components:

- NAF formatting: [text2naf](https://github.com/cltl/text2naf)
- tokenizing: [ixa-pipe-tok](https://github.com/ixa-ehu/ixa-pipe-tok)
- POS tagging, lemmatization and parsing: [vua-alpino](https://github.com/cltl/morphosyntactic\_parser\_nl)
- named entity recognition: [ixa-pipe-nerc](https://github.com/ixa-ehu/ixa-pipe-nerc/blob/master/README.md)
- named entity disambiguation: [ixa-pipe-ned](https://github.com/ixa-ehu/ixa-pipe-ned/blob/master/README.md)
- word sense disambiguation: [vua-wsd](https://github.com/cltl/svm\_wsd)
- time/date standardisation: [vuheideltimewrapper](https://github.com/cltl/vuheideltimewrapper)
- predicate-matrix tagging: [vua-ontotagging](https://github.com/cltl/OntoTagger)
- semantic role labelling: [vua-srl](https://github.com/newsreader/vua-srl-nl)
- semantic role labelling: [vua-srl-nl](https://github.com/sarnoult/vua-srl-nl)
- factuality: [multilingual\_factuality](https://github.com/cltl/multilingual\_factuality)
- opinion mining: [opinion\_miner\_deluxePP](https://github.com/rubenIzquierdo/opinion_miner_deluxePP)
- event coreference: [EventCoreference](https://github.com/cltl/EventCoreference)
@@ -41,18 +43,19 @@ Components either generate one or more layers or modify a layer. They depend on

component | input layers | *required components* | output layers
:---------|:--------------------------|:-------------|:-------
text2naf | | | raw
ixa-pipe-tok | raw | | text
vua-alpino | text | | terms, deps, constituents
ixa-pipe-nerc | text, terms | | entities
ixa-pipe-ned | entities | | entities
vuheideltimewrapper | text, terms | | timeExpressions
vua-wsd | text, terms | | terms
vua-ontotagging | terms | *+vua-wsd* | terms
vua-srl | terms, deps, constituents | | srl
vua-framenet-classifier | terms, srl | *+vua-wsd, vua-srl, vua-ontotagging* | srl
vua-nominal-event-detection | text, terms | *+vua-wsd, vua-ontotagging* | srl
vua-srl-nl | terms, deps, constituents | | srl
vua-framenet-classifier | terms, srl | *+vua-srl-nl, vua-ontotagging* | srl
vua-nominal-event-detection | srl, terms | | srl
vua-srl-dutch-nominal-events | terms, dependencies, srl | *+vua-nominal-event-detection* | srl
vua-eventcoreference | srl | *+vua-srl, vua-srl-dutch-nominal-events* | coreferences
vua-eventcoreference | srl, terms | | coreferences
opinion-miner | text, terms, deps, constituents, entities | | opinions
multilingual-factuality | terms, coreferences, opinions | | factualities

15 changes: 9 additions & 6 deletions docs/operation.md
@@ -19,8 +19,10 @@ Each component is represented as a vertex in the graph. There is a one-to-one re

For each distinct component pair `(m_1, m_2)`, we add an edge from their respective vertices `v_1` to `v_2` if and only if:

- `v_1` outputs a NAF layer that is input to `v_2`
- *or* `m_1` is required by `m_2`
- `m_1` is required by `m_2`
- or `v_1` outputs a NAF layer that is input to `v_2`, and `m_1` and `m_2` do not both modify that layer

The last provision prevents cyclic dependencies: components that modify the same layer will appear as siblings in the graph.

Vertices with no incoming edges are connected to a root vertex.
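
The edge rules above can be illustrated with a minimal Python sketch. It is not the wrapper's actual code: the record layout (`name`, `input`, `output`, `after`) loosely mirrors entries of `cfg/pipeline.yml`, and the helper names `modifies` and `build_edges` are hypothetical. A component is assumed to "modify" a layer when that layer appears in both its input and its output.

```python
from collections import defaultdict

# Hypothetical component records, loosely mirroring cfg/pipeline.yml entries.
COMPONENTS = [
    {"name": "text2naf", "input": [], "output": ["raw"], "after": []},
    {"name": "ixa-pipe-tok", "input": ["raw"], "output": ["text"], "after": []},
    {"name": "vua-alpino", "input": ["text"],
     "output": ["terms", "deps", "constituents"], "after": []},
    {"name": "vua-wsd", "input": ["text", "terms"], "output": ["terms"], "after": []},
    {"name": "vua-ontotagging", "input": ["terms"], "output": ["terms"],
     "after": ["vua-wsd"]},
]


def modifies(component, layer):
    """Assumption: a component modifies a layer it both reads and writes."""
    return layer in component["input"] and layer in component["output"]


def build_edges(components):
    """Return a mapping from each vertex name to the set of its successors."""
    edges = defaultdict(set)
    for m1 in components:
        for m2 in components:
            if m1 is m2:
                continue
            # Rule 1: m1 is explicitly required by m2.
            if m1["name"] in m2["after"]:
                edges[m1["name"]].add(m2["name"])
                continue
            # Rule 2: m1 outputs a layer that m2 reads, unless both modify it.
            for layer in m1["output"]:
                if layer in m2["input"] and not (
                    modifies(m1, layer) and modifies(m2, layer)
                ):
                    edges[m1["name"]].add(m2["name"])
    # Vertices without incoming edges are attached to a root vertex.
    has_incoming = {target for targets in edges.values() for target in targets}
    for component in components:
        if component["name"] not in has_incoming:
            edges["root"].add(component["name"])
    return edges


if __name__ == "__main__":
    for source, targets in sorted(build_edges(COMPONENTS).items()):
        print(source, "->", sorted(targets))
```

In this sketch `vua-wsd` and `vua-ontotagging` both modify the `terms` layer, so the only edge between them comes from the explicit `after` requirement, and no cycle arises.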

@@ -38,17 +40,18 @@ The graph should be acyclic. Cyclic dependencies are detected when [sorting](#to

## Component filtering
The components configuration list allows to specify a maximal pipeline.
The graph created from this list can be refined to include only specific goal layers or input layers. It is also possible to exclude components that act on, i.e., produce or modify given input or goal layers.
Components can be excluded from this list prior to building the pipeline; one must ensure in this case that dependencies of downstream components are still satisfied.
The graph created from this list can be refined to include only specific goal layers or input layers.

#### Filtering by goal layers
Given a list of goal layers, the graph is filtered to keep only the components (vertices) that allow to produce these layers (keeping vertices on the path between root and these components). By default, all components with a goal layer are kept, including components that only modify that layer. The list of components can be reduced with an exclusion filter.
Given a list of goal layers, the graph is filtered to keep only the components (vertices) that allow to produce these layers (keeping vertices on the path between root and these components). By default, all components with a goal layer are kept, including components that only modify that layer.
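
As an illustration, goal-layer filtering can be sketched as a backward traversal from the components that output a goal layer. The graph fragment, the `OUTPUTS` table and the function name `filter_by_goal_layers` are assumptions made for this example and are not taken from the wrapper's code.

```python
# Hypothetical graph fragment: vertex name -> set of successor vertices.
EDGES = {
    "root": {"text2naf"},
    "text2naf": {"ixa-pipe-tok"},
    "ixa-pipe-tok": {"vua-alpino", "ixa-pipe-nerc"},
    "vua-alpino": {"ixa-pipe-nerc", "vua-srl-nl"},
    "ixa-pipe-nerc": {"ixa-pipe-ned"},
}

# Output layers per component (ixa-pipe-ned only modifies `entities`).
OUTPUTS = {
    "text2naf": ["raw"],
    "ixa-pipe-tok": ["text"],
    "vua-alpino": ["terms", "deps", "constituents"],
    "ixa-pipe-nerc": ["entities"],
    "ixa-pipe-ned": ["entities"],
    "vua-srl-nl": ["srl"],
}


def filter_by_goal_layers(edges, outputs, goal_layers):
    """Keep the components that output a goal layer, plus all their ancestors."""
    # Build the reverse adjacency so that we can walk towards the root.
    parents = {}
    for source, targets in edges.items():
        for target in targets:
            parents.setdefault(target, set()).add(source)
    goals = set(goal_layers)
    keep = {name for name, layers in outputs.items() if goals & set(layers)}
    stack = list(keep)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in keep:
                keep.add(parent)
                stack.append(parent)
    return keep


if __name__ == "__main__":
    print(sorted(filter_by_goal_layers(EDGES, OUTPUTS, ["entities"])))
```

With `entities` as the goal layer, both `ixa-pipe-nerc` (which creates the layer) and `ixa-pipe-ned` (which only modifies it) are kept, while `vua-srl-nl` is dropped.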

#### Filtering by input layers
The pipeline normally expects a raw input file, and at least one component operating on an empty (or null) input layer list. One can however filter the graph with a list of input layers. In that case, vertices that produce these layers are kept together with their children vertices. Other vertices are filtered out. Vertices with no incoming edges after this filtering are reconnected to the root vertex. The resulting pipeline is intended to operate on a NAF file that contains these input layers.
The pipeline normally expects a raw input file, and at least one component operating on an empty input-layer list. One can however filter the graph with a list of input layers. In that case, vertices that produce these layers are kept together with their children vertices. Other vertices are filtered out. Vertices with no incoming edges after this filtering are reconnected to the root vertex. The resulting pipeline is intended to operate on a NAF file that contains these input layers.
As with goal-layer filtering, all components acting on the input layers are kept, except otherwise specified by an exclusion filter.
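
A possible reading of input-layer filtering is sketched below. It assumes that "children vertices" means all descendants of the vertices that produce an input layer, and it ignores exclusion filters; the graph fragment and the name `filter_by_input_layers` are hypothetical.

```python
# Hypothetical graph fragment: vertex name -> set of successor vertices.
EDGES = {
    "root": {"text2naf"},
    "text2naf": {"ixa-pipe-tok"},
    "ixa-pipe-tok": {"vua-alpino", "ixa-pipe-nerc"},
    "vua-alpino": {"ixa-pipe-nerc", "vua-srl-nl"},
    "ixa-pipe-nerc": {"ixa-pipe-ned"},
}

OUTPUTS = {
    "text2naf": ["raw"],
    "ixa-pipe-tok": ["text"],
    "vua-alpino": ["terms", "deps", "constituents"],
    "ixa-pipe-nerc": ["entities"],
    "ixa-pipe-ned": ["entities"],
    "vua-srl-nl": ["srl"],
}


def filter_by_input_layers(edges, outputs, input_layers):
    """Keep producers of the input layers and their descendants (assumed reading)."""
    wanted = set(input_layers)
    keep = {name for name, layers in outputs.items() if wanted & set(layers)}
    stack = list(keep)
    while stack:
        node = stack.pop()
        for child in edges.get(node, ()):
            if child not in keep:
                keep.add(child)
                stack.append(child)
    # Restrict the edges to the kept vertices.
    filtered = {
        name: {child for child in edges.get(name, ()) if child in keep}
        for name in keep
    }
    # Reconnect vertices that lost all their incoming edges to the root.
    has_incoming = {child for children in filtered.values() for child in children}
    filtered["root"] = {name for name in keep if name not in has_incoming}
    return filtered


if __name__ == "__main__":
    for source, targets in sorted(filter_by_input_layers(EDGES, OUTPUTS, ["terms"]).items()):
        print(source, "->", sorted(targets))
```

Filtering on the `terms` layer here keeps `vua-alpino` and everything downstream of it, and `vua-alpino` becomes the only vertex attached to the root.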

#### Combined filtering
One can combine input-layer filtering with goal-layer filtering. In that case, input-layer filtering is performed first. Components to exclude from the input or goal layers are to be specified together: only components relevant to a given input/goal layer are taken into account during the corresponding filtering stage.
One can combine input-layer filtering with goal-layer filtering. In that case, input-layer filtering is performed first.


## Topological sorting and pipeline execution
Binary file modified docs/pipe-graph.png