Wrapper around the Dutch Alpino parser. It takes as input a text/NAF/KAF file with either raw text or tokens (processed by a tokeniser and sentence splitter) and generates the term layer (lemmas and rich morphological information), the constituency layer and the dependency layer.
There are two dependencies: the Alpino parser and the KafNafParserPy library, which is used for parsing NAF/KAF objects.
Step 1. For the Alpino parser you have two choices.
- For a local install, visit the Alpino homepage and follow the instructions to get Alpino installed, or run install_alpino.sh. Make sure to set ALPINO_HOME to point to the installation.
- To use an Alpino server instance (e.g. through alpino-docker), point ALPINO_SERVER to the HTTP address of the server (e.g. ALPINO_SERVER=http://localhost:5002).
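The two choices can be told apart from the environment. The following is a minimal sketch, not the wrapper's own logic: the function name alpino_mode is hypothetical, and the rule that ALPINO_SERVER wins when both variables are set is an assumption, since this README does not state a precedence.

```python
import os


def alpino_mode(env=None):
    """Return ("server", url) or ("local", path) depending on the environment.

    Assumption: ALPINO_SERVER takes precedence when both variables are set.
    The wrapper's real precedence is not documented here.
    """
    env = os.environ if env is None else env
    server = env.get("ALPINO_SERVER", "").strip()
    home = env.get("ALPINO_HOME", "").strip()
    if server:
        return ("server", server)
    if home:
        return ("local", home)
    raise RuntimeError(
        "Set ALPINO_HOME (local install) or ALPINO_SERVER (server instance)"
    )
```

Calling alpino_mode() without arguments inspects the current process environment; passing a dictionary makes the check easy to try out without changing your shell.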
Step 2. The KafNafParserPy library can be installed through pip or from GitHub.
Once the previous two steps are complete, the last step is to clone this repository to your machine. If you use a local install, you need to tell the library where Alpino is installed by setting the environment variable ALPINO_HOME to the correct path on your machine:
export ALPINO_HOME=/home/a/b/c/Alpino
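Before running the parser with a local install, it can help to verify both dependencies from Python. This is a hedged sketch: check_setup is a hypothetical helper that is not part of this repository, and it only tests that the KafNafParserPy module is importable and that ALPINO_HOME names an existing directory. It does not verify that Alpino itself works.

```python
import importlib.util
import os


def check_setup(env=None, module_name="KafNafParserPy"):
    """Return a list of problems found; an empty list means none were found."""
    env = os.environ if env is None else env
    problems = []
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        problems.append("Python module %s is not importable" % module_name)
    home = env.get("ALPINO_HOME")
    if not home:
        problems.append("ALPINO_HOME is not set")
    elif not os.path.isdir(home):
        problems.append("ALPINO_HOME does not point to a directory: %s" % home)
    return problems
```

Printing the result of check_setup() gives one line per problem. If you rely on ALPINO_SERVER instead of a local install, the ALPINO_HOME messages can be ignored.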
The simplest way to call the parser is through the script run_parser.sh, which can be found in the root folder of the repository. It reads a NAF/KAF file from the input stream and writes the resulting NAF/KAF file to the output stream. The subfolder examples contains 2 example input files with their expected output files. From the command line, in the root folder of the repository, you can run:
cat examples/file1.in.kaf | run_parser.sh > my_output.kaf
The result in my_output.kaf should be the same as the file examples/file1.out.kaf (with the exception of the time stamps).
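Because the time stamps differ between runs, a plain diff of the two files will report differences. The sketch below compares two documents after removing time stamp attributes. It assumes that time stamps are stored as XML attributes whose name ends in "timestamp" (for example timestamp="..."); check this against your own output files. The function names are hypothetical.

```python
import re

# Assumption: time stamps are XML attributes whose name ends in "timestamp",
# matched case-insensitively, with a single- or double-quoted value.
_TIMESTAMP_ATTR = re.compile(r'''\b\w*timestamp\s*=\s*(".*?"|'.*?')''', re.IGNORECASE)


def strip_timestamps(xml_text):
    """Remove every timestamp-like attribute from the XML text."""
    return _TIMESTAMP_ATTR.sub("", xml_text)


def same_except_timestamps(text_a, text_b):
    """True if both texts are identical once timestamp attributes are removed."""
    return strip_timestamps(text_a) == strip_timestamps(text_b)


def files_match(path_a, path_b):
    """Compare two files on disk, ignoring timestamp attributes."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        return same_except_timestamps(fa.read(), fb.read())
```

For the example above, files_match("my_output.kaf", "examples/file1.out.kaf") would perform the comparison. This is a textual check, so any other run-dependent value, such as a version string, would still count as a difference.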
You can limit the time that Alpino spends parsing each sentence; the help text below gives the unit as minutes. Sentences that take longer than this value are skipped, and none of their tokens receive term, constituency or dependency information. The parameter to use is -t or --time.
You can get the full description of the parameters by calling python core/morph_syn_parser.py -h. You will see this information:
usage: morph_syn_parser.py [-h] [-v] [-t MAX_MINUTES]

Morphosyntactic parser based on Alpino

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -t MAX_MINUTES, --time MAX_MINUTES
                        Maximum number of minutes per sentence. Sentences that
                        take longer will be skipped and not parsed (value must
                        be a float)
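The time limit can also be passed when the script is started from another Python program. The sketch below builds the command line with the -t option and runs it with files attached to the standard streams. It assumes that core/morph_syn_parser.py reads from the input stream and writes to the output stream in the same way run_parser.sh does. The helper names build_parser_command and run_on_files are hypothetical.

```python
import subprocess
import sys


def build_parser_command(max_minutes=None,
                         script="core/morph_syn_parser.py",
                         python=sys.executable):
    """Build the argument list, adding -t only when a limit is given."""
    cmd = [python, script]
    if max_minutes is not None:
        # The help text asks for a float value.
        cmd += ["-t", str(float(max_minutes))]
    return cmd


def run_on_files(cmd, in_path, out_path):
    """Run cmd with in_path as standard input and out_path as standard output."""
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        return subprocess.run(cmd, stdin=fin, stdout=fout, check=False).returncode
```

For example, run_on_files(build_parser_command(0.5), "examples/file1.in.kaf", "my_output.kaf") would request a limit of half a minute per sentence, provided the call is made from the root folder of the repository.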
If you want to use this library from a Python module, you can import the main function and reuse it in other Python scripts. The function is called run_morph_syn_parser and is located in the script core/morph_syn_parser.py. It takes two parameters, an input and an output file, each of which can be a file name (string), an open file descriptor or a stream.
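A minimal sketch of such reuse is shown here. It assumes that the root folder of the repository is on sys.path, so that core/morph_syn_parser.py can be imported as core.morph_syn_parser. The wrapper function parse_file is hypothetical. The import is done lazily, so the helper can also be exercised with a stand-in callable when Alpino is not available.

```python
def parse_file(in_path, out_path, parser=None):
    """Parse in_path and write the result to out_path.

    parser is any callable taking (input, output). When omitted, the
    repository's run_morph_syn_parser is imported and used.
    """
    if parser is None:
        # Assumption: the repository root is on sys.path.
        from core.morph_syn_parser import run_morph_syn_parser as parser
    parser(in_path, out_path)
    return out_path
```

With the repository set up as described, parse_file("examples/file1.in.kaf", "my_output.kaf") passes both file names to run_morph_syn_parser.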
- Ruben Izquierdo
- Vrije University of Amsterdam
- [email protected] - [email protected]
- http://rubenizquierdobevia.com/