Implementation Details
- Available from Subversion https://gtsvn.uit.no/langtech/trunk/langs/kal/ as open source under the GPLv3 license
- Build scripts maintained and hosted by Giellatekno at the University of Tromsø
- Data maintained by Oqaasileriffik, primarily:
  - Per Langgård ([email protected])
  - Liv Molich ([email protected])
  - Najannguaq Nielsen ([email protected])
  - Paneeraq Nielsen ([email protected])
- Ready-to-use nightly builds available at https://apertium.projectjj.com/apt/nightly/pool/main/g/giella-kal/ - from the Debian package, extract the /usr/share/voikko/3/kl.zhfst file.
There are two main command-line tools for using the spell checker: hfst-ospell and libdivvun. Both are available as nightly builds for various Linux distros, Windows, and macOS via the Apertium build repository.
To install and run a basic plain-text spell checker session on Debian/Ubuntu, follow the steps below, which mirror those in the Docker image.
libdivvun is currently used for the web and HTML5 frontends for Google Docs and Microsoft Office.
Install everything:
$ sudo apt-get install wget ca-certificates
$ wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
$ sudo apt-get install giella-kal divvun-gramcheck
Run text through tokenizer and spell checker:
$ echo "Aajap biilinik misissuisoqartanginnera kamassutigigaa" | kal-tokenise | divvun-cgspell -u 1.0 -n 5 /usr/share/voikko/3/kl.zhfst
"<Aajap>"
"Aaja" Dial/Sgr Sem/Fem Sem/Hum Prop Rel Sg
"Aaja" Dial/Sgr Sem/Mask Sem/Hum Prop Rel Sg
"<biilinik>"
"biili" Dial/Ngr N Ins Pl
"biili" N Ins Pl
"<misissuisoqartanginnera>"
"misissuisoqartanginnera" ?
"misissuisoqartannginnera" <W:8> <WA:0> <spelled> "<misissuisoqartannginnera>"
"misissuisoqartannginnerai" <W:18> <WA:0> <spelled> "<misissuisoqartannginnerai>"
"misissuisoqartannginnerat" <W:18> <WA:0> <spelled> "<misissuisoqartannginnerat>"
"misissuisoqartuannginnera" <W:18> <WA:0> <spelled> "<misissuisoqartuannginnera>"
"misissuissoqartannginnera" <W:18> <WA:0> <spelled> "<misissuissoqartannginnera>"
"<kamassutigigaa>"
"kamassut" GE Der/nv Gram/TV V Par 3Sg 3SgO
"kamassutige" Gram/TV V Par 3Sg 3SgO
That's a lot more information than we currently need, so I wrote spell-stream.pl to trim the excess:
$ echo "Aajap biilinik misissuisoqartanginnera kamassutigigaa" | kal-tokenise | divvun-cgspell -u 1.0 -n 5 /usr/share/voikko/3/kl.zhfst | ./spell-stream.pl
Aajap
biilinik
misissuisoqartanginnera @spell <R:misissuisoqartannginnera> <R:misissuisoqartannginnerai> <R:misissuisoqartannginnerat> <R:misissuisoqartuannginnera> <R:misissuissoqartannginnera>
kamassutigigaa
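To illustrate the kind of trimming involved, here is a minimal Python sketch that condenses the divvun-cgspell output shown above into one line per token. It is not spell-stream.pl itself; the function name `condense` is hypothetical, and the rules it applies (a line of the form "<...>" opens a token, a reading containing <spelled> contributes a suggestion, a reading ending in ? with no suggestions becomes @unknown) are assumptions read off the sample output.

```python
import re

# Suggestion readings in the sample end with the suggested form as "<form>".
SUGGESTION_RE = re.compile(r'"<(.+)>"\s*$')


def condense(lines):
    """Return one condensed line per token from CG-style spell checker output."""
    cohorts = []  # each entry: [token, seen_unknown_reading, suggestions]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith('"<') and line.endswith('>"'):
            # A word form line such as "<Aajap>" opens a new token.
            cohorts.append([line[2:-2], False, []])
        elif cohorts:
            if "<spelled>" in line:
                match = SUGGESTION_RE.search(line)
                if match:
                    cohorts[-1][2].append(match.group(1))
            elif line.endswith("?"):
                # Assumption: a trailing ? marks a form the analyser rejected.
                cohorts[-1][1] = True

    out = []
    for token, unknown, suggestions in cohorts:
        if suggestions:
            marks = "@spell " + " ".join(f"<R:{s}>" for s in suggestions)
            out.append(f"{token}\t{marks}")
        elif unknown:
            out.append(f"{token}\t@unknown")
        else:
            out.append(token)
    return out
```

Passing `sys.stdin` to `condense` and printing each returned line would turn this into a pipeline filter in the same position as spell-stream.pl.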
hfst-ospell is currently used as the backend for the native Microsoft Windows and Office spellers.
Install everything:
$ sudo apt-get install wget ca-certificates
$ wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
$ sudo apt-get install giella-kal hfst-ospell
Run a single word through, asking for a number of possible suggestions:
$ echo "5 misissuisoqartanginnera" | hfst-ospell-office /usr/share/voikko/3/kl.zhfst
@@ hfst-ospell-office is alive
& misissuisoqartannginnera misissuisoqartannginnerai misissuisoqartannginnerat misissuisoqartuannginnera misissuissoqartannginnera
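A caller that drives hfst-ospell-office from another program has to pick the suggestions out of that reply. The Python sketch below handles only the two line types visible in the session above: the @@ status banner and the & line carrying suggestions. The function name `parse_office_reply` is hypothetical, splitting on whitespace is an assumption, and any other reply types the tool may emit are not covered.

```python
def parse_office_reply(output):
    """Return the suggestions from the first '&' line, or [] if there is none."""
    for line in output.splitlines():
        if line.startswith("@@"):
            continue  # status banner, e.g. "@@ hfst-ospell-office is alive"
        if line.startswith("&"):
            # Assumption: suggestions are whitespace-separated single words.
            return line[1:].split()
    return []
```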
For the Google Docs and Microsoft Word frontends, an HTTP service is needed. callback.php implements such a service by doing minimal forwarding to the Docker service, and it is running live on my server.
For example: https://tinodidriksen.com/spell/kal/callback.php?a=grammar&t=%3Cs1%3E%0aAajap%20biilinik%20misissuisoqartanginnera%20kamassutigigaa%0a%3C/s1%3E yields JSON output:
{"a":"grammar","c":"<s1>\n\nAajap\nbiilinik\nmisissuisoqartanginnera\t@spell <R:misissuisoqartannginnera> <AFR:misissuisoqartannginnera> <R:misissuisoqartannginnerai> <AFR:misissuisoqartannginnerai> <R:misissuisoqartannginnerat> <AFR:misissuisoqartannginnerat> <R:misissuisoqartuannginnera> <AFR:misissuisoqartuannginnera> <R:misissuissoqartannginnera> <AFR:misissuissoqartannginnera>\nkamassutigigaa\n\n</s1>"}
The <s1>...</s1> tag is explained below.
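As a rough illustration of how such a request URL can be assembled, the Python sketch below wraps paragraphs in numbered s-tags and percent-encodes them into the a and t query parameters seen in the example. The helper names `wrap_paragraphs` and `build_url` are hypothetical, and joining several wrapped paragraphs with a newline is an assumption; the source only shows a single paragraph sent this way.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the example URL above.
ENDPOINT = "https://tinodidriksen.com/spell/kal/callback.php"


def wrap_paragraphs(paragraphs):
    """Wrap each paragraph in <sN>...</sN>, numbering from 1."""
    return "\n".join(
        f"<s{i}>\n{text}\n</s{i}>" for i, text in enumerate(paragraphs, start=1)
    )


def build_url(paragraphs):
    """Build a GET URL with a=grammar and the wrapped text in t."""
    query = urlencode(
        {"a": "grammar", "t": wrap_paragraphs(paragraphs)},
        quote_via=quote,  # encode spaces as %20 rather than +
    )
    return f"{ENDPOINT}?{query}"
```

The encoded form differs cosmetically from the hand-written URL (for instance, the slash in the closing tag is also percent-encoded), but it decodes to the same text.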
- Available from Git https://github.com/GrammarSoft/proofing-gasmso as open source under the GPLv3 license
- Live for Google Docs at https://chrome.google.com/webstore/detail/kukkuniiaat/llbddkfhjlnlkmjegopnaiaifnbodgcd
- Live for Microsoft Word at https://appsource.microsoft.com/product/office/WA104382089
- Stream format documentation: https://github.com/GrammarSoft/proofing-gasmso/wiki/Backend-Stream-Format
The GASMSO frontend sends multiple whole paragraphs to the HTTP service. In order to keep track of which segments belong where, each paragraph is wrapped in s-tags, e.g. <s1>...</s1> <s2>...</s2>. The returned data is a verticalized and annotated version of the input. Each token is on a line of its own; if a token has any error markings, the token is followed by a single tab character and then the markings. Error type markings are @-prefixed, and spelling corrections are <R:...> tags. Greenlandic currently has only 2 error types: @spell for errors with corrections, and @unknown for errors where no corrections can be found. Having the whole text in the output helps with matching the text to the source, as opposed to offsets, which vary wildly.
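A minimal Python sketch of reading that format follows. It groups tokens by paragraph number and pulls out the @-prefixed error types and <R:...> corrections after the tab. The function name `parse_stream` is hypothetical and this is not the frontend's actual parser; any other tags on a line, such as the <AFR:...> tags in the JSON example, are simply ignored here.

```python
import re

S_OPEN = re.compile(r"^<s(\d+)>$")
S_CLOSE = re.compile(r"^</s(\d+)>$")
ERROR_TYPE = re.compile(r"@\S+")
CORRECTION = re.compile(r"<R:([^>]*)>")


def parse_stream(stream):
    """Map paragraph number to a list of (token, error_types, corrections)."""
    paragraphs = {}
    current = None
    for line in stream.split("\n"):
        if not line:
            continue
        opened = S_OPEN.match(line)
        if opened:
            current = paragraphs.setdefault(int(opened.group(1)), [])
            continue
        if S_CLOSE.match(line):
            current = None
            continue
        if current is None:
            continue  # text outside any s-tag is skipped
        # The token comes first; markings, if any, follow a single tab.
        token, _, marks = line.partition("\t")
        current.append(
            (token, ERROR_TYPE.findall(marks), CORRECTION.findall(marks))
        )
    return paragraphs
```

Given a JSON response body, `parse_stream(json.loads(body)["c"])` would yield the per-paragraph token lists, which can then be matched back to the original text token by token.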