alexandretessarollo/sensetion.el

 
 

Word-sense annotation in Emacs

This project intends to finish the word-sense annotations of the WordNet gloss corpus.

[Screenshots: static/sensetion-menu.png, static/sense-menu.png]

Installation

See the INSTALL file.

Customization

sensetion-sense-menu-show-synset-id
Set this variable to a non-nil value (such as t) to show the synset ids in the list of synsets during annotation.
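
A minimal sketch of turning this option on from your init file. It assumes the variable is an ordinary user option that can be set with setq and that t is an acceptable value; the variable name is the one given above.

```elisp
;; Assumption: a plain boolean option, so `setq' is enough.
;; Show synset ids next to each entry in the sense menu.
(setq sensetion-sense-menu-show-synset-id t)
```

The same option should also be reachable interactively through M-x customize-group RET sensetion.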

Usage

Use M-x sensetion to start annotating. The first time you call it, it might take some time to index the files (you can do other things in Emacs while it works); on later calls it reads the index files and starts the annotation process.

M-x sensetion will ask you for a lemma and a PoS tag. You can press TAB for completion of the lemma. This will build a buffer with the instances of this lemma and PoS tag that have not been annotated yet. Unannotated words will show as red/pink, while previously-annotated tokens will show as green, and newly-annotated tokens will show as blue (all these colors can be customized by the user). For annotated tokens, dark colors indicate confidence in the annotation. You can navigate through annotatable tokens with < and >.

If you wish to sense-tag a token, press / on it. You may select one or many senses – or no sense at all – by pressing the appropriate keyboard key. Senses which are already selected are prefixed by a plus sign (+); when satisfied, press enter/return. If you’d like to quit, press q; note that quitting does not undo anything (if you selected an option and then quit, its effects were already carried out and saved). You can see how many tokens still need to be annotated in the mode-line, next to the sensetion indicator.

If you wish to change a token’s lemma use l. If you wish to say a token is not annotatable (i.e., ignore it), use i. If you wish to say you are unsure about an annotation, use ?.

There is support for word collocations, such as phrasal verbs. The tokens that are part of a collocation are united by a key, which is shown in their bottom left corner. You can unglob a collocation by pressing u on any token of the collocation. To glob tokens, mark them with m and then press g to create the collocation. If you marked a token by mistake, you can unmark it by pressing m again. If you try to edit the lemma of a token that is part of a collocation, you will be asked whether you would like to edit the token itself or its collocation.

You can move sentences up or down with C-↑ and C-↓. Clustering tokens with the same sense together might be useful.

Note that you can customize most things (like annotation colors) with M-x customize-group RET sensetion.

Command summary

You can call a command using M-x <command-name> or by pressing its keybinding. If you find that there are too many editing and navigation commands to memorize, you only need to remember the command s, which opens a menu listing all the other commands.

| command name                | key binding | description                                                     |
|-----------------------------|-------------|-----------------------------------------------------------------|
| sensetion                   | -           | Start sensetion annotation process.                             |
| sensetion-annotate          | -           | Start annotating a new lemma/PoS tag.                           |
| sensetion-edit-synset       | .           | Edit sentence source data (be careful!)                         |
| sensetion-edit-sense        | /           | Annotate sense of selected token at point                       |
| sensetion-edit-lemma        | l           | Annotate lemma of token at point                                |
| sensetion-edit-ignore       | i           | Mark token as to be ignored in the annotation process           |
| sensetion-edit-unsure       | ?           | Mark annotation as done with little confidence                  |
| sensetion-toggle-glob-mark  | m           | Mark/Unmark token for globbing                                  |
| sensetion-glob              | g           | Glob the marked tokens as a new collocation and ask for its lemma |
| sensetion-unglob            | u           | Unglob collocation of token at point (removes collocation)      |
| sensetion-toggle-scripts    | v           | Show/hide super/subscripts                                      |
| sensetion-next-selected     | >           | Move point to next selected token                               |
| sensetion-previous-selected | <           | Move point to previous selected token                           |
| sensetion-move-line-up      | C-↑         | Move sentence up                                                |
| sensetion-move-line-down    | C-↓         | Move sentence down                                              |

Indexing

If the index goes out of sync, you can force it to be rebuilt with M-x sensetion-make-index.

Saving your work

  • Any annotations are saved to their files at the moment they are done.
  • The updated index is saved by default to your annotation directory in the file .sensetion-index when you quit Emacs gracefully (which is why Emacs may hang briefly on exit). This path is customizable.

Seeing your work

We recommend (although it is not strictly necessary) setting up a git repository for the annotation files (see any git tutorial if you are unfamiliar with it). Use

git diff --color-words=.

(note the period .) to see the changes you made after the previous commit.

In any case, please back up your work!

Report bugs

  • Give clear instructions to reproduce the bug;
  • Call M-x toggle-debug-on-error, reproduce the bug, and send the backtrace with your report (you may open an issue).
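
If the error happens early, for example while sensetion is loading, it may be easier to enable the debugger from your init file. The sketch below uses debug-on-error, the standard Emacs variable that M-x toggle-debug-on-error flips; placing it before the sensetion configuration is a suggestion, not a requirement stated by the project.

```elisp
;; Enter the debugger and show a backtrace whenever an error is signaled.
;; Remove this line (or set it back to nil) once the backtrace is collected.
(setq debug-on-error t)
```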

FAQ – Frequently Asked Questions

How can I copy and paste annotation text without super/subscripts?

You can assign the function org-copy-visible to your copying command key in the annotation buffers by adding these two lines to your sensetion use-package declaration:

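(use-package sensetion
  :commands sensetion
  ;; The two lines below are the addition: in annotation buffers,
  ;; M-w copies only the visible text (no super/subscripts).
  :bind (:map sensetion-mode-map
              ("M-w" . org-copy-visible)))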

Annotation format

We convert the original XML files to property lists, whose grammar can be found in glosstag/grammar.txt.

The script that converts the original XML WN gloss corpus is at convert.sh. To re-run the conversion:

  • download the WordNet gloss corpus;
    • you can validate it with the xmllint utility from the libxml package:
      xmllint --dtdvalid dtd/glosstag.dtd merged/*.xml
              
  • download and set up SBCL (although any Common Lisp implementation should work);
  • set up Quicklisp;
  • create a symbolic link to the glosstag directory inside quicklisp/local-projects/;
  • run:
    ./convert.sh ~/WordNet-3.0/glosstag/merged/ DESTINATION-PATH/
        

    where the first parameter is a directory from the gloss corpus archive and the last parameter is the directory where you want to put the converted files. (Use absolute paths if you have problems with the command.) Note that the trailing slash in glosstag/ is important. You must have the glosstag DTD in the same directory as the annotation files.

Status

Under heavy development – user interface is unstable, and the code is still to be generalized so that it can be made useful for annotation of other corpora (maybe even of other stuff).
