Adding term normalization / ontology lookup feature to DataHarmonizer #406

Open
ddooley opened this issue Aug 29, 2023 · 1 comment

ddooley (Collaborator) commented Aug 29, 2023

We want functionality, tied to the LinkML specification for a field, that allows selected terms from one or more ontology branches, or terms cherry-picked from those branches. Beyond this, we also need the ability, while editing a cell, to dynamically look up a list of closely related terms so that a user can normalize free text (containing one or more terms) into a list of ontology IDs.
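
To make the cell-level lookup concrete, here is a minimal sketch of normalizing free text into ontology IDs. It is only an illustration under stated assumptions: the `LABELS` table, its `EX:` IDs, the `normalize` function, and the separator and cutoff choices are all hypothetical, and it uses Python's standard `difflib` rather than LexMapr or OAK.

```python
import re
from difflib import get_close_matches

# Hypothetical label -> ID table. In practice this would be built from the
# ontology branches (or cherry-picked terms) named in the field's specification.
LABELS = {
    "chicken breast": "EX:0000001",
    "chicken": "EX:0000002",
    "ground beef": "EX:0000003",
    "lettuce": "EX:0000004",
}


def normalize(text, cutoff=0.8):
    """Split free text on , ; / and map each piece to (piece, term_id, label).

    Pieces with no sufficiently close label are returned with None values so
    the user can see what still needs manual curation.
    """
    results = []
    for piece in re.split(r"[,;/]", text):
        piece = piece.strip().lower()
        if not piece:
            continue
        # Closest label by difflib similarity; an empty list means no match.
        match = get_close_matches(piece, LABELS, n=1, cutoff=cutoff)
        if match:
            results.append((piece, LABELS[match[0]], match[0]))
        else:
            results.append((piece, None, None))
    return results


if __name__ == "__main__":
    for row in normalize("Chicken breast; letuce, unknown thing"):
        print(row)
```

In an editor, the candidates for each piece could be offered as a pick list (raising `n` above 1) rather than accepted automatically, which keeps the user in control of the final ontology IDs.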

LexMapr was our older software for doing this, and with a revamp it could perhaps be continued. There are also the OAK "annotate" and "lexmatch" commands we could try. Notes on these commands are in the OBO Foundry Slack group: https://obo-communitygroup.slack.com/archives/C03D93DEALA/p1692891065256629?thread_ts=1692889350.359419&cid=C03D93DEALA

Chris Mungall: Technically, annotate is a bit more general, as it finds term matches within a whole text. But if you pass --matches-whole-text, it essentially does lexmatch as a degenerate case.

Chris Mungall: But note the output structures are different. Using lexmatch gives you SSSOM (by default), which is obviously very well designed and thought through. I have not been able to gather interest in a profile of SSSOM for lexical matches, but that was before we had the ISB workshop talking about matching literals: https://github.com/mapping-commons/sssom/issues/155
[#155 is there interest in an analog of SSSOM for NER/CR/text annotation?](https://github.com/mapping-commons/sssom/issues/155)
There are a number of different tools that perform NER on text, from bioportal/zooma through to scispacy and [@cthoyt](https://github.com/cthoyt)'s Gilda (https://www.biorxiv.org/content/10.1101/2021.09.10.459803v1.full).
These all vary in their output, but each is some variant of text span location and ID plus metadata for the matched concept.
While the entity normalization step of NER could be seen as term matching, I think this is out of scope for SSSOM. However, I think it would make sense to have an SSSOM analog, where the SSSOM metadata element URIs are reused.
An OAK driven app: https://incatools.github.io/ontology-access-kit/datamodels/text-annotator/index.html  
[https://github.com/…](https://github.com/INCATools/ontology-access-kit/blob/main/src/oaklib/datamodels/text_annotator.yaml) 
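
The difference between matching terms anywhere in a text and matching only when the whole text is the term, along with the "span location and ID plus metadata" output shape, can be sketched as follows. This is a toy illustration, not OAK's implementation or data model: the `Annotation` fields, the `LABELS` table with its `EX:` IDs, and the `matches_whole_text` parameter name are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    start: int        # offset of the first matched character
    end: int          # offset one past the last matched character
    match_text: str   # the text exactly as it appeared in the input
    term_id: str      # ID of the matched concept
    label: str        # label of the matched concept


# Hypothetical labels and IDs, for illustration only.
LABELS = {"chicken": "EX:0000002", "lettuce": "EX:0000004"}


def annotate(text, matches_whole_text=False):
    """Find label occurrences in text, returned in order of position.

    With matches_whole_text=True, only a match spanning the entire input is
    kept, which reduces span annotation to plain term matching.
    """
    annotations = []
    for label, term_id in LABELS.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            covers_all = m.start() == 0 and m.end() == len(text)
            if matches_whole_text and not covers_all:
                continue
            annotations.append(
                Annotation(m.start(), m.end(), m.group(0), term_id, label)
            )
    return sorted(annotations, key=lambda a: a.start)


if __name__ == "__main__":
    print(annotate("Chicken and lettuce wrap"))
    print(annotate("Chicken and lettuce wrap", matches_whole_text=True))
    print(annotate("lettuce", matches_whole_text=True))
```

For a spreadsheet cell, the whole-text mode suits single-term fields, while the span mode suits free-text fields that may mention several terms.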

cpauvert (Contributor) commented

Hi @ddooley,
Thanks a lot for the great work in crafting and maintaining DataHarmonizer. I fully support such a feature! We showcased DataHarmonizer in a workshop on metadata and ontologies aimed at complete beginners, and most of them asked whether an ontology lookup from within DataHarmonizer was possible.
Best,
