feat: taxonomy enhancer #11267

Open
wants to merge 5 commits into
base: main

Conversation

benbenben2
Collaborator

What

Based on the following comment on Slack:

There is a single language used to parse ingredients in Open Food Facts (based on ingredient_text, and leading to ingredients, ingredients_lc, ingredients_analysis, and more key values in the api)
All languages for a product are given in ingredients_text_
That might already have been discussed but would that be an option to parse ingredients of all languages?
Eventually, given some conditions on ingredients_text_ (no quality errors/warning like @ symbol for example)
It would move us gradually toward auto-completion of the ingredients taxonomy. Example: the same product has 3 ingredients in English, all known, and 3 ingredients in Greek, only 2 of them known. If the positions of those 2 match the positions of their English translations, we could safely add the unknown Greek ingredient to the ingredients taxonomy, through a PR so that it can be reviewed.
We could start with very strict conditions like for this example no more than 3 ingredients + only one unrecognized ingredient.
We could also create a quality info when the text is much longer in one language than the ingredients lists in all other languages. That would help to find stop words, instead of waiting for contributors in each language.
It would also improve data quality.
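
The position-based rule from the Slack comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of this PR: the function name `infer_missing_translation`, its parameters, and the convention that an unrecognized ingredient has the id `None` are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def infer_missing_translation(
    reference_ids: Sequence[Optional[str]],
    target_ids: Sequence[Optional[str]],
    target_texts: Sequence[str],
    max_ingredients: int = 3,
) -> Optional[Tuple[str, str]]:
    """Return (unknown target text, suggested taxonomy id), or None.

    Hypothetical helper: ids are taxonomy ids such as "en:sugar",
    and None stands for an ingredient the parser did not recognize.
    """
    # All three lists must describe the same number of ingredients.
    if not (len(reference_ids) == len(target_ids) == len(target_texts)):
        return None
    # Strict starting condition: only very short lists are considered.
    if len(reference_ids) > max_ingredients:
        return None
    # Every ingredient of the reference language must be known.
    if any(ref_id is None for ref_id in reference_ids):
        return None
    # Exactly one ingredient of the target language may be unknown.
    unknown_positions = [i for i, tid in enumerate(target_ids) if tid is None]
    if len(unknown_positions) != 1:
        return None
    # Known target ingredients must sit at the position of their translation.
    for ref_id, tid in zip(reference_ids, target_ids):
        if tid is not None and tid != ref_id:
            return None
    position = unknown_positions[0]
    return target_texts[position], reference_ids[position]
```

With three known English ids and a Greek list whose second entry is unrecognized, the helper returns that Greek text together with the id found at the same position in the English list. Any other situation returns `None`, which matches the intent of starting with very strict conditions.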

The present PR aims to tackle the following:

  • detect stop words before the ingredients list (missing_stop_words_before) and, if any are found for that language, stop there; otherwise
  • detect stop words after the ingredients list (missing_stop_words_after) and, if any are found for that language, stop there; otherwise
  • compare ingredients one by one to detect:
    • a translation missing from the taxonomy (missing_ingredients)
    • a typo in the ingredients list (which could have been a data quality error) or a typo in the taxonomy (ingredients_typo)
    • an ingredient whose ids differ between languages, where those ids have no parent/child relationship (mismatch_in_taxonomy). This last check is ignored if it yields more than 1 occurrence, because that can happen when the product has a new version of its ingredients list and one language has the new version while another still has the old one. In that case the ingredient ids overlap, but if the order of the ingredients has changed the comparison finds many mismatches; see the 8014190017627 example below.
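
The mismatch_in_taxonomy check described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation in TaxonomiesEnhancer.pm: the function name `find_taxonomy_mismatches` and the `parents` mapping (taxonomy id to the set of its ancestor ids) are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple


def find_taxonomy_mismatches(
    ids_a: Sequence[Optional[str]],
    ids_b: Sequence[Optional[str]],
    parents: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Compare two languages position by position.

    Hypothetical helper: `parents` maps a taxonomy id to its ancestor
    ids, and None marks an ingredient without a known id.
    """

    def related(a: str, b: str) -> bool:
        # Same id, or one id is an ancestor of the other.
        return a == b or a in parents.get(b, set()) or b in parents.get(a, set())

    mismatches = [
        (a, b)
        for a, b in zip(ids_a, ids_b)
        if a is not None and b is not None and not related(a, b)
    ]
    # More than one mismatch suggests the two languages describe different
    # versions of the product, so nothing is reported in that case.
    return mismatches if len(mismatches) == 1 else []
```

A single pair of unrelated ids at the same position is reported. When two lists contain the same ids in a different order, as in the 8014190017627 example, several pairs mismatch at once and the sketch returns an empty list.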

Integration
For development purposes, it uses facets similar to the data quality facets. Adding a route like https://hr.openfoodfacts.org/data-quality-errors would make these new facets easy to review.

These new facets cannot be data quality facets because a contributor cannot fix the underlying issues by editing products; one needs to modify the taxonomy files or the ProductOpener code.

Screenshot

Screenshot_20250118_214614

Related issue(s) and discussion

  • Fixes #-none-

@benbenben2 benbenben2 added 🧬 Taxonomies https://wiki.openfoodfacts.org/Global_taxonomies 🧬 Taxonomies - Translation 🥗 Ingredients labels Jan 18, 2025
@benbenben2 benbenben2 self-assigned this Jan 18, 2025
@benbenben2 benbenben2 requested a review from a team as a code owner January 18, 2025 21:02
@github-actions github-actions bot added 🧪 tests dependencies Pull requests that update a dependency file Products labels Jan 18, 2025
@benbenben2
Collaborator Author

Some examples

  • 5900552071563

    blueberry juice from concentrate (English) [does not contain Polish]
    jagodowy (Polish) [English id for this word is bilberry]

  • 8076809513654

    "sucre, basilic 0, 2%, ail." -> "en:ingredients-basilic 0-is-new-translation-for-basilic"
    this is a parsing issue caused by the space after the comma
    can be fixed by editing the product
    can be handled by improving ProductOpener

  • 8014190017627

    "en:ingredients-taxonomy-between-fagiolini-and-mrkva-should-be-same-id"
    it: patate, zucchine, FAGIOLINI, CAROTE
    hr: krumpir, tikvice, mrkva, zeleni grah
    "taxonomies_enhancer_tags":
    [
    "en:ingredients-taxonomy-between-fagiolini-and-mrkva-should-be-same-id",
    "en:ingredients-taxonomy-between-bosiljak-and-grana-padano-should-be-same-id",
    "en:ingredients-taxonomy-between-lisozim-iz-jaja-and-sale-should-be-same-id",
    "en:ingredients-taxonomy-between-caglio-and-sol-should-be-same-id",
    "en:ingredients-taxonomy-between-mlijeko-and-sale-should-be-same-id",
    "en:ingredients-taxonomy-between-basilico-and-pesto-should-be-same-id",
    "en:ingredients-taxonomy-between-konzervans-and-olio-di-semi-di-girasole-should-be-same-id",
    "en:ingredients-taxonomy-between-lisozima-da-uovo-and-sirilo-should-be-same-id",
    "en:ingredients-taxonomy-between-pesto-and-tjestenina-od-durum-pšenice-should-be-same-id"
    ]

    -> the ingredients list has probably been updated by the producer, but only 1 language has been updated in Open Food Facts. The order of the ingredients has changed: although most of the ingredients in list 1 are also in list 2 (more than 50%), the same ids are not at the same positions, which leads to many errors.
    -> as described in the PR description above, nothing is shown when there is more than 1 error, to avoid this. So this list of facets will no longer appear. I leave it here only as an example.

  • 8712566328352

    "en:ingredients-hu:vanílla-darabkák-is-new-translation-for-en:ground-vanilla-beans"
    -> correct, it could be
    "en:ingredients-hu:szinezék-is-possible-typo-for-hu:színezék"
    -> correct, there is a typo
    "en:ingredients-hu:glükózsirup-is-possible-typo-for-hu:glükózszirup"
    -> correct, there is a typo

  • 5901069000817

    "en:ingredients-taxonomy-between-including-celeriac-id:en:celeriac-and-w-tym-seler-id:en:celery-should-be-same-id"
    -> correct, it could be

  • 2008080011938

    "en:ingredients-taxonomy-between-pork-skins-id:en:pork-skin-and-skórki-wieprzowe-id:en:pork-rind-should-be-same-id"
    -> correct, it could be or one could be child of the other

  • 5900552071563

    "en:ingredients-taxonomy-between-blueberry-juice-from-concentrate-id:en:blueberry-juice-from-concentrate-and-jagodowy-id:en:bilberry-should-be-same-id"
    -> correct, jagodowy is blueberry (https://www.mercato.com/item/lowicz-dzem-jagodowy-jam-blueberry-280-grams/985629)

  • 2595000004335

    "en:ingredients-en:pork-ham-meat-is-possible-typo-for-en:pork-ham"
    -> PL: Mięso z szynki wieprzowej (id: en: ham, pork ham)
    -> EN: Pork ham meat (no id)
    -> need to add pork ham meat as a synonym for pork ham.

  • 5202390020407

    "en:ingredients-it:gluconodeltalattone-is-possible-typo-for-it:gluconedeltalattone"
    IT: the typo is not in the product but in the taxonomy
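
The "is-possible-typo-for" examples above (szinezék / színezék, glükózsirup / glükózszirup, gluconodeltalattone / gluconedeltalattone) each differ by a single character. The discussion does not state how the PR detects typos, so the following is only a sketch of one possible approach, assuming a Levenshtein edit distance of at most 1; the names `edit_distance` and `possible_typo_for` are hypothetical.

```python
import unicodedata
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, compared in NFC form."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def possible_typo_for(
    unknown: str, known_entries: Sequence[str], max_distance: int = 1
) -> Optional[str]:
    """Return the single known entry close to `unknown`, else None.

    Assumption: a suggestion is made only when exactly one entry is
    within `max_distance`, to avoid ambiguous proposals.
    """
    candidates = [
        entry
        for entry in known_entries
        if 0 < edit_distance(unknown, entry) <= max_distance
    ]
    return candidates[0] if len(candidates) == 1 else None
```

A distance check like this cannot tell which side holds the typo. As the 5202390020407 example shows, the misspelling may be in the taxonomy rather than in the product, so each suggestion still needs a human review. It also would not cover cases such as "pork ham meat", where the fix is a new synonym rather than a spelling correction.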

@codecov-commenter

codecov-commenter commented Jan 18, 2025

Codecov Report

Attention: Patch coverage is 79.89950% with 40 lines in your changes missing coverage. Please review.

Project coverage is 49.62%. Comparing base (c0518c1) to head (c921be5).

✅ All tests successful. No failed tests found.

Files with missing lines                  Patch %   Lines
lib/ProductOpener/TaxonomiesEnhancer.pm   80.10%    1 Missing and 38 partials ⚠️
lib/ProductOpener/Products.pm             66.66%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11267      +/-   ##
==========================================
+ Coverage   49.33%   49.62%   +0.28%     
==========================================
  Files          79       80       +1     
  Lines       22508    22707     +199     
  Branches     5388     5458      +70     
==========================================
+ Hits        11105    11269     +164     
+ Misses      10042    10041       -1     
- Partials     1361     1397      +36     



2 participants