Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: dq_all_val_in_nutrition_are_identical #9320

Merged
merged 8 commits into from
Dec 5, 2023

Conversation

benbenben2
Copy link
Collaborator

@benbenben2 benbenben2 commented Nov 14, 2023

What

New error, when all values in the nutrition table are equal.
Note that it can happen only when they are equal to 0 because otherwise salt and sodium values are automatically calculated from each other's value.

Additionally, not coherent naming in the data quality taxonomy (Error vs errrors) + noticed that the first part of the file lists errors and second part lists warnings, but some errors were listed in the second part, so I moved them up to the first part.

Screenshot

Screenshot_20231114_234657

Link to deployed facet

https://world.openfoodfacts.org/data-quality-error/nutrition-values-are-all-identical

Related issue(s) and discussion

@benbenben2 benbenben2 self-assigned this Nov 14, 2023
@benbenben2 benbenben2 requested a review from a team as a code owner November 14, 2023 22:51
@github-actions github-actions bot added 🧬 Taxonomies https://wiki.openfoodfacts.org/Global_taxonomies 🧽 Data quality https://wiki.openfoodfacts.org/Quality 🧪 tests labels Nov 14, 2023
@codecov-commenter
Copy link

codecov-commenter commented Nov 14, 2023

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (d3f8740) 44.30% compared to head (bfce7ff) 48.99%.
Report is 4 commits behind head on main.

Files Patch % Lines
lib/ProductOpener/Display.pm 33.33% 13 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9320      +/-   ##
==========================================
+ Coverage   44.30%   48.99%   +4.69%     
==========================================
  Files          64       66       +2     
  Lines       20333    20417      +84     
  Branches     4891     4903      +12     
==========================================
+ Hits         9008    10004     +996     
+ Misses      10150     9144    -1006     
- Partials     1175     1269      +94     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

# $nutriments_values_occurences_max_key = $key;
}
}
# raise error if all values are identical (this can only apply when they are 0 (because salt or sodium are automatically generated from each others))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then what would be the difference with the already existing en:Nutrition all values zero ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not see that before, actually. That already existing dq error facet has nothing: https://world.openfoodfacts.org/data-quality-error/en:all-nutrition-values-are-set-to-0
Because:
1)

if ($nid !~ /fruits-vegetables-nuts-estimate-from-ingredients/) {
	if ($product_ref->{nutriments}{$nid} == 0) {
		$nid_zero++;
	}
	else {
		$nid_non_zero++;
	}
}

does not exclude "fruits-vegetables-legumes-estimate-from-ingredients_100g"

comment: instead of adding this exclusion, that would be better to list which main nutrients we want to select. That way if tomorrow there is a new fruits-something-blabla_100g we will not have problem if we forget to update that condition.

there is a typo:
($nid_zero == $nid_n)
where
$nid_zero: number of $nid (nutrient) that contains "_100g" and equal to 0. That number is at most all "_100g" minus fruits-vegetables-nuts-estimate-from-ingredients_100g.
$nid_n: all nutrient including "_100g" as well as without "_100g". Example for proteins: "proteins", "proteins_100g", "proteins_unit", "proteins_value".

According to my comment, I am removing this dq facet to replace it by the one implemented in this PR.

push @{$product_ref->{data_quality_warnings_tags}}, "en:nutrition-3-or-more-values-are-identical";
last;
}
}

# retrieve max number of occurences
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you can do something simpler that would also work when people put a non zero value (e.g. 1) in all fields:

Just check if you have only 1 key, or you have 2 keys and 1 of them is "salt_100g" or "sodium_100g". (and also that you have at least 3 or 4 entries in @major_nutriments_values)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did something like that. But, because it is an error, I would not add that part "and also that you have at least 3 or 4 entries" for now.

Note that there are not much more rows if we count salt or sodium instead of both (about 10 more): mirabelle

@benbenben2
Copy link
Collaborator Author

I do not understand why test needed to be updated for that part:
"compared_to_category" : "en:cereals-and-potatoes",

@benbenben2
Copy link
Collaborator Author

benbenben2 commented Nov 15, 2023

Note for later (maybe @CharlesNepote could be interested for Data Quality).

If we count energy kcal or kj instead of both that leads to 10000 rows: mirabelle

With many false positives:
https://world.openfoodfacts.org/product/0896161001235/alkaline-water
https://world.openfoodfacts.org/product/8001120715456/te-nero-zenzero-e-scorze-di-limone-biologico-coop
https://world.openfoodfacts.org/product/0086854052747/canola-oil-cooking-spray-laura-lynn
https://world.openfoodfacts.org/product/0011141107442/coffee-beans
https://world.openfoodfacts.org/product/8720354187838/eau-mineral-petillante-hema

  • Some of them are kj=1 kcal 0. ... In that case we could count only if kj is higher than 4 (because 4.2 kj = 1 kcal) which would imply kcal = 1 at least
  • We could add category whitelist... (waters, coffees, ).. although many of the product listed above do not have category.
  • Most of these false positives are in the u.s. where only cal (kcal) is provided (can be all 0 because it is per portion). So we are good if we keep counting both kj and kcal because kj is missing for these products (field should be empty)

Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@stephanegigandet stephanegigandet merged commit f18bf44 into main Dec 5, 2023
14 checks passed
@stephanegigandet stephanegigandet deleted the dq_all_val_in_nutrition_are_identical branch December 5, 2023 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🧽 Data quality https://wiki.openfoodfacts.org/Quality 🧬 Taxonomies https://wiki.openfoodfacts.org/Global_taxonomies 🧪 tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prevention - Nutrition - Reject edits where all entered nutrition values are 0
3 participants