Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate and update CSV and RDF exports through a PostgreSQL table #10326

Open
stephanegigandet opened this issue May 20, 2024 · 1 comment
Open
Labels
Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data postgresql

Comments

@stephanegigandet
Copy link
Contributor

It currently takes 4 hours every day (with 1 cpu used all time) to generate the 4 CSV and RDF exports:

-rw-r--r--   1 off  off    913482534 May 20 08:31 en.openfoodfacts.org.products.csv.gz
-rw-r--r--   1 off  off    925173613 May 20 08:33 fr.openfoodfacts.org.products.csv.gz
-rw-r--r--   1 off  off    457025800 May 20 08:35 en.openfoodfacts.org.products.rdf.gz
-rw-r--r--   1 off  off    457007947 May 20 08:36 fr.openfoodfacts.org.products.rdf.gz

(+ a long query to MongoDB to go through all products)

The export is done with this script: https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/export_database.pl

The script goes through products one by one, copy some fields, converts some fields (e.g. UNIX timestamps to ISO 8601), adds French and English translations for a few tag fields (categories, labels etc.), flattens some arrays into strings, and that's it.

I'm thinking that instead of going through all products in MongoDB every day, we could instead maintain a separate database similar to the one used in https://github.com/openfoodfacts/openfoodfacts-query with the same columns we export in CSV. We would then update only changed products using Redis.

There is also a RDF export that we generate at the same time as the CSV, one solution could be to generate and store the RDF for each product in a separate column.

@github-project-automation github-project-automation bot moved this to To discuss and validate in 🍊 Open Food Facts Server issues May 20, 2024
@teolemon teolemon added Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data postgresql ✨ Feature Features or enhancements to Open Food Facts server labels May 20, 2024
@john-gom
Copy link
Contributor

I did a bit of an experiment here using the Postgres database that contains all the product data, as previously mentioned here: #8620

You can get a lot of the data needed using the standard Postgres COPY TO command, e.g.

copy (
select 
  code
  ,'http://world-en.openfoodfacts.org/product/' || code url
,data->'created_t' created_t
,to_char(to_timestamp((data->>'created_t')::int), 'YYYY-MM-DD"T"HH24:MI:SS"Z"') created_datetime
,data->'last_modified_t' last_modified_t
,to_char(to_timestamp((data->>'last_modified_t')::int), 'YYYY-MM-DD"T"HH24:MI:SS"Z"') last_modified_datetime
,data->>'product_name' product_name
 ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'packaging_tags')),', ') packaging_tags
 ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'brands_tags')),', ') brands_tags
 ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'categories_tags')),', ') categories_tags
,data->>'ingredients_text' ingredients_text
,data->'ecoscore_data'->'score' ecoscore_score
,data->'ecoscore_data'->>'grade' ecoscore_grade
 ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'data_quality_errors_tags')),', ') data_quality_errors_tags
,data->'nutriments'->'energy-kj_100g' "energy-kj_100g"
,data->'nutriments'->'energy-kcal_100g' "energy-kcal_100g"
,data->'nutriments'->'energy_100g' "energy_100g"
from product
limit 1000
) to '/mnt/e/products.tsv' DELIMITER E'\t' null as ' ' CSV HEADER;

However, there are a number of issues:

  1. Even that limited query would take about 4 hours to run locally (it took about 5 seconds to do 1000 records)
  2. Without the taxonomies loaded it isn't possible to generate the English and French translations of the tags
  3. Special processing on things like the URL (appending an escaped version of the product name) would be tricky
  4. Null handling isn't ideal
  5. It's more difficult to trim out unwanted characters

From my previous experience with off-query I think that storing the data in a relational format rather than JSON will fix the performance issue (I did a small experiment and 1000 records took about 100ms using this approach).

My suggestion would be that we pre-process the data before loading into the off-query database to resolve issues 2, 3 and 5.

In addition, in order to address 2, we would need the taxonomies to be loaded into off-query (I have some code to do that) and we would need to canonicalize all of the tag values of the product before saving so that we can do a direct SQL join to the taxonomy translations table.

Adding this extra processing in the off-query import code will slow down the import a bit, but shouldn't have a big impact on the incremental sync from Redis.

In summary, the work items to address this using an off-query approach would be:

  • Add the extra fields needed to the off-query database
  • Add code to off-query to ensure that all taxonomy / tag values are canonicalized
  • Import taxonomies into off-query with a mechanism to refresh (e.g. when PO is deployed or just every night)
  • Code up the scheduled export from off-query

@teolemon teolemon removed the ✨ Feature Features or enhancements to Open Food Facts server label Oct 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data postgresql
Projects
Status: To discuss and validate
Development

No branches or pull requests

3 participants