Provide a fast and simpler way to get up-to-date CSV file export #1670

Open
Tracked by #6429 ...
CharlesNepote opened this issue Feb 18, 2019 · 3 comments
Labels
Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data environment P2 🚅 Performance

Comments

@CharlesNepote
Member

CharlesNepote commented Feb 18, 2019

Summary:

A CSV file is an easy way to consume OFF data. As of today, users have to download the whole export (2 GB!) every time they want fresh data. This takes a long time and consumes precious OFF server resources.

Expected behaviour:

Quick download based on diff or decentralized resources (peer-to-peer).

To be investigated

zsync investigation

zsync looks promising:

  • very simple to set up and use
  • fast: less than 10 seconds to build the .zsync file for the 2 GB Open Food Facts CSV export
  • can save up to 99% bandwidth (see this benchmark)
  • zsync takes less CPU time than rsync
  • zsync downloads use HTTP (and thus can be logged)
  • zsync is used by many Linux distributions to update live ISO images
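
As a rough, best-case illustration of the bandwidth claim, the sketch below estimates the transfer per client update from the figures quoted in this issue (2 GB export, ~4 MB .zsync file, up to 99% saving). The function name `estimate_mb` is hypothetical, sizes are rounded to decimal megabytes, and 99% is the upper bound from the benchmark, so real savings depend on how much of the CSV changes between exports.

```shell
#!/bin/sh
# estimate_mb SIZE_MB SAVING_PCT CONTROL_MB
# Hypothetical helper: MB transferred per update if zsync skips SAVING_PCT
# percent of the file and the client also fetches the .zsync control file.
estimate_mb() {
    size_mb="$1"; saving_pct="$2"; control_mb="$3"
    echo $(( size_mb * (100 - saving_pct) / 100 + control_mb ))
}

# Figures from this issue: ~2000 MB export, ~4 MB control file.
echo "full download:     2000 MB"
echo "zsync, 99% saved:  $(estimate_mb 2000 99 4) MB"
echo "zsync, 50% saved:  $(estimate_mb 2000 50 4) MB"
```

Even the pessimistic 50% case would halve the load on the server, which is the resource this issue is trying to protect.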

Implementation:

  • On server and clients: apt install zsync # Debian, Ubuntu
  • Create the .zsync file on the server:
    • Setup: run zsyncmake en.openfoodfacts.org.products.csv every day on the Open Food Facts server; this creates en.openfoodfacts.org.products.csv.zsync (~4 MB). (Implement in ./script/export_database.pl?)
    • zsyncmake man page: https://helpmanual.io/man1/zsyncmake/ ; doc: http://zsync.moria.org.uk/server
    • The -z option can save more bandwidth, but building the compressed file takes up to 2 minutes at 99% CPU
  • Clients: zsync https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv.zsync
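
The steps above could be wired together roughly as follows. This is a minimal sketch, not the project's actual export script: the function names (`control_file_for`, `make_control_file`, `update_export`) and the `EXPORT_FILE` variable are hypothetical, and it assumes the .zsync file is served from the same directory as the CSV.

```shell
#!/bin/sh
# Assumed default: the export file name used in this issue.
EXPORT_FILE="${EXPORT_FILE:-en.openfoodfacts.org.products.csv}"

# Name of the control file zsyncmake is expected to produce for an export.
control_file_for() {
    echo "$1.zsync"
}

# Server side: rebuild the control file after the nightly export.
# Skips with status 0 when the tool or the file is missing, so that a
# missing zsync install would not break the rest of the export.
make_control_file() {
    file="$1"
    if ! command -v zsyncmake >/dev/null 2>&1; then
        echo "zsyncmake not installed, skipping" >&2
        return 0
    fi
    if [ ! -f "$file" ]; then
        echo "export not found: $file, skipping" >&2
        return 0
    fi
    # Run in the export's directory so the control file lands next to it.
    ( cd "$(dirname "$file")" && zsyncmake "$(basename "$file")" )
}

# Client side: fetch only the changed blocks, seeding from a previous copy
# of the CSV (-i) when one is given and exists.
update_export() {
    url="$1"
    seed="${2:-}"
    if [ -n "$seed" ] && [ -f "$seed" ]; then
        zsync -i "$seed" "$url"
    else
        zsync "$url"
    fi
}

make_control_file "$EXPORT_FILE"
```

Once the control file is published, a client could call `update_export https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv.zsync en.openfoodfacts.org.products.csv` each day to refresh its local copy.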


@CharlesNepote CharlesNepote added speed P2 environment Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data 🚅 Performance labels Feb 18, 2019
@CharlesNepote CharlesNepote self-assigned this Feb 18, 2019
@CharlesNepote CharlesNepote changed the title from "Provide a simpler way to get up-to-date and fast CSV file export" to "Provide a fast and simpler way to get up-to-date CSV file export" Feb 27, 2019
@CharlesNepote
Member Author

@stephanegigandet @hangy I investigated zsync for incremental CSV downloads and found that it seems to be a very good, lightweight solution (see above). I would like to add the following call to ./script/export_database.pl:
system "zsyncmake en.openfoodfacts.org.products.csv";

What do you think about it? If it's OK, @stephanegigandet, would you apt install zsync on the (dev?) server?

@hangy
Member

hangy commented Feb 27, 2019

It looks like zsync is pretty much limited to *nix systems. I don't know if that might pose a problem. Otherwise, the idea sounds pretty nice.

@CharlesNepote
Member Author

@hangy zsync seems to be available through Cygwin on Windows, and also as a standalone .exe here: https://app.assembla.com/spaces/zsync-windows/documents

zsync is also available on Mac OS X: http://macappstore.org/zsync/

Projects
Status: To discuss and validate