Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Big memory leak when adding / updating / importing products #2563

Open
Tracked by #8764
stephanegigandet opened this issue Nov 8, 2019 · 18 comments
Open
Tracked by #8764

Big memory leak when adding / updating / importing products #2563

stephanegigandet opened this issue Nov 8, 2019 · 18 comments
Assignees
Labels
Data import Fixed ? This issue might have been fixed. We can close them once we verify that they have. needs investigation P2 🚅 Performance 🏭 Producers Platform https://wiki.openfoodfacts.org/Platform_for_producers

Comments

@stephanegigandet
Copy link
Contributor

stephanegigandet commented Nov 8, 2019

There is a big memory leak when we update a product. It can be seen in particular when we launch the import_csv_file.pl script with a large (e.g. 1000) amount of products : the process keeps growing until it runs out of memory.

For a real product import of a big producer (e.g. 5000 products), the process grows to 10 Gb of memory.

There are many different things that happen when a product is updated: the ingredients lists are processed, allergens and additives detected, quality is measured etc. Each of those steps involve calling many different subroutines, in particular many functions from Tags.pm related to taxonomies.

At this point, we do not know what exact part of the pre-saving processing causes the memory leak (it is very likely that several parts contribute to it).

This needs further investigation, maybe by testing each part individually (e.g. running each function tens of thousands of times and measuring memory), and/or with tools like Devel::Gladiator etc.

Steps to see the memory leak:

  1. Use a CSV file of product data that looks like this:
code lc product_name_en ingredients_text_en        
7000000000000 en Test product Flour, salt, sugar, cranberries, water, lemon, orange, strawberries, pears, grapes        
7000000000001 en Test product Flour, salt, sugar, cranberries, water, lemon, orange, strawberries, pears, grapes        
7000000000002 en Test product Flour, salt, sugar, cranberries, water, lemon, orange, strawberries, pears, grapes        
[..] (with thousands of products)

in /srv/off/scripts/ , run:

./import_csv_file.pl --user_id debug --org debug --source_id debug --source_name debug --source_url debug --define lc=en --csv_file debug-code-product_name-ingredients2.csv

debug-code-product_name-ingredients2.csv.zip

Related discussions/issues: #2053 #2054

This issue is very important to resolve, because it also happens (more gradually) in production on the web site: whenever a product is updated, the Apache mod_perl process grows, and eventually it needs to be restarted.

Part of

@stephanegigandet stephanegigandet added 🐛 bug This is a bug, not a feature request. 🎯 P1 candidate labels Nov 8, 2019
@zigouras
Copy link
Contributor

zigouras commented Nov 10, 2019

I've spent some time trying to profile the import products Perl script.
I tried a number of memory profiling packages.
I was able to get output from Devel::Gladiator but it was hard to understand in some parts and other parts were just not useful. I tried Devel::Leak and Test::Memory::Leak and the process just hung and did not return anything.

In general, I would recommend having as few variables in global scope that are exported and shared between packages as possible. I noticed a lot of global scoping and exports which could be a concern. These are less likely to be garbage-collected. You should prefer variables in methods as they will be garbage-collected once the method is done executing.

I found a nice module called Devel::NYTProf. It does not profile memory so may not be ideal for our case. However, it has a detailed and expressive analysis of code execution times. It gives a graphical output that is very intuitive and rich.
It is easy to install and run without changing any code in your program. I would suggest you run it.
https://metacpan.org/pod/Devel::NYTProf

The output revealed a couple of things that might be worth looking further at:

  1. ProductOpener::Display::remove_tags_and_quote is called a lot. It may be worth cleaning the CSV first with one script and saving it as a clean output file and then import that with a second script.
  2. The initialization of some ProductOpener modules took about 1 second with one call. This may be a one time thing and is related to the extra initialization code in Tags.pm. I would remove as much needless initialization code that is outside of functions and in the global scope. This is more likely to cause memory problems.

Here is a screenshot of just some of the report when I ran the script on 100 products that had already been imported:
image

Here is a tree map of the time spent in each function. It is very informative but may not be helpful for our problems:
image

@zigouras
Copy link
Contributor

zigouras commented Nov 10, 2019

You may also want to make sure all of your CPAN modules are updated to the latest versions as older versions may have bugs that contribute to the memory leaks.

Make sure you don't have any circular references in your code. Perl cannot garbage collect them
https://perlmaven.com/eliminate-circular-reference-memory-leak-using-weaken
There is a module to help you find them:
https://metacpan.org/pod/Devel::Cycle

A good page with some references to Perl best practices to memory management:
https://metacpan.org/pod/MemoryProfiling

I am also looking at the Devel::MAT which seems like the most robust memory profiler suite for Perl. Here is a good tutorial if you want to get started:
http://perladvent.org/2017/2017-12-12.html
Note that you don't have to use a die(); command to get the dump, you can capture it on a signal interrupt (Ctrl-\) or at the end of the script, see https://metacpan.org/pod/Devel::MAT::Dumper

You can also use Memory::Stats to measure memory at checkpoints you specify in your code. You could see how the memory changes before and after certain blocks of code:
https://metacpan.org/pod/Memory::Stats

@stephanegigandet
Copy link
Contributor Author

Hello @zigouras ! Thank you very much for taking the time to do this analysis!

The reports from Devel::NYTProf are very interesting, knowing which function is called the most and from where is going to be very useful, I'll definitely investigate what we can do for ProductOpener::Display::remove_tags_and_quote

I read about the circular references in the past but I wasn't able to identify where I'm creating some, I don't think I create them (at least not intentionaly).

@hangy
Copy link
Member

hangy commented Nov 10, 2019

Generally, according to Devel::MAT, the largest objects (after importing 100 products) are still synonyms (with some 20 MiB each), which should be pretty constant.

I also tried outputting Devel::MAT after every import iteration and ran pmat-leakreport on three files (1st, 50th, and 100th iteration). I'm not sure if any of the potential leaks actually are leaks, but to me, it would not make too much sense to have that many matches of MongoDB instances being alive after 100 iterations ...

@zigouras
Copy link
Contributor

zigouras commented Nov 10, 2019

I ran the latest import csv script on all 1000 items and it only used 0.5 GB of memory and was not increasing on my laptop.
I also used Devel::MAT and the pmap program to analyze the memory dump after 900 records were inserted. I didn't see any suspicious objects, the largest SV was the strtab at about 30 MB. The rest of the objects were distributed pretty evenly in terms of size.

My Devel::MAT analysis:

pmat [more]> count
  Kind         Count (blessed)        Bytes (blessed)
  ARRAY       225525        40     22.2 MiB   3.5 KiB
  CODE         17291        10      2.1 MiB   1.2 KiB
  GLOB         31189         2      4.5 MiB 304 bytes
  HASH        321132      4435    200.6 MiB   1.3 MiB
  INVLIST        490                1.1 MiB          
  IO              35        35      5.5 KiB   5.5 KiB
  PAD          14600                4.8 MiB          
  REF         547867               12.5 MiB          
  REGEXP        3445       201    753.6 KiB  44.0 KiB
  SCALAR     3759076        17    198.2 MiB   1.2 KiB
  STASH         1780                1.5 MiB          
  -------    ------- ---------    --------- ---------
  (total)    4922430      4740    448.3 MiB   1.4 MiB

Thanks @hangy for his analysis. We may want to review how we use MongoDB in Perl. Database connections are always a likely place for performance issues. Do we use connection pooling?
I will keep looking.

@zigouras
Copy link
Contributor

zigouras commented Nov 10, 2019

It looks like certain Perl regular expressions can take up a lot of memory. If a regex contains $&, $` or $' it could be a big problem.
https://stackoverflow.com/questions/165660/why-is-my-perl-regex-using-so-much-memory

I am trying to get https://metacpan.org/pod/Devel::FindAmpersand installed but I am getting errors. It doesn't seem to work on perl later than 5.18.4 per the CPAN unit test so I am downloading that and trying to build it. If anyone can get Devel::FindAmpersand working it seems the best way to find the regex culprits in our code and all included modules.

Update: I built perl 5.18.4 and successfully installed Devel::FindAmpersand with it but couldn't get it to work with other project dependencies.
I used simple grep commands to find the ampersand regexes and found a lot in our ProductOpener library. See my report here. Refactoring these seems like a good option to try to reduce memory.
Update: It looks like Devel::NYTProf reports on these regexes. I will have to inspect its output again, I wasn't looking for it the first time.

@hangy
Copy link
Member

hangy commented Nov 11, 2019

We may want to review how we use MongoDB in Perl. Database connections are always a likely place for performance issues. Do we use connection pooling?

sub get_mongodb_client() {
holds a single $client reference. As per https://metacpan.org/pod/MongoDB::MongoClient, the actual socket is opened lazily and should be open until it goes out of scope. We don't really clean up the $client (which I think is for the web app to reuse the client?), but there shouldn't be any additional instances.

My Devel::MAT analysis shows that regexes use the most memory:

Really? It looks to be constant for me; After 50 products:

  Kind         Count (blessed)        Bytes (blessed)
  ARRAY       224130        39     20.7 MiB   3.4 KiB
  CODE         18558        10      2.3 MiB   1.2 KiB
  GLOB         33159         2      4.8 MiB 304 bytes
  HASH        230322      4451    129.2 MiB   1.3 MiB
  INVLIST        450              981.2 KiB
  IO              37        37      5.8 KiB   5.8 KiB
  PAD          15367                5.0 MiB
  REF         456683               10.5 MiB
  REGEXP        3491       196    763.7 KiB  42.9 KiB
  SCALAR     2206545        17    116.7 MiB   1.2 KiB
  STASH         1827                1.6 MiB
  -------    ------- ---------    --------- ---------
  (total)    3190569      4752    292.5 MiB   1.4 MiB

After 100 products:

pmat> count
  Kind         Count (blessed)        Bytes (blessed)
  ARRAY       224130        39     20.7 MiB   3.4 KiB
  CODE         18558        10      2.3 MiB   1.2 KiB
  GLOB         33159         2      4.8 MiB 304 bytes
  HASH        230322      4451    129.3 MiB   1.3 MiB
  INVLIST        450              981.2 KiB
  IO              37        37      5.8 KiB   5.8 KiB
  PAD          15367                5.0 MiB
  REF         456683               10.5 MiB
  REGEXP        3491       196    763.7 KiB  42.9 KiB
  SCALAR     2210165        17    116.9 MiB   1.2 KiB
  STASH         1827                1.6 MiB
  -------    ------- ---------    --------- ---------
  (total)    3194189      4752    292.7 MiB   1.4 MiB

Maybe I did something wrong (too small sample size might be a problem), but I only see a ~0.2 MiB increase in SCALAR.

@zigouras
Copy link
Contributor

zigouras commented Nov 11, 2019

@hangy Sorry I read the pmap count report wrong, I thought REGEXP was listed in Mb but it is Kb. I do not see it growing over your run.
I still think it is worth investigating the problematic regexes as described above. I am not completely sure how to interpret the output of pmac. It may be that the problematic regexes are storing look-back data as scalars which is why that is the biggest type.

stephanegigandet added a commit that referenced this issue Nov 11, 2019
remove useless calls to remove_tags_and_quote(), bug #2574 #2563
@stephanegigandet
Copy link
Contributor Author

Here is another file to import with 100 real products, with full ingredients lists etc. They trigger much more processing involving ingredients analysis, taxonomies etc.

import-real-products.100.csv.zip

@stephanegigandet
Copy link
Contributor Author

Something I forgot to mention: if you run the import multiple times, then the import script should detect that the product did not change, and there will be less processing (and the products won't be saved again in the disk and Mongo).

For the import script, the only MongoDB use should be when adding/updating each product.

@zigouras
Copy link
Contributor

Thanks for the update @stephanegigandet . I did my memory analysis on a run that was adding 1000 new test products to the system. I deleted all the products on disk and cleared the mongodb table before my run. I will try with your more extensive data set.
I think this is the most promising clue at this point: Regex memory issues

@stephanegigandet
Copy link
Contributor Author

@zigouras Indeed, as you found, I've been using $' and $`extensively, I wasn't aware that they caused memory issues. We can certainly try replacing them to see if it affects the memory usage, using variations of this:

"In general, apply /^(.)(pattern)(.)$/s and use $1 for $`, $2 for $& and $+ for $"

@hangy
Copy link
Member

hangy commented Nov 12, 2019

I think this is the most promising clue at this point: Regex memory issues

I agree that this should be tested. However, it's probably not as bad as the 2008 Stack Overflow answer makes it sound: According to perlvar, the gravest problems with those variables were supposedly fixed in 5.18 (2014) and 5.20 (2015):

In Perl 5.18.0 onwards, perl started noting the presence of each of the three variables separately, and only copied that part of the string required
In Perl 5.20.0 a new copy-on-write system was enabled by default, which finally fixes all performance issues with these three variables, and makes them safe to use anywhere.

@stephanegigandet
Copy link
Contributor Author

Performance wise (maybe not memory wise), it would probably be useful to see if we can optimize get_string_id_for_lang() which is by far the function that is called the most.

E.g. in one of my import test runs:

Calls P F ExclusiveTime InclusiveTime Subroutine
5889063 35 8 143s 174s ProductOpener::Store::get_string_id_for_lang
7453 1 1 52.6s 260s ProductOpener::Tags::spellcheck_taxonomy_tag
13140748 63 1 45.7s 45.7s ProductOpener::Ingredients::CORE:regcomp (opcode)
99 1 1 42.8s 465s ProductOpener::Ingredients::extract_ingredients_classes_from_text
71301254 30 1 31.5s 31.5s ProductOpener::Store::CORE:subst (opcode)

We could try to merge most of the =~ s// calls, and turn as many as possible into tr// calls instead which seem to be less costly.

@zigouras
Copy link
Contributor

zigouras commented Nov 13, 2019

Sounds good guys. Note that Devel::NYTProf reports on the problematic regexes. @hangy makes a good point that newer versions of Perl fixed the issue with these regex classes. However, I still think if we have memory issues it would be worth trying to refactor out these regexes to see if it improves memory.

Question: is the dev and production Perl compiled with thread support or not?

@VaiTon VaiTon changed the title Big memory leak when adding / updating / importing products - Needs investigation Big memory leak when adding / updating / importing products Nov 23, 2019
@github-actions
Copy link
Contributor

github-actions bot commented Jul 5, 2020

Stale issue message

@teolemon teolemon added the 🏭 Producers Platform https://wiki.openfoodfacts.org/Platform_for_producers label Dec 23, 2020
@manon-corneille manon-corneille added the Fixed ? This issue might have been fixed. We can close them once we verify that they have. label Jun 2, 2023
@manon-corneille
Copy link
Contributor

manon-corneille commented Dec 21, 2023

Should we close this issue?

@alexgarel
Copy link
Member

This issue should be tested to see if we still have the bug.

See bug description.

@manon-corneille manon-corneille moved this from Triage needed to To do in 🏭 Producers Platform Jan 23, 2024
@teolemon teolemon moved this to To discuss and validate in 🍊 Open Food Facts Server issues Apr 23, 2024
@teolemon teolemon removed the 🐛 bug This is a bug, not a feature request. label Oct 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Data import Fixed ? This issue might have been fixed. We can close them once we verify that they have. needs investigation P2 🚅 Performance 🏭 Producers Platform https://wiki.openfoodfacts.org/Platform_for_producers
Projects
Status: To discuss and validate
Status: To do
Development

No branches or pull requests

7 participants