Bugfix for consecutive training steps using the same batch #202
There was a bug where consecutive training steps would use the same batch. The number of consecutive training steps using the same batch is equal to `n_cpu`. This has to do with the number of workers prefetching data in the dataloader, and with each of them independently keeping track of its own `data_counter` member in `mmap_batch_generator`. I don't completely understand how the prefetching caused this; all I know is that it did.

The fix substantially changes the training dynamics by bringing a lot more stochasticity into training. In particular, training error is much lower, and validation accuracy and recall are much higher. It's noteworthy that validation false positives are also much higher. The training logic in this codebase is quite idiosyncratic, so it's likely that some of the settings and logic need to be recalibrated to work well with correct minibatch SGD.
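For reference, here is a minimal, self-contained sketch of the failure mode, assuming a PyTorch-style `DataLoader` over an iterable dataset. The class names and internals below are hypothetical, not the real `mmap_batch_generator`: each worker process holds its own copy of the dataset object, so a per-instance counter advances independently in every worker, and the round-robin prefetching makes the consumer see each batch `num_workers` times in a row.

```python
# Minimal sketch of per-worker counter duplication (hypothetical classes,
# not the project's actual mmap_batch_generator).
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class CounterBatches(IterableDataset):
    """Yields batch indices driven by a per-instance counter (the bug)."""

    def __init__(self, n_batches):
        self.n_batches = n_batches
        self.data_counter = 0  # copied into every worker process

    def __iter__(self):
        # Each worker runs this loop on its own copy of data_counter,
        # so every worker produces the same sequence of "batches".
        while self.data_counter < self.n_batches:
            yield self.data_counter  # stand-in for a real batch
            self.data_counter += 1


class ShardedCounterBatches(CounterBatches):
    """One generic fix: shard the counter range across workers."""

    def __iter__(self):
        info = get_worker_info()
        start = 0 if info is None else info.id
        step = 1 if info is None else info.num_workers
        for i in range(start, self.n_batches, step):
            yield i


if __name__ == "__main__":
    # batch_size=None: the dataset already yields "batches" (here, indices).
    buggy = DataLoader(CounterBatches(4), num_workers=2, batch_size=None)
    fixed = DataLoader(ShardedCounterBatches(4), num_workers=2, batch_size=None)
    print(list(buggy))  # e.g. [0, 0, 1, 1, 2, 2, 3, 3] -- each batch num_workers times
    print(list(fixed))  # e.g. [0, 1, 2, 3] -- each batch once
```

Sharding or reseeding the per-worker state (as `ShardedCounterBatches` does via `get_worker_info()`) is one generic way to avoid the duplication; randomized per-worker sampling is another.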
Depending on one's settings for `n_steps`, dataset size, and number of CPUs, it's also much more likely without the fix that the model won't see all the training examples. For example, if you have 100,000 positive examples and use 50 positive examples per batch, an epoch should take 2,000 steps. But if you're on a machine with 56 CPUs, then without the fix your model won't see all the data, because it takes 56x more steps to see the same number of unique data samples (a rough calculation is sketched below).
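As a back-of-envelope check of that arithmetic (using the hypothetical numbers from the example above, not values from any real config):

```python
n_positives = 100_000        # hypothetical dataset size from the example above
positives_per_batch = 50     # hypothetical positives per batch
n_cpu = 56                   # dataloader workers on the example machine

# With the fix, every step consumes a unique batch.
steps_per_epoch_fixed = n_positives // positives_per_batch   # 2_000 steps

# Without the fix, each unique batch is repeated n_cpu times in a row,
# so covering the same unique data takes n_cpu times as many steps.
steps_per_epoch_buggy = steps_per_epoch_fixed * n_cpu        # 112_000 steps
```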
Some indicative training profiles from my own data are below. Grey is without the fix, blue is with the fix.
Note that I made a number of other changes to the training code as part of the runs that gave the above profiles, most notably disabling the `max_negative_weight` scaling, so the comparison is merely indicative. A comparison based on the state of the code in this PR, after reverting my other changes, is below: