Bugfix for consecutive training steps using the same batch #202

Open
wants to merge 1 commit into main
Conversation


@oadams commented on Sep 16, 2024

There was a bug where consecutive training steps would use the same batch. The number of consecutive training steps using the same batch is equal to `n_cpu`. This is down to the workers prefetching data in the dataloader, with each of them independently keeping track of its own `data_counter` member in `mmap_batch_generator`. I don't completely understand how the prefetching caused this; all I know is that it did.
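
To illustrate the failure mode, here's a minimal sketch assuming a PyTorch-style `DataLoader`; the class and numbers are hypothetical stand-ins, not the project's actual code. Each worker process gets its own copy of the dataset object, so each worker's counter advances independently and the workers replay the same batch sequence in round-robin order:

```python
# Hypothetical stand-in for `mmap_batch_generator`: an IterableDataset whose only
# state is an internal counter. With num_workers > 0, every worker process gets
# its own copy of this object, and therefore its own `data_counter`.
from torch.utils.data import DataLoader, IterableDataset


class CounterBatchGenerator(IterableDataset):
    def __init__(self, n_batches):
        self.n_batches = n_batches
        self.data_counter = 0  # duplicated into each worker process

    def __iter__(self):
        while self.data_counter < self.n_batches:
            yield self.data_counter  # a real generator would yield a batch of samples
            self.data_counter += 1


if __name__ == "__main__":
    loader = DataLoader(CounterBatchGenerator(n_batches=3), batch_size=None, num_workers=4)
    # Workers are drained round-robin, so with 4 workers the first 4 training steps
    # all see "batch" 0, the next 4 see "batch" 1, and so on:
    print([int(b) for b in loader])  # -> [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

A common way to avoid this kind of duplication in general is to shard work per worker using `torch.utils.data.get_worker_info()`, though that is not necessarily how this PR resolves it.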

The fix substantially changes the training dynamics by bringing a lot more stochasticity into training. In particular, training error is much lower, and validation accuracy and recall are much higher. It's noteworthy that validation false positives are also much higher. The training logic in this codebase is quite idiosyncratic, so it's likely that some of the settings and logic need to be recalibrated to work well with correct minibatch SGD.

Depending on one's settings for `n_steps`, dataset size, and number of CPUs, it's also much more likely without the fix that the model won't see all the training examples. For example, if you have 100,000 positive examples and are using 50 positive examples per batch, then an epoch should take 2,000 steps. But if you're on a machine with 56 CPUs, then without the fix your model won't see all the data, because it takes 56x more steps to see the same number of unique samples.
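
As a quick sanity check of that arithmetic (using the illustrative numbers above, not the project's defaults):

```python
# Back-of-the-envelope check of the epoch-length example above.
positives = 100_000
positives_per_batch = 50
n_cpu = 56  # number of dataloader workers on the hypothetical machine

steps_per_epoch = positives // positives_per_batch  # 2,000 unique batches per epoch
steps_needed_with_bug = steps_per_epoch * n_cpu     # 112,000 steps to see them all
print(steps_per_epoch, steps_needed_with_bug)       # 2000 112000
```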

Some indicative training profiles on my own data are below. Grey is without the fix, blue is with the fix.

[Two screenshots (2024-09-16) of the training profiles described above]

Note that I made a number of other changes to the training code as part of the runs that gave the above profiles, most notably disabling the `max_negative_weight` scaling, so the comparison is merely indicative. A comparison based on the state of the code in this PR, after reverting my other changes, is below:


|        | Accuracy | Recall | FPPH   |
|--------|----------|--------|--------|
| No fix | 0.674    | 0.350  | 1.150  |
| Fix    | 0.764    | 0.534  | 17.080 |
