Bugfix for consecutive training steps using the same batch #202
There was a bug where consecutive training steps would use the same batch. The number of consecutive training steps using the same batch is equal to `n_cpu`. This has to do with the number of workers prefetching data in the dataloader, and with each of them independently keeping track of its own `data_counter` member in `mmap_batch_generator`. I don't completely understand how the prefetching caused this; all I know is that it did.

The fix substantially changes the training dynamics by bringing a lot more stochasticity into training. In particular, training error is much lower, and validation accuracy and recall are much higher. It's noteworthy that validation false positives are also much higher. The training logic in this codebase is quite idiosyncratic, so it's likely that some of the settings and logic need to be recalibrated to work well with correct minibatch SGD.
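For reference, here is a minimal, self-contained sketch of the failure mode, assuming a PyTorch-style `DataLoader` over an iterable dataset. The class names and internals below are hypothetical, not the real `mmap_batch_generator`: each worker process holds its own copy of the dataset object, so a per-instance counter advances independently in every worker, and the round-robin prefetching makes the consumer see each batch `num_workers` times in a row.

```python
# Minimal sketch of per-worker counter duplication (hypothetical classes,
# not the project's actual mmap_batch_generator).
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class CounterBatches(IterableDataset):
    """Yields batch indices driven by a per-instance counter (the bug)."""

    def __init__(self, n_batches):
        self.n_batches = n_batches
        self.data_counter = 0  # copied into every worker process

    def __iter__(self):
        # Each worker runs this loop on its own copy of data_counter,
        # so every worker produces the same sequence of "batches".
        while self.data_counter < self.n_batches:
            yield self.data_counter  # stand-in for a real batch
            self.data_counter += 1


class ShardedCounterBatches(CounterBatches):
    """One generic fix: shard the counter range across workers."""

    def __iter__(self):
        info = get_worker_info()
        start = 0 if info is None else info.id
        step = 1 if info is None else info.num_workers
        for i in range(start, self.n_batches, step):
            yield i


if __name__ == "__main__":
    # batch_size=None: the dataset already yields "batches" (here, indices).
    buggy = DataLoader(CounterBatches(4), num_workers=2, batch_size=None)
    fixed = DataLoader(ShardedCounterBatches(4), num_workers=2, batch_size=None)
    print(list(buggy))  # e.g. [0, 0, 1, 1, 2, 2, 3, 3] -- each batch num_workers times
    print(list(fixed))  # e.g. [0, 1, 2, 3] -- each batch once
```

Sharding or reseeding the per-worker state (as `ShardedCounterBatches` does via `get_worker_info()`) is one generic way to avoid the duplication; randomized per-worker sampling is another.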
Depending on one's settings for `n_steps`, dataset size, and number of CPUs, it's also much more likely without the fix that the model won't see all the training examples. For example, if you have 100,000 positive examples and use 50 positive examples per batch, an epoch should take 2,000 steps. But if you're on a machine with 56 CPUs, then without the fix your model won't see all the data, because it takes 56x more steps to see the same number of unique data samples (a rough calculation is sketched below).
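As a back-of-envelope check of that arithmetic (using the hypothetical numbers from the example above, not values from any real config):

```python
n_positives = 100_000        # hypothetical dataset size from the example above
positives_per_batch = 50     # hypothetical positives per batch
n_cpu = 56                   # dataloader workers on the example machine

# With the fix, every step consumes a unique batch.
steps_per_epoch_fixed = n_positives // positives_per_batch   # 2_000 steps

# Without the fix, each unique batch is repeated n_cpu times in a row,
# so covering the same unique data takes n_cpu times as many steps.
steps_per_epoch_buggy = steps_per_epoch_fixed * n_cpu        # 112_000 steps
```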
Some indicative training profiles from my own data are below. Grey is without the fix, blue is with the fix.
Note that I made a number of other changes to the training code as part of the runs that gave the above profiles, most notably disabling the `max_negative_weight` scaling, so the comparison is merely indicative. A comparison based on the state of the code in this PR, after reverting my other changes, is below: