Problem with large training dataset #1368
-
Hi, nnU-Net's resource usage does not increase with more training cases, so there is no upper limit to the size of the dataset you can use. We have trained nnU-Nets with well over 4k training samples just fine. What is strange to me is that no real error message is given; all we see is that some background worker died an unexpected death.
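If the job runs under SLURM, one thing worth checking is whether the scheduler killed a worker for exceeding the memory limit, since in that case the worker just disappears without a Python traceback. A minimal sketch (assuming `sacct` is available on your cluster; the job id is a placeholder):

```python
# Sketch: query SLURM accounting for the failed job and look for an
# OUT_OF_MEMORY state or a MaxRSS close to the requested memory.
import subprocess

job_id = "123456"  # placeholder: replace with the actual SLURM job id

result = subprocess.run(
    ["sacct", "-j", job_id, "--format=JobID,State,ExitCode,MaxRSS,ReqMem"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```

A State of OUT_OF_MEMORY (or a MaxRSS right at ReqMem) would point to the background workers being killed by the memory limit rather than a bug in the code.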
-
Thank you for this great repository.
-
Hello,
I have problems with training when using a very large training sample. In particular, I tried to train nnU-Net with 1470 cases in the training sample, each comprising one T1w MRI brain scan as the input modality and a label map with one label.
I also tried reducing the number of images in the training sample to find the rough upper limit of cases that can be handled during training. With 1050 cases it still didn't work, but after reducing the training sample to 700 cases, training worked fine.
The error message I get for very large training samples refers to "multi_threaded_augmenter.py" and usually occurs during one of the first epochs; see below for a copy of the error message. I also tried increasing the memory allowed for the SLURM job to very large values (e.g. 1 task x 24 CPUs x 20,000 MB), but this didn't help either.
Is there a strategy for working with such large training samples? For example, would deactivating data augmentation help, and is there an easy way to do this?
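For instance, here is a rough, untested sketch of what I have in mind: instead of disabling augmentation entirely, reduce the number of background augmentation workers. This assumes the `nnUNet_n_proc_DA` environment variable is what controls how many processes the MultiThreadedAugmenter spawns, and that the v1 `nnUNet_train` entry point is used; the task name and fold below are placeholders:

```python
# Rough sketch (untested): fewer augmentation workers should lower the total
# memory footprint of the background processes.
import os
import subprocess

os.environ["nnUNet_n_proc_DA"] = "6"  # assumed env var for the number of DA workers

# Placeholder task/fold; replace with the actual task name and fold.
subprocess.run(
    ["nnUNet_train", "3d_fullres", "nnUNetTrainerV2", "Task500_Brain", "0"],
    check=True,
)
```

Would something along these lines be a sensible approach, or is there a better way?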
Thank you for any help and suggestions!
Benno
PS:
Here is an example of the full error message: