Torch.save() for large training Dataset #56

Open

stelmath opened this issue Mar 30, 2023 · 1 comment

@stelmath

Hello,

I know this is not directly related to awesome-align, but I have a large training set of 10M source/target pairs, and preprocessing it takes 4 hours before the actual model training even starts. So I used the --cache_data option provided, but when the code reaches torch.save(self.examples, cache_fn), it takes forever and exhausts my RAM (60 GB is not enough on a VM; on my 32 GB PC it swaps to the HDD and then takes forever to complete). Would you know of any alternative way to save the Dataset and reuse it before training, to skip the loading overhead?

Many thanks
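
For anyone hitting the same wall: one possible workaround (a sketch only, not something awesome-align provides; the shard size, file naming, and helper names below are made up for illustration) is to write the cache in fixed-size shards, so that a single torch.save() call never has to pickle all 10M examples at once:

```python
import torch

def save_sharded(examples, prefix, shard_size=100_000):
    """Write the cached examples in fixed-size shards so that one
    torch.save() call never has to pickle the whole list at once."""
    num_shards = 0
    for i in range(0, len(examples), shard_size):
        torch.save(examples[i:i + shard_size],
                   f"{prefix}.shard{num_shards:05d}.pt")
        num_shards += 1
    return num_shards

def load_sharded(prefix, num_shards):
    """Read the shards back and rebuild the examples list."""
    examples = []
    for s in range(num_shards):
        examples.extend(torch.load(f"{prefix}.shard{s:05d}.pt"))
    return examples
```

Loading the shards back still needs enough RAM to hold the examples themselves, but this avoids the extra full-size pickle buffer that a single monolithic save allocates on top of them.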

@zdou0830
Collaborator

Hello, thanks! I think you can subsample the data, since fine-tuning BERT for word alignment shouldn't require a lot of parallel data.
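
A minimal sketch of that subsampling, assuming the usual one-pair-per-line src ||| tgt input format (the file names and sample size here are illustrative, not prescribed):

```python
import random

random.seed(0)  # make the sample reproducible

# Read all parallel pairs; plain text lines fit in RAM even at 10M pairs.
with open("train.src-tgt", encoding="utf-8") as f:
    lines = f.readlines()

# Keep a random subset; k is illustrative and can be tuned.
subset = random.sample(lines, k=200_000)

with open("train.sub.src-tgt", "w", encoding="utf-8") as f:
    f.writelines(subset)
```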
