Torch.save() for large training Dataset #56

Open

stelmath opened this issue Mar 30, 2023 · 1 comment

@stelmath

Hello,

I know this is not directly related to awesome-align, but I have a large training set of 10M source/target pairs, and preprocessing it takes 4 hours before the actual model training even starts. So I used the --cache_data option provided, but when the code reaches torch.save(self.examples, cache_fn), it takes forever and exhausts my RAM (60 GB is not enough on a VM; on my 32 GB PC it swaps to the HDD and then takes forever to complete). Would you know of any alternative way to save the Dataset and reuse it before training, to skip the loading overhead?

Many thanks
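
For anyone hitting the same wall: one possible workaround (a sketch only, not something awesome-align provides; the shard size, file naming, and helper names below are made up for illustration) is to write the cache in fixed-size shards, so that a single torch.save() call never has to pickle all 10M examples at once:

```python
import torch

def save_sharded(examples, prefix, shard_size=100_000):
    """Write the cached examples in fixed-size shards so that one
    torch.save() call never has to pickle the whole list at once."""
    num_shards = 0
    for i in range(0, len(examples), shard_size):
        torch.save(examples[i:i + shard_size],
                   f"{prefix}.shard{num_shards:05d}.pt")
        num_shards += 1
    return num_shards

def load_sharded(prefix, num_shards):
    """Read the shards back and rebuild the examples list."""
    examples = []
    for s in range(num_shards):
        examples.extend(torch.load(f"{prefix}.shard{s:05d}.pt"))
    return examples
```

Loading the shards back still needs enough RAM to hold the examples themselves, but this avoids the extra full-size pickle buffer that a single monolithic save allocates on top of them.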

@zdou0830
Collaborator

Hello, thanks! I think you can subsample the data, since fine-tuning BERT for word alignment shouldn't require a lot of parallel data.
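
A minimal sketch of that subsampling, assuming the usual one-pair-per-line src ||| tgt input format (the file names and sample size here are illustrative, not prescribed):

```python
import random

random.seed(0)  # make the sample reproducible

# Read all parallel pairs; plain text lines fit in RAM even at 10M pairs.
with open("train.src-tgt", encoding="utf-8") as f:
    lines = f.readlines()

# Keep a random subset; k is illustrative and can be tuned.
subset = random.sample(lines, k=200_000)

with open("train.sub.src-tgt", "w", encoding="utf-8") as f:
    f.writelines(subset)
```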
