Add support for TPU devices #173
Comments
hi there, after a little digging into this, trying to refactor it following this notebook: https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/getting-started.ipynb. We should be able to support it along these lines.
// @ydcjeff
@sayantan1410 I updated the issue description, adding a few initial steps on how I would tackle this issue.
@vfdev-5 I will start working as per the description, and will let you know if I run into any problems.
@vfdev-5 Hello, should I do this from the code-generator website or by running it locally, or do both work?
From code-generator, exporting to Colab as I explained. You can't test this locally, as you need TPUs.
@vfdev-5 Got it, thank you.
@vfdev-5 Hello, I am facing an issue running the colab notebook. I tried to install torch_xla manually and then start the training. Here's the link to the notebook: https://colab.research.google.com/drive/15tlo1Js4vCXSDB5yqLQJ9byvwvtoEuEU?usp=sharing. The main problem is that whenever I run '!pip install -r requirements.txt', it uninstalls the torch 1.9 version and reinstalls another version. But without installing requirements.txt it does not work either. Can you please check it and let me know what I am missing?
@sayantan1410 can you please update your colab and show where you call '!pip install -r requirements.txt'? If you check the content of requirements.txt:

torch>=1.10.1
torchvision>=0.11.2
pytorch-ignite>=0.4.7
pyyaml

so, it is expected that it reinstalls torch and torchvision. You can relax the pins like this:

- torch>=1.10.1
+ torch
- torchvision>=0.11.2
+ torchvision
pytorch-ignite>=0.4.7
pyyaml
@vfdev-5 Okay, I will try removing those two from the requirements.txt. And setting the accelerator to TPU was described in the colab notebook that you linked in the description.
@vfdev-5 Hey, I was trying to change the requirements.txt from here
@sayantan1410 the issue in colab is not with the dependencies but with the way you start the training. Please read the ignite docs on how to launch distributed training.
@vfdev-5 Hey, I tried to run the code but am getting an "Aborted: Session 96f4ae2c056673d1 is not found" error. Can I start working on the UI in the meantime, while I try to solve the colab issue?
Yes, that's the final goal of the issue. The work with colab is step 0, to check whether things can work.
@vfdev-5 > "Aborted: Session 96f4ae2c056673d1 is not found" Can you please check why this is happening?
Looks like an internal issue with TPUs on Colab; try "factory reset runtime" and see if the issue persists. If you have a Kaggle account, you can also test the same code on their TPUs.
@vfdev-5 I tried "factory reset runtime", but that did not work. I will try running it on a Kaggle notebook.
Clear and concise description of the problem
It would be good to provide an option to select the accelerator as TPU instead of GPU.
We can also auto-select the TPU accelerator when opened in Colab, and add torch_xla installation steps.
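The auto-selection mentioned above could be sketched roughly as follows. This is a minimal sketch under assumptions: the function name `pick_backend` is hypothetical and not part of the code-generator templates; the fallback to nccl mirrors the GPU default discussed in this thread.

```python
import importlib.util

def pick_backend() -> str:
    # Hypothetical helper: prefer the xla-tpu backend when torch_xla is
    # importable (e.g. on a Colab TPU runtime with torch_xla installed),
    # otherwise fall back to nccl for GPU training.
    if importlib.util.find_spec("torch_xla") is not None:
        return "xla-tpu"
    return "nccl"

print(pick_backend())
```

A check like this could also drive the UI default, so that a notebook opened in Colab with a TPU runtime preselects the TPU accelerator.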
What to do:
0) Try a template with TPUs. Choose the distributed training option with 8 processes and the spawning option. "Open in colab" one template, for example the vision classification template, manually install torch_xla (see https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx) and run the code with the xla-tpu backend:

python main.py --nproc_per_node 8 --backend xla-tpu
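As a rough illustration, the flags in this command could be parsed in a template's main.py along the lines of the sketch below. The parser setup is an assumption for illustration, not the templates' actual code; the backend choices listed (nccl, gloo, xla-tpu) are the ones pytorch-ignite's distributed module supports.

```python
import argparse

# Hypothetical sketch of a CLI matching the command above:
#   python main.py --nproc_per_node 8 --backend xla-tpu
parser = argparse.ArgumentParser(description="template training entry point (sketch)")
parser.add_argument("--backend", default=None,
                    choices=["nccl", "gloo", "xla-tpu"],
                    help="distributed backend to pass to ignite.distributed")
parser.add_argument("--nproc_per_node", type=int, default=None,
                    help="processes to spawn per node (8 for a TPU v2/v3 core count)")

# Simulate the command line from the step above:
args = parser.parse_args(["--nproc_per_node", "8", "--backend", "xla-tpu"])
print(args.backend, args.nproc_per_node)  # xla-tpu 8
```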
If everything is done correctly, the training should run.

Suggested solution
Alternative
Additional context