Add support for TPU devices #173

Open
vfdev-5 opened this issue Jul 20, 2021 · 18 comments · May be fixed by #201
Labels
enhancement New feature or request hacktoberfest Hacktoberfest

Comments

@vfdev-5
Member

vfdev-5 commented Jul 20, 2021

Clear and concise description of the problem

It would be good to provide an option to select TPU as the accelerator instead of GPU.
We could also auto-select the TPU accelerator when a template is opened in Colab, and add torch_xla installation steps.

What to do:
0) Try a template with TPUs. Choose the distributed training option with 8 processes and the spawning option. Use "Open in Colab" on one template (for example, the vision classification template), manually install torch_xla (see https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx), and run the code with the xla-tpu backend: python main.py --nproc_per_node 8 --backend xla-tpu. If everything is set up correctly, training should run.

  1. Update UI
  • Add a drop-down menu for backend selection, "nccl" or "xla-tpu", in "Training Options".
  • When the user selects "xla-tpu", training should only be distributed with 8 processes and "Run the training with torch.multiprocessing.spawn".
  2. Update content: README.md and other impacted files.
  3. If exported to Colab, make sure that the accelerator is set to "TPU".
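
The UI rule in step 1 could be sketched roughly as below. This is only an illustration in Python of the constraint described above, not the code-generator's actual implementation (the real UI is a Vue app); the names `TrainingOptions`, `normalize_options`, and `TPU_PROCESSES` are hypothetical.

```python
from dataclasses import dataclass

# The two backend choices proposed for the drop-down menu.
SUPPORTED_BACKENDS = ("nccl", "xla-tpu")
# Process count this issue prescribes for the xla-tpu backend.
TPU_PROCESSES = 8


@dataclass(frozen=True)
class TrainingOptions:
    """Hypothetical container for the "Training Options" form values."""
    backend: str
    nproc_per_node: int
    use_spawn: bool


def normalize_options(backend, nproc_per_node=1, use_spawn=False):
    """Apply the constraint from step 1 to the user's selection."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")
    if backend == "xla-tpu":
        # TPU runs are always distributed over 8 processes and started
        # with the spawn option, whatever the user entered.
        return TrainingOptions(backend, TPU_PROCESSES, True)
    return TrainingOptions(backend, nproc_per_node, use_spawn)
```

With this shape, selecting "xla-tpu" overrides any other process count or launch choice, while "nccl" keeps whatever the user picked.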

Suggested solution

Alternative

Additional context

@afzal442
Contributor

afzal442 commented Oct 3, 2021

Hi there, I am trying to refactor this after a little digging, following the approach in https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/getting-started.ipynb

We could support it along these lines.
cc @vfdev-5

@afzal442
Contributor

afzal442 commented Oct 3, 2021

Something like the following:
image

@afzal442
Contributor

afzal442 commented Oct 4, 2021

// @ydcjeff

@vfdev-5
Member Author

vfdev-5 commented Jan 3, 2022

@sayantan1410 I updated the issue description, adding a few initial steps on how I would tackle this issue.

@sayantan1410
Contributor

@vfdev-5 I will start working from the description and will let you know if I run into any problems.

@sayantan1410
Contributor

> 0) Try a template with TPUs. Choose the distributed training option with 8 processes and the spawning option. Use "Open in Colab" on one template (for example, the vision classification template), manually install torch_xla (see https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx), and run the code with the xla-tpu backend: python main.py --nproc_per_node 8 --backend xla-tpu. If everything is set up correctly, training should run.

@vfdev-5 Hello, should I do this from the code-generator website or by running it locally, or do both work?

@vfdev-5
Member Author

vfdev-5 commented Jan 4, 2022

> 0) Try a template with TPUs. Choose the distributed training option with 8 processes and the spawning option. Use "Open in Colab" on one template (for example, the vision classification template), manually install torch_xla (see https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx), and run the code with the xla-tpu backend: python main.py --nproc_per_node 8 --backend xla-tpu. If everything is set up correctly, training should run.

> @vfdev-5 Hello, should I do this from the code-generator website or by running it locally, or do both work?

From the code-generator website, exporting to Colab as I explained. You can't test this locally because it requires TPUs.

@sayantan1410
Contributor

@vfdev-5 Got it, Thank you.

@sayantan1410
Contributor

@vfdev-5 Hello, I am having an issue running the Colab notebook. I tried to install torch_xla manually and then start the training. Here is the link to the notebook: https://colab.research.google.com/drive/15tlo1Js4vCXSDB5yqLQJ9byvwvtoEuEU?usp=sharing. The main problem is that whenever I run '!pip install -r requirements.txt', it uninstalls torch 1.9 and installs another version. But without installing requirements.txt, it does not work either.

Can you please check it and let me know what I am missing?

@vfdev-5
Member Author

vfdev-5 commented Jan 5, 2022

@sayantan1410 can you please update your Colab notebook and show where you call !pip install -r requirements.txt and the output it gives? By the way, I also forgot to mention in the description that we need to set the accelerator to TPU (looks like you already set it).

If you check the content of requirements.txt:

torch>=1.10.1
torchvision>=0.11.2
pytorch-ignite>=0.4.7
pyyaml

so it is expected to reinstall torch. You need to temporarily update it as below:

- torch>=1.10.1
+ torch
- torchvision>=0.11.2
+ torchvision
pytorch-ignite>=0.4.7
pyyaml
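
If requirements.txt cannot be edited by hand, the same temporary change could be scripted in a notebook cell. The sketch below is an assumption about how one might do it, not part of the templates; `relax_pins` is a hypothetical helper that drops the version specifier for the listed packages and leaves every other line untouched.

```python
import re


def relax_pins(requirements_text, packages=("torch", "torchvision")):
    """Return the requirements text with version pins removed for `packages`."""
    relaxed = []
    for line in requirements_text.splitlines():
        # The package name is everything before the first specifier
        # character (>=, ==, <, ...), so "torch>=1.10.1" yields "torch".
        match = re.match(r"^\s*([A-Za-z0-9_.\-]+)", line)
        name = match.group(1) if match else None
        # Keep only the bare name for the selected packages.
        relaxed.append(name if name in packages else line)
    return "\n".join(relaxed)


original = "torch>=1.10.1\ntorchvision>=0.11.2\npytorch-ignite>=0.4.7\npyyaml"
print(relax_pins(original))
```

The result could then be written back to requirements.txt before running pip, so that the torch build already installed alongside torch_xla is not replaced.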

@sayantan1410
Contributor

sayantan1410 commented Jan 5, 2022

@vfdev-5 Okay, I will try removing those two from requirements.txt and run it again. Also, setting the accelerator to TPU was mentioned in the Colab notebook that you linked in the description.

@sayantan1410
Contributor

@vfdev-5 Hey, I was trying to change requirements.txt from here:
Screenshot (12)
But I cannot edit it. So I tried to install the other libraries manually, but the same problem persists. The Colab link is the same as before. Let me know if there is another way to change requirements.txt.

@vfdev-5
Member Author

vfdev-5 commented Jan 6, 2022

@sayantan1410 the issue in Colab is not with the dependencies but with the way you start training. Please read the ignite docs on idist.Parallel and also see steps 1 and 2 of this issue description: you have to use another backend, xla-tpu instead of None.
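
For reference, a minimal sketch of starting training through idist.Parallel with the xla-tpu backend. It assumes pytorch-ignite and torch_xla are installed on a TPU runtime; the `training` body and the `launch` wrapper are hypothetical placeholders, not the template's actual code.

```python
def training(local_rank, config):
    """Placeholder training function.

    idist.Parallel calls it once per process, passing the local rank
    first, followed by the arguments given to `parallel.run`.
    """
    ...


def launch(config, backend="xla-tpu", nproc_per_node=8):
    """Spawn `nproc_per_node` processes on the given backend and run training."""
    # Imported lazily so this file can be loaded without ignite installed;
    # calling launch() requires pytorch-ignite, and torch_xla for "xla-tpu".
    import ignite.distributed as idist

    with idist.Parallel(backend=backend, nproc_per_node=nproc_per_node) as parallel:
        parallel.run(training, config)
```

On a Colab TPU runtime this would be called as `launch(config)`; passing `backend=None` instead corresponds to the non-distributed start that does not use the TPU cores.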

@sayantan1410
Contributor

sayantan1410 commented Jan 6, 2022

@vfdev-5 Hey, I tried to run the code but I am getting an "Aborted: Session 96f4ae2c056673d1 is not found" error. Can I start working on the UI in the meantime, while I try to solve the Colab issue?
Link to the notebook - https://colab.research.google.com/drive/15tlo1Js4vCXSDB5yqLQJ9byvwvtoEuEU?usp=sharing

@vfdev-5
Member Author

vfdev-5 commented Jan 6, 2022

> Can I start working on the UI in the meantime, while I try to solve the Colab issue?

Yes, that's the final goal of the issue. The Colab work is step 0, to check whether things can work.

@sayantan1410
Contributor

@vfdev-5 > "Aborted: Session 96f4ae2c056673d1 is not found"

Can you please check why this error occurs?

@vfdev-5
Member Author

vfdev-5 commented Jan 6, 2022

Looks like an internal issue with TPUs on Colab, try to do "factory reset runtime" and see if the issue persists. If you have a Kaggle account, you can also check the same code on their TPUs.

@sayantan1410
Contributor

@vfdev-5 I tried doing "factory reset runtime", but that did not work. I will try running it on Kaggle notebooks.
Also, for the UI update part, I have done something like this:
Screenshot (14)
Should I populate the dropdown by creating "backend.json" and then loading it in "TabTemplates.vue", or is there a better way?
Also, what is the next step?

@sayantan1410 sayantan1410 linked a pull request Jan 10, 2022 that will close this issue
4 participants