Add support for TPU devices #173
Comments
hi there, after a little digging into this, trying to refactor it following this notebook: https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/getting-started.ipynb. We should be able to support it along these lines.
// @ydcjeff
@sayantan1410 I updated the issue description, adding a few initial steps on how I would tackle this issue.
@vfdev-5 I will start working as per the description, and will let you know if I run into any problems.
@vfdev-5 Hello, should I do this from the code-generator website or by running it locally, or do both work?
From code-generator, exporting to Colab as I explained. You can't test this locally, as you need TPUs.
@vfdev-5 Got it, thank you.
@vfdev-5 Hello, I am facing an issue running the colab notebook. I tried to install torch_xla manually and then start the training. Here's the link to the notebook: https://colab.research.google.com/drive/15tlo1Js4vCXSDB5yqLQJ9byvwvtoEuEU?usp=sharing. The main problem is that whenever I run '!pip install -r requirements.txt', it uninstalls the torch 1.9 version and reinstalls another version. But without installing requirements.txt it does not work either. Can you please check it and let me know what I am missing?
@sayantan1410 can you please update your colab and show where you call '!pip install -r requirements.txt'? If you check the content of requirements.txt:

torch>=1.10.1
torchvision>=0.11.2
pytorch-ignite>=0.4.7
pyyaml

so, it is expected that it reinstalls torch and torchvision. You can relax the pins like this:

- torch>=1.10.1
+ torch
- torchvision>=0.11.2
+ torchvision
pytorch-ignite>=0.4.7
pyyaml
@vfdev-5 Okay, I will try removing those two from the requirements.txt. And setting the accelerator to TPU was described in the colab notebook that you linked in the description.
@vfdev-5 Hey, I was trying to change the requirements.txt from here
@sayantan1410 the issue in colab is not with the dependencies but with the way you start the training. Please read the ignite docs on how to launch distributed training.
@vfdev-5 Hey, I tried to run the code but am getting an "Aborted: Session 96f4ae2c056673d1 is not found" error. Can I start working on the UI in the meantime, while I try to solve the colab issue?
Yes, that's the final goal of the issue. The work with colab is step 0, to check whether things can work.
@vfdev-5 > "Aborted: Session 96f4ae2c056673d1 is not found" Can you please check why this is happening?
Looks like an internal issue with TPUs on Colab; try "factory reset runtime" and see if the issue persists. If you have a Kaggle account, you can also test the same code on their TPUs.
@vfdev-5 I tried "factory reset runtime", but that did not work. I will try running it on a Kaggle notebook.
Clear and concise description of the problem
It would be good to provide an option to select the accelerator as TPU instead of GPU.
We can also auto-select the TPU accelerator when opened in Colab, and add torch_xla installation steps.
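The auto-selection mentioned above could be sketched roughly as follows. This is a minimal sketch under assumptions: the function name `pick_backend` is hypothetical and not part of the code-generator templates; the fallback to nccl mirrors the GPU default discussed in this thread.

```python
import importlib.util

def pick_backend() -> str:
    # Hypothetical helper: prefer the xla-tpu backend when torch_xla is
    # importable (e.g. on a Colab TPU runtime with torch_xla installed),
    # otherwise fall back to nccl for GPU training.
    if importlib.util.find_spec("torch_xla") is not None:
        return "xla-tpu"
    return "nccl"

print(pick_backend())
```

A check like this could also drive the UI default, so that a notebook opened in Colab with a TPU runtime preselects the TPU accelerator.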
What to do:
0) Try a template with TPUs. Choose the distributed training option with 8 processes and the spawning option. "Open in colab" one template, for example the vision classification template, manually install torch_xla (see https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx) and run the code with the xla-tpu backend:

python main.py --nproc_per_node 8 --backend xla-tpu
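As a rough illustration, the flags in this command could be parsed in a template's main.py along the lines of the sketch below. The parser setup is an assumption for illustration, not the templates' actual code; the backend choices listed (nccl, gloo, xla-tpu) are the ones pytorch-ignite's distributed module supports.

```python
import argparse

# Hypothetical sketch of a CLI matching the command above:
#   python main.py --nproc_per_node 8 --backend xla-tpu
parser = argparse.ArgumentParser(description="template training entry point (sketch)")
parser.add_argument("--backend", default=None,
                    choices=["nccl", "gloo", "xla-tpu"],
                    help="distributed backend to pass to ignite.distributed")
parser.add_argument("--nproc_per_node", type=int, default=None,
                    help="processes to spawn per node (8 for a TPU v2/v3 core count)")

# Simulate the command line from the step above:
args = parser.parse_args(["--nproc_per_node", "8", "--backend", "xla-tpu"])
print(args.backend, args.nproc_per_node)  # xla-tpu 8
```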
If everything is done correctly, the training should run.

Suggested solution
Alternative
Additional context