
Distributed training on multiple GPUs gets stuck. #529

Open
NotoCJ opened this issue Dec 30, 2024 · 1 comment

NotoCJ commented Dec 30, 2024

Hello,
I was training RT-DETRv2 on multiple GPUs. The training gets stuck whenever the batch on one GPU consists entirely of background images (i.e., images with no ground-truth/target boxes): the volatile GPU-util goes to 100%, no error or warning is reported, and training simply hangs and never continues. In that case the VFL loss is very small (0.0528) and the L1 and GIoU box losses are both 0. The hang happens inside scaler.backward or scaler.step (or possibly scaler.update). When the batch contains at least some target boxes, training runs fine. Incidentally, when training with the gloo backend it got stuck in the same situation and the following error was reported:

[screenshot of the gloo backend error attached in the original issue]

Thanks for your attention!
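For reference, the hang shows up in a standard torch.cuda.amp training step like the sketch below (a simplified loop with placeholder `model`, `criterion`, `dataloader`, and `optimizer` arguments, not the exact code from this repo):

```python
import torch

def train_one_epoch(model, criterion, dataloader, optimizer, scaler):
    """Simplified AMP training step (sketch, not the exact repo code)."""
    model.train()
    for samples, targets in dataloader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            outputs = model(samples)
            loss_dict = criterion(outputs, targets)
            loss = sum(loss_dict.values())
        # Training hangs around here when every image in the batch has no GT boxes:
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```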

lyuwenyu (Owner) commented:

  1. You can filter out samples with no target boxes before training,
  2. or, when a batch has no target boxes, compute the box loss as something like `loss_box = (pred_box * 0).sum()`, so the backward pass still runs over the same graph on every rank; see the sketch below.
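A rough sketch of what both suggestions could look like in practice (names such as `annotations`, `pred_box`, and the loss-dict keys are placeholders, not the exact identifiers used in this repo):

```python
import torch

# 1) Filter out images with no ground-truth boxes before training, so no rank
#    ever receives a batch made up only of background images.
def keep_samples_with_boxes(annotations):
    # `annotations`: assumed to be a list of dicts with a "boxes" entry per image.
    return [ann for ann in annotations if len(ann["boxes"]) > 0]

# 2) Or, in the loss computation: when there are no target boxes in the batch,
#    return zero box losses that are still connected to the predictions, so the
#    backward pass touches the same parameters on every rank.
def box_losses(pred_box: torch.Tensor, target_boxes: torch.Tensor) -> dict:
    if target_boxes.numel() == 0:
        zero = (pred_box * 0).sum()
        return {"loss_bbox": zero, "loss_giou": zero}
    # ... otherwise compute the usual L1 and GIoU losses ...
    raise NotImplementedError("sketch only")
```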
