Hello,
I was training RT-DETRv2 with multiple GPUs. The training got stuck when the batch on one GPU consisted entirely of background images (i.e., no ground-truth/target boxes): the volatile GPU-util went to 100%, no errors or warnings were reported, and training simply hung without making progress. In that case, the VFL loss is very small (0.0528), and the L1 and GIoU box losses are both 0. The hang happens in scaler.backward or scaler.step (or possibly scaler.update). When the batch contains at least some target boxes, training works fine. Incidentally, when training with the gloo backend, it hung in the same situation, and the following error was reported:
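For reference, here is a minimal sketch of one common workaround for this kind of DDP hang, where an all-background batch leaves some parameters without gradients on one rank and the gradient all-reduce desynchronizes. This is an assumption about the cause, not a confirmed fix; `model`, `loss_dict`, and `targets` are placeholders for a generic PyTorch training loop, not the actual RT-DETRv2 code:

```python
import torch

def loss_with_empty_target_guard(model, loss_dict, targets):
    """Sum the criterion's loss terms; if this rank has no GT boxes, add a
    zero-valued term that touches every trainable parameter so that DDP's
    gradient all-reduce sees the same parameter set on every rank.
    (Sketch only; names are hypothetical, not RT-DETRv2's internals.)"""
    total = sum(loss_dict.values())
    no_boxes = all(len(t["boxes"]) == 0 for t in targets)
    if no_boxes:
        # 0.0 * sum(params) contributes nothing numerically but keeps every
        # parameter in the autograd graph, so each rank reduces the same buckets.
        dummy = sum(p.sum() for p in model.parameters() if p.requires_grad)
        total = total + 0.0 * dummy
    return total
```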
Thanks for your attention!