
matcher occasionally raises an assert error: assert (boxes1[:, 2:] >= boxes1[:, :2]).all() #539

jeycechen opened this issue Jan 17, 2025 · 3 comments

@jeycechen
Describe the bug
Epoch: [24] [3700/4421] eta: 0:04:24 lr: 0.000010 loss: 19.0368 (17.4353) loss_vfl: 0.6064 (0.7137) loss_bbox: 0.0663 (0.1036) loss_giou: 0.5338 (0.5874) loss_vfl_aux_0: 0.7402 (0.8096) loss_bbox_aux_0: 0.0729 (0.1187) loss_giou_aux_0: 0.6301 (0.6288) loss_vfl_aux_1: 0.6860 (0.7895) loss_bbox_aux_1: 0.0646 (0.1089) loss_giou_aux_1: 0.5190 (0.6023) loss_vfl_aux_2: 0.6597 (0.7459) loss_bbox_aux_2: 0.0668 (0.1054) loss_giou_aux_2: 0.5340 (0.5916) loss_vfl_aux_3: 0.6401 (0.7211) loss_bbox_aux_3: 0.0681 (0.1042) loss_giou_aux_3: 0.5312 (0.5888) loss_vfl_aux_4: 0.6299 (0.7149) loss_bbox_aux_4: 0.0665 (0.1037) loss_giou_aux_4: 0.5331 (0.5876) loss_vfl_aux_5: 0.7495 (0.8039) loss_bbox_aux_5: 0.0975 (0.1549) loss_giou_aux_5: 0.7090 (0.7207) loss_vfl_dn_0: 0.5093 (0.5377) loss_bbox_dn_0: 0.0742 (0.1434) loss_giou_dn_0: 0.6288 (0.6582) loss_vfl_dn_1: 0.4673 (0.4842) loss_bbox_dn_1: 0.0617 (0.1143) loss_giou_dn_1: 0.5375 (0.5704) loss_vfl_dn_2: 0.4539 (0.4721) loss_bbox_dn_2: 0.0590 (0.1089) loss_giou_dn_2: 0.5288 (0.5569) loss_vfl_dn_3: 0.4536 (0.4660) loss_bbox_dn_3: 0.0586 (0.1078) loss_giou_dn_3: 0.5254 (0.5546) loss_vfl_dn_4: 0.4468 (0.4654) loss_bbox_dn_4: 0.0587 (0.1077) loss_giou_dn_4: 0.5274 (0.5548) loss_vfl_dn_5: 0.4519 (0.4665) loss_bbox_dn_5: 0.0587 (0.1078) loss_giou_dn_5: 0.5290 (0.5554) time: 0.3496 data: 0.0037 max mem: 17856
Epoch: [24] [3800/4421] eta: 0:03:48 lr: 0.000010 loss: 16.5786 (17.4315) loss_vfl: 0.7236 (0.7142) loss_bbox: 0.0596 (0.1032) loss_giou: 0.3736 (0.5870) loss_vfl_aux_0: 0.7646 (0.8101) loss_bbox_aux_0: 0.0631 (0.1182) loss_giou_aux_0: 0.3962 (0.6285) loss_vfl_aux_1: 0.7769 (0.7899) loss_bbox_aux_1: 0.0598 (0.1085) loss_giou_aux_1: 0.3695 (0.6020) loss_vfl_aux_2: 0.7441 (0.7465) loss_bbox_aux_2: 0.0590 (0.1050) loss_giou_aux_2: 0.3621 (0.5912) loss_vfl_aux_3: 0.7363 (0.7217) loss_bbox_aux_3: 0.0585 (0.1037) loss_giou_aux_3: 0.3685 (0.5884) loss_vfl_aux_4: 0.7446 (0.7157) loss_bbox_aux_4: 0.0583 (0.1033) loss_giou_aux_4: 0.3765 (0.5872) loss_vfl_aux_5: 0.7554 (0.8046) loss_bbox_aux_5: 0.0830 (0.1542) loss_giou_aux_5: 0.4614 (0.7201) loss_vfl_dn_0: 0.5054 (0.5377) loss_bbox_dn_0: 0.0667 (0.1430) loss_giou_dn_0: 0.5499 (0.6583) loss_vfl_dn_1: 0.4458 (0.4842) loss_bbox_dn_1: 0.0603 (0.1139) loss_giou_dn_1: 0.4702 (0.5704) loss_vfl_dn_2: 0.4360 (0.4722) loss_bbox_dn_2: 0.0594 (0.1086) loss_giou_dn_2: 0.4544 (0.5569) loss_vfl_dn_3: 0.4346 (0.4661) loss_bbox_dn_3: 0.0594 (0.1074) loss_giou_dn_3: 0.4574 (0.5545) loss_vfl_dn_4: 0.4363 (0.4655) loss_bbox_dn_4: 0.0593 (0.1074) loss_giou_dn_4: 0.4564 (0.5548) loss_vfl_dn_5: 0.4355 (0.4665) loss_bbox_dn_5: 0.0592 (0.1074) loss_giou_dn_5: 0.4564 (0.5554) time: 0.3734 data: 0.0040 max mem: 17856
Epoch: [24] [3900/4421] eta: 0:03:11 lr: 0.000010 loss: 18.1389 (17.4234) loss_vfl: 0.6440 (0.7142) loss_bbox: 0.0591 (0.1030) loss_giou: 0.5909 (0.5868) loss_vfl_aux_0: 0.6987 (0.8103) loss_bbox_aux_0: 0.0666 (0.1180) loss_giou_aux_0: 0.6362 (0.6282) loss_vfl_aux_1: 0.7231 (0.7900) loss_bbox_aux_1: 0.0646 (0.1082) loss_giou_aux_1: 0.5919 (0.6018) loss_vfl_aux_2: 0.6631 (0.7463) loss_bbox_aux_2: 0.0645 (0.1048) loss_giou_aux_2: 0.5722 (0.5911) loss_vfl_aux_3: 0.6465 (0.7217) loss_bbox_aux_3: 0.0597 (0.1036) loss_giou_aux_3: 0.5741 (0.5882) loss_vfl_aux_4: 0.6460 (0.7159) loss_bbox_aux_4: 0.0596 (0.1031) loss_giou_aux_4: 0.5903 (0.5870) loss_vfl_aux_5: 0.7163 (0.8045) loss_bbox_aux_5: 0.0816 (0.1540) loss_giou_aux_5: 0.6991 (0.7196) loss_vfl_dn_0: 0.5063 (0.5375) loss_bbox_dn_0: 0.0745 (0.1428) loss_giou_dn_0: 0.6293 (0.6578) loss_vfl_dn_1: 0.4663 (0.4840) loss_bbox_dn_1: 0.0632 (0.1137) loss_giou_dn_1: 0.5524 (0.5699) loss_vfl_dn_2: 0.4607 (0.4720) loss_bbox_dn_2: 0.0550 (0.1083) loss_giou_dn_2: 0.5654 (0.5564) loss_vfl_dn_3: 0.4641 (0.4659) loss_bbox_dn_3: 0.0548 (0.1072) loss_giou_dn_3: 0.5649 (0.5541) loss_vfl_dn_4: 0.4619 (0.4652) loss_bbox_dn_4: 0.0547 (0.1072) loss_giou_dn_4: 0.5631 (0.5543) loss_vfl_dn_5: 0.4612 (0.4663) loss_bbox_dn_5: 0.0547 (0.1072) loss_giou_dn_5: 0.5624 (0.5549) time: 0.3582 data: 0.0037 max mem: 17856
Traceback (most recent call last):
  File "tools/train.py", line 51, in
    main(args)
  File "tools/train.py", line 37, in main
    solver.fit()
  File "/home/amax/ckl/OI-RT-DETR/oi-rtdetr-pytorch/tools/../src/solver/det_solver.py", line 37, in fit
    train_stats = train_one_epoch(
  File "/home/amax/ckl/OI-RT-DETR/oi-rtdetr-pytorch/tools/../src/solver/det_engine.py", line 46, in train_one_epoch
    loss_dict = criterion(outputs, targets)
  File "/home/amax/miniconda3/envs/rtdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/amax/ckl/OI-RT-DETR/oi-rtdetr-pytorch/tools/../src/zoo/rtdetr/rtdetr_criterion.py", line 238, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/home/amax/miniconda3/envs/rtdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/amax/miniconda3/envs/rtdetr/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/amax/ckl/OI-RT-DETR/oi-rtdetr-pytorch/tools/../src/zoo/rtdetr/matcher.py", line 99, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/home/amax/ckl/OI-RT-DETR/oi-rtdetr-pytorch/tools/../src/zoo/rtdetr/box_ops.py", line 52, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError

Hello, sorry to bother you. As the traceback shows, the failure happens at the line box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox), when converting boxes from cxcywh to xyxy. But at this point training has already run for 24 epochs, so tgt_bbox should not be failing the conversion. That suggests out_bbox contains a box that violates the xyxy invariant (the second corner must be greater than or equal to the first), i.e. the model's predicted cxcywh output contains negative values???
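The invariant being asserted can be checked in isolation. Below is a minimal sketch assuming the standard DETR-style box_cxcywh_to_xyxy conversion (I have not copied the repo's box_ops.py verbatim); it shows how a negative predicted width trips the assert, and why a NaN box does too:

```python
import torch

def box_cxcywh_to_xyxy(b: torch.Tensor) -> torch.Tensor:
    # Standard DETR-style conversion: (cx, cy, w, h) -> (x1, y1, x2, y2)
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - 0.5 * w, cy - 0.5 * h,
                        cx + 0.5 * w, cy + 0.5 * h], dim=-1)

# A negative predicted width makes x2 < x1, so the assert fails
bad = box_cxcywh_to_xyxy(torch.tensor([[0.5, 0.5, -0.1, 0.2]]))
print((bad[:, 2:] >= bad[:, :2]).all())   # tensor(False)

# A NaN box also fails, because (nan >= nan) evaluates to False
nan = box_cxcywh_to_xyxy(torch.tensor([[float('nan')] * 4]))
print((nan[:, 2:] >= nan[:, :2]).all())   # tensor(False)
```

Note that the second case means the assert will fire even when no coordinate is literally negative: any NaN in out_bbox is enough.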

To Reproduce
Hard to reproduce; it is an intermittent error. Sometimes, after training fails on this assert in a given epoch, resuming with the --resume option lets that same epoch complete. For example, epoch 24 failed as shown above, but after resuming from the epoch-23 checkpoint, epoch 24 trained through without error. I'm not sure how to mitigate or fix this. If you have time, any suggestions would be much appreciated, thanks!

@lyuwenyu
Owner

lyuwenyu commented Jan 17, 2025

That's a bit odd. You could wrap it in a try and print the actual values; you can also apply a clip operation to out_bbox to limit its range.
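The suggestion above could look roughly like this: a sketch (not the repo's code) that prints any non-finite predicted boxes and clips out_bbox before the GIoU cost is computed, assuming normalized (cx, cy, w, h) coordinates:

```python
import torch

def sanitize_out_bbox(out_bbox: torch.Tensor) -> torch.Tensor:
    """Debug-print and clip predicted boxes before the GIoU cost.

    out_bbox: [num_queries, 4] in (cx, cy, w, h), assumed normalized
    to [0, 1]. This is an illustrative helper, not part of the repo.
    """
    bad = ~torch.isfinite(out_bbox).all(dim=-1)
    if bad.any():
        # surface the offending values, then replace NaN/Inf
        print("non-finite predicted boxes:", out_bbox[bad])
        out_bbox = torch.nan_to_num(out_bbox, nan=0.0, posinf=1.0, neginf=0.0)
    # clip centers to [0, 1] and force strictly positive width/height
    cxcy = out_bbox[..., :2].clamp(0.0, 1.0)
    wh = out_bbox[..., 2:].clamp(min=1e-6, max=1.0)
    return torch.cat([cxcy, wh], dim=-1)
```

This only masks the symptom at the matcher; if the predictions are NaN, the underlying numerical issue in training still needs to be found.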

@jeycechen
Author

Got it, thank you very much!

@jeycechen
Author

> That's a bit odd. You could wrap it in a try and print the actual values; you can also apply a clip operation to out_bbox to limit its range.

Sorry to bother you again. I tried this, and it turns out one batch's output is a [300, 4] tensor that is all NaN. Do you have any suggestions for this? Much appreciated!
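For intermittent NaN predictions, a common generic mitigation (not something from this repo) is to clip gradients and skip the optimizer step whenever the loss goes non-finite, so one bad batch cannot poison the weights. A sketch; the 0.1 max-norm is only an illustrative value:

```python
import torch

def safe_step(loss: torch.Tensor, model: torch.nn.Module,
              optimizer: torch.optim.Optimizer, max_norm: float = 0.1) -> bool:
    """Skip the update on non-finite loss, clip gradients otherwise.

    Returns True if an optimizer step was taken. Illustrative helper,
    not the repo's training loop.
    """
    if not torch.isfinite(loss):
        # drop this batch entirely: no backward, no step
        optimizer.zero_grad(set_to_none=True)
        return False
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return True
```

If the NaNs persist, the usual suspects are worth checking too: degenerate ground-truth boxes (zero width/height) in the dataset, mixed-precision overflow if AMP is enabled, or a learning-rate spike.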
