Difficulty Reproducing Results for Grounded3DLLM #10

Samir55 · 2025-01-05T12:59:45Z

Dear authors,

Thank you for your excellent work and for making your resources available to the community!

I have been attempting to reproduce the results reported in your paper but have encountered some difficulties. I would greatly appreciate your guidance on a few points:

1. Evaluation of Stage 3 Checkpoint

I attempted to evaluate the provided Stage 3 checkpoint. However, the required scene features and detection folders were not included within the uploaded checkpoint, preventing me from running the evaluation. Could you provide instructions or the necessary resources to evaluate the Stage 3 checkpoint?

2. Retraining Stage 3 Model

I tried retraining the Stage 3 model using the provided Stage 1 and Stage 2 checkpoints but achieved results significantly lower than those reported in the paper, particularly on ScanRefer, Multi3DRef, Scan2Cap, and ScanNet200.

To retrain, I placed the original Stage 1 and Stage 2 checkpoints (from the repository) into the saved folder and ran the following command:

final_scripts/step3_train_grounded3dllm.sh

I conducted this training on a machine with 4 A100 80GB GPUs.

3. Evaluation Results

After running the evaluation script using:

final_scripts/eval_llm.sh ./saved/step3_mask3d_lang_4GPUS/01-03-14-19-41

I obtained the following results (summarized below for key metrics):
m3drefer official evaluator:
IoU 0.25: Overall = 12.4
IoU 0.50: Overall = 11.5
ScanRefer Evaluation:
Overall BBox Top-1 IoU 0.25: 0.1088
Overall BBox Top-1 IoU 0.50: 0.0955
Scan2Cap:
CIDEr: 0.308 (IoU 0.25), 0.258 (IoU 0.50)
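
For anyone who wants to compare runs, the ScanRefer figures above can be extracted from the full evaluation log with a few lines of Python. This is only a sketch: it assumes the log keeps the layout shown later in this post, where each `==== scanrefer ... ====` header is followed by the value on the next line. The function name `parse_scanrefer` is my own and is not part of the repository.

```python
import re

# Matches headers such as " ==== scanrefer overall bbox top1 0.5 ====".
# The section banner made only of "=" signs does not match this pattern.
HEADER = re.compile(r"^\s*====\s*(scanrefer .+?)\s*====\s*$")

def parse_scanrefer(log_text):
    """Map each ScanRefer header to the float printed on the following line."""
    metrics = {}
    pending = None  # header waiting for its value
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match:
            pending = match.group(1)
            continue
        if pending is not None:
            try:
                metrics[pending] = float(line.strip())
            except ValueError:
                pass  # the next line was not a number; skip this header
            pending = None
    return metrics

# Two entries copied from the log in this post.
sample = """ ==== scanrefer overall bbox top1 0.5 ====
0.09549530085340824
 ==== scanrefer overall bbox top1 0.25 ====
0.10878254294047747
"""

metrics = parse_scanrefer(sample)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, these two entries give the 0.0955 and 0.1088 quoted in the summary.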

Although I followed the provided steps and used a similar training environment, my results do not match those reported in the paper.

Could you confirm if there are additional steps or configurations I might have missed? Alternatively, are there any known issues or dependencies that could affect the reproducibility?

Thank you for your time and assistance!

This is the complete results log from running the following command:
bash final_scripts/eval_llm.sh ./saved/step3_mask3d_lang_4GPUS/01-03-14-19-41


================= m3drefer official evaluator ================= 
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10855/10855 [00:01<00:00, 6196.95it/s]
====================================================================================================
IoU         zt_w_d      zt_wo_d     st_w_d      st_wo_d     mt          overall     
----------------------------------------------------------------------------------------------------
0.25        38.6        57.2        8.4         9.2         10.1        12.4        
0.50        38.6        57.2        7.4         7.6         9.3         11.5        
====================================================================================================

 =============================================================== 
evaluating 312 scans...
scans processed: 17
scans processed: 312

################################################################
what           :             AP         AP_50%         AP_25%
################################################################
chair          :          0.075          0.128          0.202
table          :          0.004          0.005          0.005
door           :          0.045          0.066          0.095
couch          :          0.031          0.079          0.223
cabinet        :          0.002          0.006          0.010
shelf          :          0.004          0.012          0.014
desk           :          0.000          0.000          0.000
office chair   :          0.030          0.042          0.053
bed            :          0.000          0.000          0.000
pillow         :          0.027          0.041          0.053
sink           :          0.086          0.153          0.282
picture        :          0.001          0.001          0.001
window         :          0.000          0.001          0.001
toilet         :          0.250          0.350          0.563
bookshelf      :          0.000          0.001          0.015
monitor        :          0.099          0.141          0.146
curtain        :          0.000          0.000          0.000
book           :          0.007          0.009          0.013
armchair       :          0.029          0.032          0.048
coffee table   :          0.003          0.006          0.006
box            :          0.005          0.011          0.017
refrigerator   :          0.000          0.000          0.000
lamp           :          0.000          0.000          0.000
kitchen cabinet:          0.010          0.020          0.055
towel          :          0.013          0.037          0.047
clothes        :          0.001          0.003          0.003
tv             :          0.000          0.000          0.000
nightstand     :          0.001          0.007          0.007
counter        :          0.000          0.000          0.000
dresser        :          0.003          0.005          0.020
stool          :          0.000          0.000          0.000
cushion        :          0.000          0.000          0.000
plant          :          0.000          0.000          0.000
ceiling        :          0.004          0.005          0.005
bathtub        :          0.093          0.203          0.268
end table      :          0.007          0.010          0.019
dining table   :          0.000          0.000          0.000
keyboard       :          0.020          0.020          0.020
bag            :          0.000          0.000          0.000
backpack       :          0.138          0.211          0.267
toilet paper   :          0.003          0.005          0.037
printer        :          0.000          0.000          0.000
tv stand       :          0.000          0.000          0.000
whiteboard     :          0.000          0.000          0.000
blanket        :          0.000          0.000          0.000
shower curtain :          0.000          0.000          0.000
trash can      :          0.139          0.200          0.267
closet         :          0.000          0.000          0.000
stairs         :          0.000          0.000          0.000
microwave      :          0.000          0.000          0.000
stove          :          0.000          0.000          0.000
shoe           :          0.013          0.013          0.025
computer tower :          0.156          0.187          0.199
bottle         :          0.000          0.000          0.000
bin            :          0.000          0.000          0.000
ottoman        :          0.000          0.000          0.000
bench          :          0.000          0.000          0.056
board          :          0.000          0.000          0.000
washing machine:          0.000          0.000          0.000
mirror         :          0.000          0.001          0.035
copier         :          0.000          0.000          0.000
basket         :          0.000          0.000          0.000
sofa chair     :          0.032          0.060          0.071
file cabinet   :          0.021          0.034          0.056
fan            :          0.000          0.000          0.000
laptop         :          0.000          0.000          0.000
shower         :          0.000          0.000          0.000
paper          :          0.000          0.000          0.000
person         :          0.000          0.000          0.000
paper towel dispenser:          0.000          0.000          0.000
oven           :          0.000          0.000          0.000
blinds         :          0.000          0.000          0.000
rack           :          0.000          0.000          0.000
plate          :          0.000          0.000          0.000
blackboard     :          0.000          0.000          0.000
piano          :          0.000          0.000          0.000
suitcase       :          0.067          0.080          0.080
rail           :          0.000          0.000          0.000
radiator       :          0.174          0.269          0.351
recycling bin  :          0.002          0.002          0.014
container      :          0.000          0.000          0.000
wardrobe       :          0.000          0.000          0.000
soap dispenser :          0.000          0.000          0.000
telephone      :          0.000          0.000          0.000
bucket         :          0.000          0.000          0.000
clock          :          0.000          0.000          0.000
stand          :          0.000          0.000          0.000
light          :          0.000          0.000          0.000
laundry basket :          0.000          0.000          0.000
pipe           :          0.000          0.000          0.000
clothes dryer  :          0.000          0.000          0.000
guitar         :          0.000          0.000          0.000
toilet paper holder:          0.000          0.000          0.000
seat           :          0.000          0.000          0.000
speaker        :          0.000          0.000          0.000
column         :          0.000          0.000          0.000
bicycle        :            nan            nan            nan
ladder         :          0.000          0.000          0.000
bathroom stall :          0.000          0.000          0.000
shower wall    :          0.000          0.000          0.000
cup            :          0.000          0.000          0.000
jacket         :          0.000          0.000          0.000
storage bin    :          0.000          0.000          0.000
coffee maker   :          0.000          0.000          0.000
dishwasher     :          0.000          0.000          0.000
paper towel roll:          0.000          0.000          0.000
machine        :          0.000          0.000          0.000
mat            :          0.000          0.000          0.000
windowsill     :          0.000          0.000          0.000
bar            :          0.000          0.000          0.000
toaster        :          0.000          0.000          0.000
bulletin board :          0.000          0.000          0.000
ironing board  :          0.000          0.000          0.000
fireplace      :          0.000          0.000          0.000
soap dish      :          0.000          0.000          0.000
kitchen counter:          0.000          0.000          0.000
doorframe      :          0.001          0.003          0.008
toilet paper dispenser:          0.000          0.000          0.000
mini fridge    :          0.000          0.000          0.000
fire extinguisher:          0.000          0.000          0.000
ball           :          0.000          0.000          0.000
hat            :          0.000          0.000          0.000
shower curtain rod:          0.000          0.000          0.000
water cooler   :          0.000          0.000          0.000
paper cutter   :          0.000          0.000          0.000
tray           :          0.000          0.000          0.000
shower door    :          0.000          0.000          0.000
pillar         :          0.000          0.000          0.000
ledge          :          0.000          0.000          0.000
toaster oven   :          0.000          0.000          0.000
mouse          :            nan            nan            nan
toilet seat cover dispenser:          0.000          0.000          0.000
furniture      :          0.000          0.000          0.000
cart           :          0.000          0.000          0.000
storage container:            nan            nan            nan
scale          :          0.000          0.000          0.000
tissue box     :          0.000          0.000          0.000
light switch   :            nan            nan            nan
crate          :          0.000          0.000          0.000
power outlet   :          0.000          0.000          0.000
decoration     :          0.000          0.000          0.000
sign           :          0.000          0.000          0.000
projector      :          0.000          0.000          0.000
closet door    :          0.015          0.094          0.137
vacuum cleaner :          0.000          0.000          0.000
candle         :            nan            nan            nan
plunger        :          0.000          0.000          0.000
stuffed animal :          0.000          0.000          0.000
headphones     :          0.000          0.000          0.000
dish rack      :          0.000          0.000          0.000
broom          :          0.000          0.000          0.000
guitar case    :            nan            nan            nan
range hood     :          0.000          0.000          0.000
dustpan        :          0.000          0.000          0.000
hair dryer     :          0.000          0.000          0.000
water bottle   :          0.000          0.000          0.000
handicap bar   :          0.000          0.000          0.000
purse          :            nan            nan            nan
vent           :          0.000          0.000          0.000
shower floor   :          0.000          0.000          0.000
water pitcher  :          0.000          0.000          0.000
mailbox        :          0.000          0.000          0.000
bowl           :          0.000          0.000          0.000
paper bag      :          0.000          0.000          0.000
alarm clock    :            nan            nan            nan
music stand    :            nan            nan            nan
projector screen:          0.000          0.000          0.000
divider        :          0.000          0.000          0.000
laundry detergent:          0.000          0.000          0.000
bathroom counter:          0.000          0.000          0.000
object         :          0.000          0.000          0.000
bathroom vanity:          0.000          0.000          0.042
closet wall    :          0.000          0.000          0.000
laundry hamper :          0.000          0.000          0.000
bathroom stall door:          0.008          0.008          0.008
ceiling light  :          0.000          0.000          0.000
trash bin      :          0.021          0.021          0.021
dumbbell       :          0.000          0.000          0.000
stair rail     :          0.000          0.000          0.000
tube           :          0.000          0.000          0.000
bathroom cabinet:          0.000          0.000          0.042
cd case        :            nan            nan            nan
closet rod     :          0.000          0.000          0.000
coffee kettle  :          0.000          0.000          0.000
structure      :            nan            nan            nan
shower head    :          0.000          0.000          0.000
keyboard piano :          0.000          0.000          0.000
case of water bottles:          0.000          0.000          0.000
coat rack      :          0.000          0.000          0.000
storage organizer:            nan            nan            nan
folded chair   :          0.000          0.000          0.000
fire alarm     :            nan            nan            nan
power strip    :            nan            nan            nan
calendar       :          0.000          0.000          0.000
poster         :          0.000          0.000          0.000
potted plant   :          0.000          0.000          0.000
luggage        :            nan            nan            nan
mattress       :          0.000          0.000          0.000
----------------------------------------------------------------
average        :          0.009          0.014          0.021

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 626/626 [00:01<00:00, 487.28it/s]
 ======================== scanrefer =========================
 ==== scanrefer unique bbox top1 0.5 ====
0.13495934959349593
 ==== scanrefer unique bbox top1 0.25 ====
0.15501355013550136
 ==== scanrefer multi bbox top1 0.5 ====
0.08567188343227199
 ==== scanrefer multi bbox top1 0.25 ====
0.09727468969239071
 ==== scanrefer overall bbox top1 0.5 ====
0.09549530085340824
 ==== scanrefer overall bbox top1 0.25 ====
0.10878254294047747
 ==== scanrefer unique mask top1 0.5 ====
0.12953929539295392
 ==== scanrefer unique mask top1 0.25 ====
0.15013550135501355
 ==== scanrefer multi mask top1 0.5 ====
0.08472746896923908
 ==== scanrefer multi mask top1 0.25 ====
0.10037776578521317
 ==== scanrefer overall mask top1 0.5 ====
0.09365885276007346
 ==== scanrefer overall mask top1 0.25 ====
0.11029491195851787
 =============================================================
 ========================= m3drefer ==========================
 each type amount: st_wo_d 2059 st_w_d 5169 mt 2721 zt_wo_d 528 zt_w_d 378
==== m3dref st_wo_d_50/25 ====
iou 50 0.07479752262982373
iou 25 0.09051929490233444
==== m3dref st_w_d_50/25 ====
iou 50 0.07225975975975976
iou 25 0.08183183183183183
==== m3dref mt_50/25 ====
iou 50 0.09184740060144865
iou 25 0.09944050873862648
==== m3dref zt_wo_d_50/25 ====
iou 50 0.4810606060606061
iou 25 0.4810606060606061
 threshold 0.15
iou 50 0.803030303030303
iou 25 0.803030303030303
==== m3dref zt_w_d_50/25 ====
iou 50 0.2671957671957672
iou 25 0.2671957671957672
 threshold 0.15
iou 50 0.5555555555555556
iou 25 0.5555555555555556
==== m3dref all ====
iou 50 0.10371715811164883
iou 25 0.11317921393980022
 =============================================================
 ============================= statistics ==============================
 ==== bbox f1 0.25 ====
length of query:61776
detection: 0.8528181239402511
length of query:1845
scanrefer:unique: 0.15501355013550136
length of query:7412
scanrefer:multiple: 0.09727468969239071
length of query:9257
scanrefer:overall: 0.10878254294047747
length of query:10855
m3dref:overall: 0.11562943183716186
 ==== bbox f1 0.50 ====
length of query:61776
detection: 0.8513338225833148
length of query:1845
scanrefer:unique: 0.13495934959349593
length of query:7412
scanrefer:multiple: 0.08567188343227199
length of query:9257
scanrefer:overall: 0.09549530085340824
length of query:10855
m3dref:overall: 0.1059625318708606
precision if grounding: 0.2607723940640086, recall if grounding: 0.6757007180912671
gt probability of det 0.06988150738150738
pred probability of det 0.18107355607355607
f1 0.843482905982906
 =========================================================== 
 ============================================================= 
 ========================= scan 2 cap ======================== 
scan2cap test length: 2068
 ================= scan2cap 0.25 ====================== 
PTBTokenizer tokenized 219199 tokens at 1221691.11 tokens per second.
PTBTokenizer tokenized 19358 tokens at 198907.33 tokens per second.
{'testlen': 15499, 'reflen': 30668, 'guess': [15499, 13431, 11363, 10467], 'correct': [13045, 6275, 3611, 2102]}
ratio: 0.5053802008608157
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/home/abdeas0a/miniconda3/envs/grounded3dllm/lib/python3.9/site-packages/pycocoevalcap/spice/lib/fst-2.47.jar) to field java.lang.String.value
WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Parsing reference captions
Parsing test captions
Warning: Nashorn engine is planned to be removed from a future JDK release
SPICE evaluation took: 2.511 s
Bleu_1:      0.3162953370710233
Bleu_2:      0.23565425041804586
Bleu_3:      0.18787938948550467
Bleu_4:      0.1495719215793791
METEOR:      0.1828888502388118
ROUGE_L:      0.39038117609351
CIDEr:      0.3081693452506969
SPICE:      0.11733765156274882
 ================= scan2cap 0.50 ====================== 
PTBTokenizer tokenized 219199 tokens at 1168137.52 tokens per second.
PTBTokenizer tokenized 17020 tokens at 187201.50 tokens per second.
{'testlen': 13483, 'reflen': 30490, 'guess': [13483, 11415, 9347, 8612], 'correct': [11491, 5250, 3056, 1779]}
ratio: 0.442210560839605
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/home/abdeas0a/miniconda3/envs/grounded3dllm/lib/python3.9/site-packages/pycocoevalcap/spice/lib/fst-2.47.jar) to field java.lang.String.value
WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Parsing reference captions
Parsing test captions
Warning: Nashorn engine is planned to be removed from a future JDK release
SPICE evaluation took: 2.493 s
Bleu_1:      0.24141648652743028
Bleu_2:      0.17734664369806574
Bleu_3:      0.1428151334215214
Bleu_4:      0.11426087957894697
METEOR:      0.16977839021228439
ROUGE_L:      0.36241497535772377
CIDEr:      0.25847976870749806
SPICE:      0.11039217099177853
 ========================= scan qa ===========================
scanqa val length: 4675
PTBTokenizer tokenized 32429 tokens at 329347.22 tokens per second.
PTBTokenizer tokenized 13763 tokens at 148538.16 tokens per second.
{'testlen': 9089, 'reflen': 9445, 'guess': [9089, 4414, 2230, 1019], 'correct': [3772, 804, 221, 40]}
ratio: 0.9623080995234555
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/home/abdeas0a/miniconda3/envs/grounded3dllm/lib/python3.9/site-packages/pycocoevalcap/spice/lib/fst-2.47.jar) to field java.lang.String.value
WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Parsing reference captions
Parsing test captions
Warning: Nashorn engine is planned to be removed from a future JDK release
SPICE evaluation took: 1.607 s
Bleu_1:      0.3990662835543671
Bleu_2:      0.2643803091930449
Bleu_3:      0.18815323639005468
Bleu_4:      0.12592228200248906
METEOR:      0.1576597485286668
ROUGE_L:      0.37902716785220414
CIDEr:      0.7609764049226607
SPICE:      0.16850612447215152
EM:         0.20663101604278075
refined EM: 0.34994652406417115
 ============================================================= 
 ======================= obj description ===================== 
objdesc test length: 7912
PTBTokenizer tokenized 100186 tokens at 593109.05 tokens per second.
PTBTokenizer tokenized 85376 tokens at 605087.19 tokens per second.
{'testlen': 67867, 'reflen': 80634, 'guess': [67867, 59955, 52043, 44131], 'correct': [26900, 9370, 3853, 1839]}
ratio: 0.841667286752476
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/home/abdeas0a/miniconda3/envs/grounded3dllm/lib/python3.9/site-packages/pycocoevalcap/spice/lib/fst-2.47.jar) to field java.lang.String.value
WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Parsing reference captions
Parsing test captions
Warning: Nashorn engine is planned to be removed from a future JDK release
SPICE evaluation took: 3.113 s
Bleu_1:      0.3283938733846164
Bleu_2:      0.2062079116470805
Bleu_3:      0.13765212407797042
Bleu_4:      0.09741414714857334
METEOR:      0.12246051649562131
ROUGE_L:      0.33734877603871993
CIDEr:      0.6474211180138565
SPICE:      0.12877115203435519
 ======================= dialog description ===================== 
dialog test length: 770
PTBTokenizer tokenized 18097 tokens at 179988.98 tokens per second.
PTBTokenizer tokenized 14303 tokens at 145476.88 tokens per second.
{'testlen': 11983, 'reflen': 15572, 'guess': [11983, 11213, 10443, 9673], 'correct': [5521, 2371, 1153, 594]}
ratio: 0.7695222193680471
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/home/abdeas0a/miniconda3/envs/grounded3dllm/lib/python3.9/site-packages/pycocoevalcap/spice/lib/fst-2.47.jar) to field java.lang.String.value
WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Parsing reference captions
Parsing test captions
Warning: Nashorn engine is planned to be removed from a future JDK release
SPICE evaluation took: 1.072 s
Bleu_1:      0.34148975131370496
Bleu_2:      0.23134304431570302
Bleu_3:      0.163611616720436
Bleu_4:      0.11882239122081928
METEOR:      0.15386677920043323
ROUGE_L:      0.38497540743245096
CIDEr:      1.0050675869898922
SPICE:      0.22733485287560512
 ======================= planning description ===================== 
planning test length: 39
PTBTokenizer tokenized 1772 tokens at 29131.57 tokens per second.
PTBTokenizer tokenized 1913 tokens at 35543.62 tokens per second.
{'testlen': 1659, 'reflen': 1553, 'guess': [1659, 1620, 1581, 1542], 'correct': [610, 241, 113, 66]}
ratio: 1.0682549903405871
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/home/abdeas0a/miniconda3/envs/grounded3dllm/lib/python3.9/site-packages/pycocoevalcap/spice/lib/fst-2.47.jar) to field java.lang.String.value
WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Parsing reference captions
Parsing test captions
Warning: Nashorn engine is planned to be removed from a future JDK release
SPICE evaluation took: 756.2 ms
Bleu_1:      0.3676913803493866
Bleu_2:      0.23387981331580165
Bleu_3:      0.15753510624625758
Bleu_4:      0.11373598760958994
METEOR:      0.17724084087858002
ROUGE_L:      0.3246052525887195
CIDEr:      0.6054020345146574
SPICE:      0.12506767762822837


chenyilun95 · 2025-01-05T15:05:58Z

Thanks for your interest!

1. Evaluating the checkpoints requires no scene features; these are needed only for the Gradio demo and can be obtained by running the inference script. It looks like the inference script was missing from the step 3 scripts. I have added it to the step-3 training script (it just sets train_mode=False). You can run the step 3 inference command first and then evaluate the results.

2/3. For training, make sure the paths to the pre-trained checkpoints (the "general.checkpoint" path in the script) are correct. Your results show that the pre-trained CLASP detector (step 2) scores ~0, which must be incorrect. I am not sure which part of the loading process goes wrong; perhaps you can test the step 2 results (CLASP detector) to confirm that the pre-trained detector is loaded correctly.
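
One possible way to check the loading step is to compare the parameter names stored in the checkpoint with the names the model expects. The sketch below is not code from this repository. The helper `compare_keys`, the `model.` prefix, and the example key names are all hypothetical and only illustrate the idea.

```python
def compare_keys(checkpoint_keys, model_keys, prefix=""):
    """Return (missing, unexpected) parameter names.

    missing:    names the model expects but the checkpoint lacks
    unexpected: names in the checkpoint that the model does not have
    An optional prefix is stripped from checkpoint names first.
    """
    stripped = {
        key[len(prefix):] if prefix and key.startswith(prefix) else key
        for key in checkpoint_keys
    }
    expected = set(model_keys)
    missing = sorted(expected - stripped)
    unexpected = sorted(stripped - expected)
    return missing, unexpected

# Hypothetical names for illustration only. With PyTorch, the two lists would
# typically come from torch.load(path, map_location="cpu") and from
# model.state_dict().keys(). Whether the weights sit under a "state_dict"
# entry in the checkpoint is an assumption to verify on the actual file.
checkpoint_keys = ["model.backbone.weight", "model.mask_head.weight"]
model_keys = ["backbone.weight", "mask_head.weight", "llm.proj.weight"]

missing, unexpected = compare_keys(checkpoint_keys, model_keys, prefix="model.")
print("missing from checkpoint:", missing)
print("unexpected in checkpoint:", unexpected)
```

If most detector parameters show up as missing, the weights were probably never loaded, which would be consistent with a detector score near zero. PyTorch's `load_state_dict(..., strict=False)` reports the same two lists through its return value.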

Let me know what else you find while reproducing the results.
