Most of this section is the same as RVT's setup. We also rely heavily on a personal package, nerv, for utility functions.
Please use Anaconda for package management.
```bash
conda create -y -n leod python=3.9
conda activate leod
conda install -y pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
python -m pip install tqdm numba hdf5plugin h5py==3.8.0 \
    pandas==1.5.3 plotly==5.13.1 opencv-python==4.6.0.66 tabulate==0.9.0 \
    pycocotools==2.0.6 bbox-visualizer==0.1.0 StrEnum==0.4.10 \
    hydra-core==1.3.2 einops==0.6.0 \
    pytorch-lightning==1.8.6 wandb==0.14.0 torchdata==0.6.0
conda install -y blosc-hdf5-plugin -c conda-forge

# install nerv: https://github.com/Wuziyi616/nerv
git clone [email protected]:Wuziyi616/nerv.git
cd nerv
git checkout v0.4.0  # tested with v0.4.0 release
pip install -e .
cd ..  # go back to the root directory of the project

# (Optional) compile Detectron2 for faster evaluation
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
```
We also provide an `environment.yml` file that lists all the packages in our final environment; you can also create the environment directly from it with `conda env create -f environment.yml`.
I sometimes encounter a weird bug where Detectron2 cannot run on a GPU model different from the one I compile it on (e.g., if I compile it on RTX6000 GPUs, I cannot use it on A40 GPUs). To avoid this issue, go to `coco_eval.py` and set `compile_gpu` to the GPU you compiled it on (the program will not import Detectron2 when it detects a different GPU in use).
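The guard can look roughly like the sketch below. This is a minimal illustration only: the import path for Detectron2's fast evaluator is real, but the surrounding logic is an assumption, not the exact code in `coco_eval.py`:

```python
# Minimal sketch of the GPU guard (an assumption, not the exact code in coco_eval.py).
import torch
from pycocotools.cocoeval import COCOeval  # pure-Python fallback evaluator

compile_gpu = 'Quadro RTX 6000'  # set to the GPU you compiled Detectron2 on

if torch.cuda.is_available() and compile_gpu in torch.cuda.get_device_name(0):
    # Only import the compiled (faster) evaluator on the GPU it was built for.
    from detectron2.evaluation.fast_eval_api import COCOeval_opt as COCOeval  # noqa: F811
```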
In this project, we use two datasets: Gen1 and 1Mpx.
Following the convention of RVT, we name Gen1 as `gen1` and 1Mpx as `gen4` (after the cameras used to capture them).
Please download the pre-processed datasets from RVT:
| | 1 Mpx | Gen1 |
|---|---|---|
| pre-processed dataset | download | download |
| crc32 | c5ec7c38 | 5acab6f3 |
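You can check the downloads against the crc32 values above with a short Python script. The archive file name below is a placeholder; point it at whatever you actually downloaded:

```python
# Compute the crc32 of a downloaded archive to compare with the table above.
import zlib

def file_crc32(path, chunk_size=1 << 20):
    crc = 0
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):  # stream in 1 MB chunks
            crc = zlib.crc32(chunk, crc)
    return f'{crc:08x}'

print(file_crc32('gen1.tar'))  # placeholder file name; expect 5acab6f3 for Gen1
```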
After downloading and unzipping the datasets, soft-link Gen1 to `./datasets/gen1` and 1Mpx to `./datasets/gen4` (e.g., `ln -s /path/to/gen1 ./datasets/gen1`).
To simulate the weakly-/semi-supervised learning settings, we need to sub-sample labels from the original dataset. Importantly, the data split must be kept the same across experiments.
- For the semi-supervised setting, where we keep the labels for some sequences and leave the other sequences completely unlabeled, this is relatively easy: we sort the event sequences by name so that their order is deterministic across runs, and then select the unlabeled sequences from the sorted list (see the sketch below).
- For the weakly-supervised setting, where we sub-sample the labels of all sequences, it is a bit tricky because there are two modes of data sampling in the codebase, and they pre-process events in different ways. To keep the data split consistent, we create a split file for each setting, which are stored here. If you want to explore new experimental settings, remember to create your own split files and read from them here.
All results in the paper are averaged over three different splits (we offset the index when sub-sampling the data). Overall, the performance variation across splits is very small. Therefore, we only release the split files, config files, and pre-trained weights for the first variant we experimented with.
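The deterministic sub-sampling can be sketched as follows. This is a minimal illustration under assumed names; `select_labeled_seqs` and its arguments are not actual functions in the codebase:

```python
# Minimal sketch of deterministic sequence-level label sub-sampling.
import os

def select_labeled_seqs(data_root, ratio, offset=0):
    """Deterministically choose which sequences keep their labels.

    Sorting by name fixes the order across runs; `offset` shifts the
    selection to generate a different split variant.
    """
    seqs = sorted(os.listdir(data_root))  # deterministic order across runs
    step = round(1.0 / ratio)  # e.g. ratio=0.05 -> keep every 20th sequence
    labeled = set(seqs[offset::step])
    unlabeled = [s for s in seqs if s not in labeled]
    return sorted(labeled), unlabeled
```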
We provide checkpoints for all the models used to produce the final results in the paper.
In addition, we provide models pre-trained on the limited annotated data (the Supervised Baseline
method in the paper) to ease your experiments.
Please download the pre-trained weights from Google Drive and unzip them to `./pretrained/`.
The weights are grouped by the section of the paper in which they are presented. The naming follows the pattern `rvt-{$MODEL_SIZE}-{$DATASET}x{$RATIO_OF_DATA}_{$SETTING}.ckpt`.
For example, `rvt-s-gen1x0.02_ss.ckpt` is the RVT-S pre-trained on 2% of Gen1 data under the weakly-supervised setting. `rvt-s-gen4x0.05_seq-final.ckpt` is the RVT-S trained on 5% of 1Mpx data under the semi-supervised setting, where `-final` means it is the LEOD self-trained model (used to produce the results in the paper).
Note: it might be a bit confusing, but `ss` means weakly-supervised (all event sequences are sparsely labeled) and `seq` means semi-supervised (some event sequences are densely labeled, while others are completely unlabeled).
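If you script over many checkpoints, a small parser for this naming scheme may be convenient. `parse_ckpt_name` below is a hypothetical helper, not part of the codebase:

```python
# Hypothetical helper to decode checkpoint names of the form
# rvt-{size}-{dataset}x{ratio}_{setting}[-final].ckpt
import re

_CKPT_RE = re.compile(
    r'rvt-(?P<size>[a-z]+)-(?P<dataset>gen\d)x(?P<ratio>[\d.]+)'
    r'_(?P<setting>ss|seq)(?P<final>-final)?\.ckpt$')

def parse_ckpt_name(fn):
    m = _CKPT_RE.match(fn)
    assert m is not None, f'unrecognized checkpoint name: {fn}'
    return {
        'model_size': m['size'],
        'dataset': m['dataset'],  # gen1 or gen4 (1Mpx)
        'data_ratio': float(m['ratio']),
        'setting': m['setting'],  # 'ss' = weakly-, 'seq' = semi-supervised
        'self_trained': m['final'] is not None,  # LEOD self-trained model
    }

print(parse_ckpt_name('rvt-s-gen4x0.05_seq-final.ckpt'))
```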