Train on own dataset #93
Comments
+1 |
I have been working on this for a while. After a huge hiccup I decided to start over and document it. So far I have been able to get pretext working with a custom dataset and the semantic clustering step is currently running. If you want I can share my document once everything is up and running. Hoping to have it done this weekend. @fernandorovai @showkeyjar |
Come on! Looking forward to it! |
+1 |
The most important thing you need to do is implement your own dataset.py. You can paste a copy of an existing one, such as cifar10.py, into the same directory and adapt it. Then edit configs/env.yml to set the output folder. |
I see, but those examples all have labels. I was hoping for an example with no labels at all. |
It isn't a polished product but this is what I have: |
@hkhailee Very good, and thank you. I'll give it a try. |
I can't open the link |
That's the correct path. If you have issues, try searching for the GOLDEN_unsupervised repository under user hkhailee; the file is TrainingYourOwnDataset.md on the main branch. You can also click on a user's name (such as mine), view the repositories, and open GOLDEN_unsupervised, or manually copy and paste the link into another browser window. |
@hkhailee The text of your link is correct whereas the embedded url is not (where people click and are directed to somewhere else). You can change it by "Add a link." |
The dataloader expects a target variable. Since I am using unlabeled data, I do not have labels. How should I handle that? I am keeping target = 0 to make the code work. The simclr and scan code runs, but selflabel throws the error "Mask in MaskedCrossEntropyLoss is all zeros". So how will the model work for a completely unlabeled training set? |
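One possible reading of this error, not confirmed anywhere in the thread: the selflabel step may keep only samples whose most likely class has a probability above a confidence threshold, and raise when no sample in a batch passes. If that is the case, the placeholder targets are not the cause; the model is simply not confident about any sample yet. The plain-Python sketch below illustrates that idea only. The function name `confident_mask` and the 0.99 default are hypothetical and are not taken from the repository.

```python
def confident_mask(probabilities, threshold=0.99):
    """Pick a pseudo-label per sample and flag the confident ones.

    probabilities: list of per-sample class probability lists.
    Returns (pseudo_labels, mask); raises if no sample is confident.
    """
    pseudo_labels, mask = [], []
    for row in probabilities:
        # Index of the most likely class for this sample.
        best = max(range(len(row)), key=row.__getitem__)
        pseudo_labels.append(best)
        # Only samples above the threshold would contribute to the loss.
        mask.append(row[best] > threshold)
    if not any(mask):
        # Hypothetical analogue of the "mask is all zeros" error.
        raise ValueError("no sample is above the confidence threshold")
    return pseudo_labels, mask


# Two confident samples and one uncertain sample.
batch = [[0.995, 0.003, 0.002], [0.40, 0.35, 0.25], [0.001, 0.998, 0.001]]
labels, mask = confident_mask(batch)
```

Under this assumption, a batch where every row looks like the middle one would trigger the error, which would point at the clustering quality or the threshold rather than at the missing labels.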
I'm having a problem similar to that of @MotiBaadror. My dataset is very complex and has no labels. How do I cluster data like this? |
@cilemafacan When I run simclr.py on my dataset, this step saves an npz file. If I open that file, the nearest neighbors are the same for all the examples. Did you encounter the same problem? |
I am creating a file for my own dataset, similar to the stl10 dataset file in the data folder. My data doesn't have any labels, so I return 2 for every sample by default instead of a label. This way I can start simclr training. When I examine the resulting .npy file, the nearest neighbors are different. The .npy file looks like this: array([[ 265, 2049, 109, 1353, 2028, 532, 395, 144, 2084, 1067, 942, I'm not sure I'm doing it right. What I don't understand is the contrastive_evaluate step on line 120 of simclr.py. I don't understand exactly what is being done in this step, and I'm getting an error there because my data is unlabeled. |
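To compare the two reports in this thread (identical neighbors for every sample versus different neighbors), a quick sanity check on the saved neighbor table may help. The sketch below assumes the file holds a 2-D integer array with one row of neighbor indices per sample; for a .npy file, `numpy.load(path).tolist()` yields a list of lists in that shape. The function name `summarize_neighbors` is hypothetical.

```python
def summarize_neighbors(rows):
    """Summarize a nearest-neighbor table given as a list of index lists."""
    n = len(rows)
    # Number of distinct rows; 1 means every sample has the same neighbors.
    distinct_rows = len({tuple(r) for r in rows})
    # Indices that cannot refer to a sample in this table.
    out_of_range = sum(1 for r in rows for i in r if not 0 <= i < n)
    return {
        "samples": n,
        "distinct_rows": distinct_rows,
        "out_of_range": out_of_range,
        "degenerate": n > 1 and distinct_rows == 1,
    }


healthy = summarize_neighbors([[0, 2, 1], [1, 0, 2], [2, 1, 0]])
collapsed = summarize_neighbors([[0, 1, 2]] * 3)
```

A table flagged as degenerate suggests the features collapsed or the same image is being loaded for every index, so the dataset's `__getitem__` is a reasonable first place to look.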
How did you guys set up your models and get_item() |
here is my get_item
This is how I am defining my dataset
I am running simclr.py on my dataset. I do not have labels, so I am setting target = 1 to make the code run. |
Not sure since I only worked with moco.py but for mine: calling the dataset:
RICO20:
My model parameters for pretext:
I separated my data into 3 groups: train, test, and val. You do not have a val set (mine was only 1k images). Could it be overfitting your clusters? Also, the number of classes I have listed is 20; however, when I viewed all the clustered images together, the classification found only 19 unique clusters, and past that the clustered images are replicated (kind of cool). |
What will the target be in your get_item()? Do you have labels? |
In the repo you created for rico, you give the subset file path in the moco file. Similar to this, I need labels to create a subset file, but my dataset has no labels. Are you giving a label for the train data from your dataset in the get_item() section? I gave int 2 for the labels in the get_item() part that I created for myself, but I gave it only to run simclr. Actually, I don't have such a label.
|
The subset file in RICO20 is never used. I never call RICO20_sub, which would use a subset file. There is a piece of code that labels all unlabeled images with 255, and that is their target. It's in data/stl.py. When I was using moco, I had 66k unlabeled images and 1k labeled images to test my results against.
With moco, the target comes from the name of the folder the image is in. My train, val, and test images are all in subfolders, so train/1 and test/1. My 1k labeled images are also in subfolders with real names, for example val/bare, val/gallery, etc. These names aren't important. I could have set train/20 and test/20, and had my validation folders be labeled 0-19. |
So you are using the 1k labeled enRICO images as the val dataset. Did I get that right? |
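The folder-based targets and the 255 placeholder described in this comment can be sketched as a small map-style dataset. This is a minimal illustration under stated assumptions, not the code from the repository or from RICO20: the class name `FolderDataset`, the `loader` argument, and the returned dict layout are all hypothetical, so match whatever the dataset file you copied actually returns. By default the loader hands back the file path; pass a real image loader (for example one built on PIL) when training.

```python
import os

UNLABELED = 255  # placeholder target for images without a real label
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


class FolderDataset:
    """Map-style dataset over root/<subfolder>/<image>.

    The target is the index of the subfolder name in class_names,
    or UNLABELED when the subfolder is not a known class.
    """

    def __init__(self, root, class_names=None, loader=None, transform=None):
        self.class_names = list(class_names) if class_names else []
        # Hypothetical hook: replace with a function that opens the image.
        self.loader = loader or (lambda path: path)
        self.transform = transform
        self.samples = []
        for folder in sorted(os.listdir(root)):
            folder_path = os.path.join(root, folder)
            if not os.path.isdir(folder_path):
                continue
            if folder in self.class_names:
                target = self.class_names.index(folder)
            else:
                target = UNLABELED
            for name in sorted(os.listdir(folder_path)):
                if name.lower().endswith(IMAGE_EXTENSIONS):
                    self.samples.append((os.path.join(folder_path, name), target))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, target = self.samples[index]
        image = self.loader(path)
        if self.transform is not None:
            image = self.transform(image)
        # Assumed output layout; adjust to the keys your training code reads.
        return {"image": image, "target": target, "meta": {"index": index, "path": path}}
```

With a layout like train/1 for unlabeled images and val/bare, val/gallery for labeled ones, `FolderDataset("train")` would give every sample the target 255, while `FolderDataset("val", class_names=["bare", "gallery"])` would give real class indices to evaluate against.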
Has anyone created a better repository for training on an unlabelled dataset? |
Hi @hkhailee, @cilemafacan and @MotiBaadror, I got the same error (Mask in MaskedCrossEntropyLoss is all zeros) while trying to selflabel. Is there a possible way to selflabel without a need for a labeled validation dataset? Thank you in advance! |
For me (at the moment with labeled data), this dataset class works:
import sys, os
class OwnDataset(Dataset):
|
For training without class names, adding this works: |
In case anyone comes here looking for a simple script to create an unsupervised visualization from a collection of images, I just published this: https://github.com/brunovianna/collectionview |
Hello, thanks for sharing this work. Are there instructions for training on our own dataset? Thanks