
Train on own dataset #93

Open
fernandorovai opened this issue Oct 24, 2021 · 27 comments

Comments

@fernandorovai

Hello, thanks for sharing this work. Are there instructions for training on our own dataset? Thanks

fernandorovai changed the title "Train on own datase" to "Train on own dataset" on Oct 24, 2021
@showkeyjar

+1

@hkhailee

hkhailee commented Nov 5, 2021

I have been working on this for a while. After a huge hiccup I decided to start over and document it. So far I have been able to get pretext working with a custom dataset and the semantic clustering step is currently running. If you want I can share my document once everything is up and running. Hoping to have it done this weekend. @fernandorovai @showkeyjar

@showkeyjar

Great! Looking forward to it!

@brunovianna

+1

@zhaotf16

The most important thing you need to do is to implement your own dataset.py. You can copy something like cifar10.py into the directory. Then edit configs/env.yml to set the output folder.

@showkeyjar

The most important thing you need to do is to implement your own dataset.py. You can copy something like cifar10.py into the directory. Then edit configs/env.yml to set the output folder.

I see, but those examples all have labels. I'm looking for an example with no labels at all.
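For reference, here is a minimal sketch of what a fully unlabeled dataset could look like. It mimics the output dictionary format used by the dataset classes in this repo ('image', 'target', 'meta') and uses 255 as a dummy ignore-index target, as data/stl.py does for unlabeled images; the class name, folder layout, and file extension are made up for illustration.

import os
from glob import glob
from PIL import Image
from torch.utils.data import Dataset

class UnlabeledImages(Dataset):
    """Hypothetical dataset for a flat folder of images without labels."""

    def __init__(self, root, transform=None):
        self.img_paths = sorted(glob(os.path.join(root, '*.jpg')))
        self.transform = transform

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, index):
        path = self.img_paths[index]
        with open(path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        im_size = img.size
        if self.transform is not None:
            img = self.transform(img)
        # no real labels: 255 acts as an ignore index, like in data/stl.py
        return {'image': img, 'target': 255,
                'meta': {'im_size': im_size, 'index': index, 'path': path}}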

@hkhailee

hkhailee commented Nov 19, 2021

It isn't a polished product but this is what I have:
https://github.com/hkhailee/GOLDEN_unsupervised/blob/main/TrainingYourOwnDataset.md

@showkeyjar

@hkhailee Very good, and thank you. I'll give it a try.

@ravit-cohen-segev

It isn't a polished product but this is what I have: https://github.com/hkhailee/GOLDEN_unsupervised/blob/main/TrainingYourOwnDataset.md

I can't open the link

@hkhailee

hkhailee commented Nov 24, 2021

It isn't a polished product but this is what I have: https://github.com/hkhailee/GOLDEN_unsupervised/blob/main/TrainingYourOwnDataset.md

I can't open the link

That's the correct path. If you're having issues, try going directly to the search bar and looking for the GOLDEN_unsupervised repository under user hkhailee; the file is TrainingYourOwnDataset.md on the main branch.

You can also click on a user's name (such as mine), view their repositories, and open GOLDEN_unsupervised. Or you can manually copy and paste the link into another browser window.

@wetliu

wetliu commented Dec 17, 2021

@hkhailee The text of your link is correct, but the embedded URL is not (clicking it directs people somewhere else). You can fix it with "Add a link."

@MotiBaadror

The dataloader has a target variable. Since I am using unlabeled data, I do not have labels. How do I handle that? I am keeping target = 0 to make the code work. The simclr and scan steps run, but selflabel throws the error "Mask in MaskedCrossEntropyLoss is all zeros". So how will the model work for a completely unlabeled training set?
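For context, the selflabel step only keeps samples whose cluster prediction is above a confidence threshold; if no sample in the batch clears the threshold, the mask is empty and the loss raises exactly this error. A rough sketch of that masking logic (the threshold value and names here are illustrative, not the repo's exact code):

import torch
import torch.nn.functional as F

def masked_cross_entropy_sketch(anchor_logits, augmented_logits, threshold=0.99):
    # confidence of the predictions on the original (anchor) view
    probs = torch.softmax(anchor_logits, dim=1)
    max_prob, pseudo_label = probs.max(dim=1)
    mask = max_prob > threshold                     # keep only confident samples
    if mask.sum() == 0:
        raise ValueError('Mask in MaskedCrossEntropyLoss is all zeros')
    # cross-entropy on the augmented view against the confident pseudo-labels
    return F.cross_entropy(augmented_logits[mask], pseudo_label[mask])

So the error typically means the model is not confident on any sample in the batch, rather than a problem with the dummy targets themselves.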

@cilemafacan

I'm having a problem similar to that of @MotiBaadror. My dataset is very complex and has no labels. How do I cluster my data like this?

@MotiBaadror

@cilemafacan When I run simclr.py for my dataset, this step saves an npz file. If I open that file, the nearest neighbors are the same for all the examples. Did you encounter the same problem?

@cilemafacan

@cilemafacan When I run simclr.py for my dataset, this step saves an npz file. If I open that file, the nearest neighbors are the same for all the examples. Did you encounter the same problem?

I am creating a file for my own dataset, similar to the stl10 dataset file in the data folder. The issue is that my data doesn't have any labels, so instead of real labels I set all targets to 2 by default. This way I can start simclr training. When I examine the resulting .npy file, the nearest neighbors are different. The .npy file looks like this:

array([[ 265, 2049, 109, 1353, 2028, 532, 395, 144, 2084, 1067, 942,
1343, 830, 1054, 2191, 189, 1239, 1738, 501, 123, 619],
[ 144, 1414, 1428, 1310, 1064, 1954, 424, 95, 1520, 334, 2145,
1641, 323, 1670, 1543, 538, 920, 1180, 1540, 2050, 1814],
[ 145, 279, 921, 1939, 179, 713, 861, 720, 1489, 1005, 1283,
1170, 413, 405, 260, 273, 2305, 2198, 1564, 1818, 289],
[1604, 259, 1300, 532, 1680, 1817, 2184, 1428, 1576, 315, 174,
1983, 1128, 1753, 1733, 40, 893, 889, 748, 1255, 2046]....

I'm not sure I'm doing it right. What I don't understand is the contrastive_evaluate step on line 120 of simclr.py; I don't understand exactly what is being done there. I'm getting an error at this step because my data is unlabeled.
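A quick way to sanity-check the saved nearest-neighbor file is to load it with numpy and confirm that different rows contain different neighbor indices (the path below is hypothetical; use whatever file your config's topk_neighbors_train_path points to):

import numpy as np

neighbors = np.load('output/pretext/topk-train-neighbors.npy')  # hypothetical path
print(neighbors.shape)      # (num_samples, topk) or (num_samples, topk + 1)
print(neighbors[:5])        # rows should differ from sample to sample
# number of distinct rows; a value close to num_samples means the embedding has not collapsed
print(np.unique(neighbors, axis=0).shape[0])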

@hkhailee

How did you all set up your models and __getitem__()?

@MotiBaadror

MotiBaadror commented Dec 28, 2021

here is my get_item


def __getitem__(self, index):
    # sample = self.dataset.__getitem__(index)
    # image = sample['image']

    sample = {}
    image = self.dataset.__getitem__(index)[0]
    sample['target'] = 1  # no real labels, so a constant dummy target is used

    sample['image'] = self.image_transform(image)
    sample['image_augmented'] = self.augmentation_transform(image)

    return sample

This is how I am defining my dataset

dataset = torchvision.datasets.ImageFolder('data/my_data',transform=transform)

I am running simclr.py for my dataset. I do not have labels, so I am setting target = 1 to make the code run.

@hkhailee

hkhailee commented Dec 28, 2021

Not sure, since I only worked with moco.py, but here is mine:

calling the dataset:

elif p['train_db_name'] == 'rico-20':
    from data.rico import RICO20
    subset_file = ''
    dataset = RICO20(subset_file=subset_file, split='train', transform=transform)

RICO20:


# imports needed by this snippet (adjust the paths to your copy of the repo)
import os
from PIL import Image
import torchvision.transforms as tf
from torchvision import datasets
from utils.mypath import MyPath

class RICO20(datasets.ImageFolder):
    def __init__(self, subset_file, root=MyPath.db_root_dir('rico-20'), split='train', transform=None):
        super(RICO20, self).__init__(root=os.path.join(root, '%s/' % (split)),
                                     transform=None)
        self.transform = transform
        self.split = split
        self.resize = tf.Resize(256)

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, index):
        path, target = self.imgs[index]  # target is the index of the image's subfolder
        with open(path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        im_size = img.size
        img = self.resize(img)

        if self.transform is not None:
            img = self.transform(img)

        out = {'image': img, 'target': target, 'meta': {'im_size': im_size, 'index': index, 'path': path}}

        return out

    def get_image(self, index):
        path, target = self.imgs[index]
        with open(path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        img = self.resize(img)
        return img

My model parameters for pretext:


setup: moco # MoCo is used here

backbone: resnet50
model_kwargs:
   head: mlp
   features_dim: 128

train_db_name: rico-20
val_db_name: rico-20
num_classes: 20
temperature: 0.07

batch_size: 128 
num_workers: 8

transformation_kwargs:
   crop_size: 224
   normalize:
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]

I separated my data into 3 groups: train, test, and val. You do not have a val set (mine was only 1k images). Could it be overfitting your clusters?

Also, the number of classes I have listed is 20; however, when I visually inspected the clusters after classification, I only found 19 unique ones. Past that there is replication of clustered images (kind of cool).

@MotiBaadror

What will the target be in your __getitem__? Do you have labels?

@cilemafacan

In the repo you created for rico, you pass the subset file path in the moco file. To do something similar I would need labels to create a subset file, but my dataset has no labels. Are you assigning a label to the train data from your dataset in __getitem__()? I used the integer 2 for the labels in the __getitem__() I wrote for myself, but only to get simclr running. I don't actually have such labels.

elif p['train_db_name'] == 'rico-20':
    from data.rico import RICO20
    subset_file = '/bsuhome/hkiesecker/scratch/imageClassification/GOLDEN/UnsupervisedClassification/data/rico_subsets/%s.txt' % (p['train_db_name'])
    dataset = RICO20(subset_file=subset_file, split='train', transform=transform)

@hkhailee

hkhailee commented Dec 29, 2021

The subset file in RICO20 is never used; I never call RICO20_sub, which would use a subset file. There is a piece of code in data/stl.py that labels all unlabeled images with 255, and that becomes their target. When I was using moco I had 66k unlabeled images and 1k labeled images to test my results against.


if self.labels is not None:
    img, target = self.data[index], int(self.labels[index])
    class_name = self.classes[target]
else:
    img, target = self.data[index], 255  # 255 is an ignore index
    class_name = 'unlabeled'

With moco, the target comes from the name of the folder the image is in. My train, val, and test images are all in subfolders, e.g. train/1 and test/1; my 1k labeled images are also in subfolders with real names, for example val/bare, val/gallery, etc. These names aren't important. I could have used train/20 and test/20 and had my validation folders labeled 0-19.
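In other words, with an ImageFolder-style dataset the target is just the index of the image's subfolder, so a layout like the hypothetical one below gives every unlabeled training image the same dummy target while the small labeled val split keeps real class folders (file names and counts are illustrative):

rico-20/
    train/1/img_000001.png      # unlabeled images, all under one dummy folder
    train/1/img_000002.png
    test/1/img_070001.png
    val/bare/img_090001.png     # 1k labeled images in named class folders
    val/gallery/img_090002.png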

@cilemafacan

So you are using the 1k labeled enRICO images as the val dataset. Did I get that right?

@Arunxarvio

Has anyone created a better repository for training on an unlabelled dataset?

@mzacri

mzacri commented Apr 8, 2022

Hi @hkhailee, @cilemafacan and @MotiBaadror,

I got the same error (Mask in MaskedCrossEntropyLoss is all zeros) while trying to run selflabel. Is there a way to selflabel without needing a labeled validation dataset?

Thank you in advance!

@catweis

catweis commented Apr 26, 2022

For me (at the moment with labeled data), this dataset class works:

import sys, os
from PIL import Image
import cv2
from torch.utils.data import Dataset
sys.path.append(os.getcwd())

class OwnDataset(Dataset):

    def __init__(self, img_paths, transform=None, class_names=['bla'], im_size=128):
        self.img_paths = img_paths
        self.transform = transform
        self.class_names = class_names
        self.im_size = im_size

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, idx):
        img_filepath = self.img_paths[idx]
        img = cv2.imread(img_filepath)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (self.im_size, self.im_size))
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)

        class_name = os.path.basename(os.path.dirname(img_filepath))
        target = self.class_names.index(class_name)

        out = {'image': img, 'target': target, 'meta': {'im_size': self.im_size, 'class_name': self.class_names}}

        return out

    def get_image(self, idx):
        img_filepath = self.img_paths[idx]
        img = cv2.imread(img_filepath)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (self.im_size, self.im_size))
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)
        return img
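A hypothetical usage sketch for the class above, assuming the images live in per-class subfolders so the class name can be read from the parent directory (the paths and extension are made up):

import os
from glob import glob
import torchvision.transforms as transforms

img_paths = sorted(glob('data/my_data/*/*.jpg'))   # hypothetical layout
class_names = sorted({os.path.basename(os.path.dirname(p)) for p in img_paths})
transform = transforms.ToTensor()

dataset = OwnDataset(img_paths, transform=transform,
                     class_names=class_names, im_size=128)
sample = dataset[0]
print(sample['image'].shape, sample['target'])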

@catweis

catweis commented Apr 26, 2022

For training, adding this to handle having no class names works:

    if self.class_names is not None:
        class_name = os.path.basename(os.path.dirname(img_filepath))
        target = self.class_names.index(class_name)
    else:
        target = 0
        class_name = 'No class'

However, model evaluation without labels is still an issue...

@brunovianna

In case anyone comes here looking for a simple script to create an unsupervised visualization from a collection of images, I just published this: https://github.com/brunovianna/collectionview
