
Train on own dataset #93

Open
fernandorovai opened this issue Oct 24, 2021 · 27 comments

Comments

@fernandorovai

Hello, thanks for sharing this work. Are there instructions for training on our own dataset? Thanks

fernandorovai changed the title "Train on own datase" to "Train on own dataset" on Oct 24, 2021
@showkeyjar

+1

@hkhailee

hkhailee commented Nov 5, 2021

I have been working on this for a while. After a huge hiccup I decided to start over and document it. So far I have been able to get pretext working with a custom dataset and the semantic clustering step is currently running. If you want I can share my document once everything is up and running. Hoping to have it done this weekend. @fernandorovai @showkeyjar

@showkeyjar

Great! Looking forward to it!

@brunovianna

+1

@zhaotf16

The most important thing you need to do is to implement your own dataset.py. You can copy something like cifar10.py into the directory. Then edit configs/env.yml to set the output folder.

@showkeyjar

The most important thing you need to do is to implement your own dataset.py. You can copy something like cifar10.py into the directory. Then edit configs/env.yml to set the output folder.

I see, but those examples all have labels. I'm looking for an example with no labels at all.
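For reference, here is a minimal sketch of what a fully unlabeled dataset could look like. It mimics the output dictionary format used by the dataset classes in this repo ('image', 'target', 'meta') and uses 255 as a dummy ignore-index target, as data/stl.py does for unlabeled images; the class name, folder layout, and file extension are made up for illustration.

import os
from glob import glob
from PIL import Image
from torch.utils.data import Dataset

class UnlabeledImages(Dataset):
    """Hypothetical dataset for a flat folder of images without labels."""

    def __init__(self, root, transform=None):
        self.img_paths = sorted(glob(os.path.join(root, '*.jpg')))
        self.transform = transform

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, index):
        path = self.img_paths[index]
        with open(path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        im_size = img.size
        if self.transform is not None:
            img = self.transform(img)
        # no real labels: 255 acts as an ignore index, like in data/stl.py
        return {'image': img, 'target': 255,
                'meta': {'im_size': im_size, 'index': index, 'path': path}}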

@hkhailee

hkhailee commented Nov 19, 2021

It isn't a polished product but this is what I have:
https://github.com/hkhailee/GOLDEN_unsupervised/blob/main/TrainingYourOwnDataset.md

@showkeyjar

@hkhailee Very good, and thank you. I'll give it a try.

@ravit-cohen-segev

It isn't a polished product but this is what I have: https://github.com/hkhailee/GOLDEN_unsupervised/blob/main/TrainingYourOwnDataset.md

I can't open the link

@hkhailee

hkhailee commented Nov 24, 2021

It isn't a polished product but this is what I have: https://github.com/hkhailee/GOLDEN_unsupervised/blob/main/TrainingYourOwnDataset.md

I can't open the link

That's the correct path. If you're having issues, try going directly to the search bar and looking for the GOLDEN_unsupervised repository under user hkhailee; the file is TrainingYourOwnDataset.md on the main branch.

You can also click on a user's name (such as mine), view their repositories, and open GOLDEN_unsupervised. Or you can manually copy and paste the link into another browser window.

@wetliu

wetliu commented Dec 17, 2021

@hkhailee The text of your link is correct, but the embedded URL is not (clicking it directs people somewhere else). You can fix it with "Add a link."

@MotiBaadror

The dataloader has a target variable. Since I am using unlabeled data, I do not have labels. How do I handle that? I am keeping target = 0 to make the code work. The simclr and scan steps run, but selflabel throws the error "Mask in MaskedCrossEntropyLoss is all zeros". So how will the model work for a completely unlabeled training set?
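For context, the selflabel step only keeps samples whose cluster prediction is above a confidence threshold; if no sample in the batch clears the threshold, the mask is empty and the loss raises exactly this error. A rough sketch of that masking logic (the threshold value and names here are illustrative, not the repo's exact code):

import torch
import torch.nn.functional as F

def masked_cross_entropy_sketch(anchor_logits, augmented_logits, threshold=0.99):
    # confidence of the predictions on the original (anchor) view
    probs = torch.softmax(anchor_logits, dim=1)
    max_prob, pseudo_label = probs.max(dim=1)
    mask = max_prob > threshold                     # keep only confident samples
    if mask.sum() == 0:
        raise ValueError('Mask in MaskedCrossEntropyLoss is all zeros')
    # cross-entropy on the augmented view against the confident pseudo-labels
    return F.cross_entropy(augmented_logits[mask], pseudo_label[mask])

So the error typically means the model is not confident on any sample in the batch, rather than a problem with the dummy targets themselves.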

@cilemafacan

I'm having a problem similar to that of @MotiBaadror. My dataset is very complex and has no labels. How do I cluster my data like this?

@MotiBaadror

@cilemafacan When I run simclr.py for my dataset, this step saves an npz file. If I open that file, the nearest neighbors are the same for all the examples. Did you encounter the same problem?

@cilemafacan

@cilemafacan When I run simclr.py for my dataset, this step saves an npz file. If I open that file, the nearest neighbors are the same for all the examples. Did you encounter the same problem?

I am creating a file for my own dataset, similar to the stl10 dataset file in the data folder. The issue is that my data doesn't have any labels, so instead of real labels I set all targets to 2 by default. This way I can start simclr training. When I examine the resulting .npy file, the nearest neighbors are different. The .npy file looks like this:

array([[ 265, 2049, 109, 1353, 2028, 532, 395, 144, 2084, 1067, 942,
1343, 830, 1054, 2191, 189, 1239, 1738, 501, 123, 619],
[ 144, 1414, 1428, 1310, 1064, 1954, 424, 95, 1520, 334, 2145,
1641, 323, 1670, 1543, 538, 920, 1180, 1540, 2050, 1814],
[ 145, 279, 921, 1939, 179, 713, 861, 720, 1489, 1005, 1283,
1170, 413, 405, 260, 273, 2305, 2198, 1564, 1818, 289],
[1604, 259, 1300, 532, 1680, 1817, 2184, 1428, 1576, 315, 174,
1983, 1128, 1753, 1733, 40, 893, 889, 748, 1255, 2046]....

I'm not sure I'm doing it right. What I don't understand is the contrastive_evaluate step on line 120 of simclr.py; I don't understand exactly what is being done there. I'm getting an error at this step because my data is unlabeled.
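A quick way to sanity-check the saved nearest-neighbor file is to load it with numpy and confirm that different rows contain different neighbor indices (the path below is hypothetical; use whatever file your config's topk_neighbors_train_path points to):

import numpy as np

neighbors = np.load('output/pretext/topk-train-neighbors.npy')  # hypothetical path
print(neighbors.shape)      # (num_samples, topk) or (num_samples, topk + 1)
print(neighbors[:5])        # rows should differ from sample to sample
# number of distinct rows; a value close to num_samples means the embedding has not collapsed
print(np.unique(neighbors, axis=0).shape[0])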

@hkhailee

How did you all set up your models and __getitem__()?

@MotiBaadror

MotiBaadror commented Dec 28, 2021

here is my get_item


def __getitem__(self, index):
    # sample = self.dataset.__getitem__(index)
    # image = sample['image']

    sample = {}
    image = self.dataset.__getitem__(index)[0]
    sample['target'] = 1  # no real labels, so a constant dummy target is used

    sample['image'] = self.image_transform(image)
    sample['image_augmented'] = self.augmentation_transform(image)

    return sample

This is how I am defining my dataset

dataset = torchvision.datasets.ImageFolder('data/my_data',transform=transform)

I am running simclr.py for my dataset. I do not have labels, so I am setting target = 1 to make the code run.

@hkhailee

hkhailee commented Dec 28, 2021

Not sure, since I only worked with moco.py, but here is mine:

calling the dataset:

elif p['train_db_name'] == 'rico-20':
    from data.rico import RICO20
    subset_file = ''
    dataset = RICO20(subset_file=subset_file, split='train', transform=transform)

RICO20:


# imports needed by this snippet (adjust the paths to your copy of the repo)
import os
from PIL import Image
import torchvision.transforms as tf
from torchvision import datasets
from utils.mypath import MyPath

class RICO20(datasets.ImageFolder):
    def __init__(self, subset_file, root=MyPath.db_root_dir('rico-20'), split='train', transform=None):
        super(RICO20, self).__init__(root=os.path.join(root, '%s/' % (split)),
                                     transform=None)
        self.transform = transform
        self.split = split
        self.resize = tf.Resize(256)

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, index):
        path, target = self.imgs[index]  # target is the index of the image's subfolder
        with open(path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        im_size = img.size
        img = self.resize(img)

        if self.transform is not None:
            img = self.transform(img)

        out = {'image': img, 'target': target, 'meta': {'im_size': im_size, 'index': index, 'path': path}}

        return out

    def get_image(self, index):
        path, target = self.imgs[index]
        with open(path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        img = self.resize(img)
        return img

My model parameters for pretext:


setup: moco # MoCo is used here

backbone: resnet50
model_kwargs:
   head: mlp
   features_dim: 128

train_db_name: rico-20
val_db_name: rico-20
num_classes: 20
temperature: 0.07

batch_size: 128 
num_workers: 8

transformation_kwargs:
   crop_size: 224
   normalize:
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]

I separated my data into 3 groups: train, test, and val. You do not have a val set (mine was only 1k images). Could it be overfitting your clusters?

Also, the number of classes I have listed is 20; however, when I visually inspected the clusters after classification, I only found 19 unique ones. Past that there is replication of clustered images (kind of cool).

@MotiBaadror

What will the target be in your __getitem__? Do you have labels?

@cilemafacan

In the repo you created for rico, you pass the subset file path in the moco file. To do something similar I would need labels to create a subset file, but my dataset has no labels. Are you assigning a label to the train data from your dataset in __getitem__()? I used the integer 2 for the labels in the __getitem__() I wrote for myself, but only to get simclr running. I don't actually have such labels.

elif p['train_db_name'] == 'rico-20':
    from data.rico import RICO20
    subset_file = '/bsuhome/hkiesecker/scratch/imageClassification/GOLDEN/UnsupervisedClassification/data/rico_subsets/%s.txt' % (p['train_db_name'])
    dataset = RICO20(subset_file=subset_file, split='train', transform=transform)

@hkhailee

hkhailee commented Dec 29, 2021

The subset file in RICO20 is never used; I never call RICO20_sub, which would use a subset file. There is a piece of code in data/stl.py that labels all unlabeled images with 255, and that becomes their target. When I was using moco I had 66k unlabeled images and 1k labeled images to test my results against.


if self.labels is not None:
    img, target = self.data[index], int(self.labels[index])
    class_name = self.classes[target]
else:
    img, target = self.data[index], 255  # 255 is an ignore index
    class_name = 'unlabeled'

With moco, the target comes from the name of the folder the image is in. My train, val, and test images are all in subfolders, e.g. train/1 and test/1; my 1k labeled images are also in subfolders with real names, for example val/bare, val/gallery, etc. These names aren't important. I could have used train/20 and test/20 and had my validation folders labeled 0-19.
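In other words, with an ImageFolder-style dataset the target is just the index of the image's subfolder, so a layout like the hypothetical one below gives every unlabeled training image the same dummy target while the small labeled val split keeps real class folders (file names and counts are illustrative):

rico-20/
    train/1/img_000001.png      # unlabeled images, all under one dummy folder
    train/1/img_000002.png
    test/1/img_070001.png
    val/bare/img_090001.png     # 1k labeled images in named class folders
    val/gallery/img_090002.png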

@cilemafacan

So you are using the 1k labeled enRICO images as the val dataset. Did I get that right?

@Arunxarvio

Has anyone created a better repository for training on an unlabelled dataset?

@mzacri

mzacri commented Apr 8, 2022

Hi @hkhailee, @cilemafacan and @MotiBaadror,

I got the same error (Mask in MaskedCrossEntropyLoss is all zeros) while trying to run selflabel. Is there a way to selflabel without needing a labeled validation dataset?

Thank you in advance!

@catweis

catweis commented Apr 26, 2022

For me (at the moment with labeled data), this dataset class works:

import sys, os
from PIL import Image
import cv2
from torch.utils.data import Dataset
sys.path.append(os.getcwd())

class OwnDataset(Dataset):

    def __init__(self, img_paths, transform=None, class_names=['bla'], im_size=128):
        self.img_paths = img_paths
        self.transform = transform
        self.class_names = class_names
        self.im_size = im_size

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, idx):
        img_filepath = self.img_paths[idx]
        img = cv2.imread(img_filepath)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (self.im_size, self.im_size))
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)

        class_name = os.path.basename(os.path.dirname(img_filepath))
        target = self.class_names.index(class_name)

        out = {'image': img, 'target': target, 'meta': {'im_size': self.im_size, 'class_name': self.class_names}}

        return out

    def get_image(self, idx):
        img_filepath = self.img_paths[idx]
        img = cv2.imread(img_filepath)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (self.im_size, self.im_size))
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)
        return img
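A hypothetical usage sketch for the class above, assuming the images live in per-class subfolders so the class name can be read from the parent directory (the paths and extension are made up):

import os
from glob import glob
import torchvision.transforms as transforms

img_paths = sorted(glob('data/my_data/*/*.jpg'))   # hypothetical layout
class_names = sorted({os.path.basename(os.path.dirname(p)) for p in img_paths})
transform = transforms.ToTensor()

dataset = OwnDataset(img_paths, transform=transform,
                     class_names=class_names, im_size=128)
sample = dataset[0]
print(sample['image'].shape, sample['target'])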

@catweis

catweis commented Apr 26, 2022

For training, adding this to handle having no class names works:

    if self.class_names is not None:
        class_name = os.path.basename(os.path.dirname(img_filepath))
        target = self.class_names.index(class_name)
    else:
        target = 0
        class_name = 'No class'

However, model evaluation without labels is still an issue...

@brunovianna

In case anyone comes here looking for a simple script to create an unsupervised visualization from a collection of images, I just published this: https://github.com/brunovianna/collectionview
