De-duplicate between image data collections #39292

Go-MinSeong · 2025-01-15T08:48:53Z

I'm collecting images from cameras to build a training dataset. I want to exclude similar or duplicate images, since keeping them would be wasteful, and I want to judge similarity by how close their vector embeddings are.

I am planning to use the following method: when a new image comes in, compute the distance between its embedding and the embeddings already in the dataset (collection). If the closest existing vector is farther away than a threshold a, add the image to the dataset (collection); otherwise, skip it.
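The method described above can be sketched in a few lines. This is a minimal illustration, not a Milvus implementation: it assumes cosine distance, keeps the vectors in memory, and uses a brute-force scan. The class name `DedupCollection`, the method names, and the threshold value are all hypothetical. With a vector database, the scan would typically be replaced by a top-1 nearest-neighbor search against the collection.

```python
import numpy as np


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    v = np.asarray(vector, dtype=np.float64)
    return v / np.linalg.norm(v)


class DedupCollection:
    """Hypothetical in-memory stand-in for a vector collection."""

    def __init__(self, threshold):
        # Assumed to be a cosine distance (1 - cosine similarity).
        self.threshold = threshold
        self.vectors = []

    def nearest_distance(self, unit_vector):
        """Cosine distance to the closest stored vector (inf if empty)."""
        if not self.vectors:
            return float("inf")
        similarities = np.stack(self.vectors) @ unit_vector
        return float(1.0 - similarities.max())

    def try_add(self, embedding):
        """Store the embedding only if nothing already stored is within the threshold."""
        unit_vector = normalize(embedding)
        if self.nearest_distance(unit_vector) > self.threshold:
            self.vectors.append(unit_vector)
            return True
        return False


collection = DedupCollection(threshold=0.1)
print(collection.try_add([1.0, 0.0]))    # first image, collection is empty
print(collection.try_add([0.99, 0.05]))  # almost the same direction as the first
print(collection.try_add([0.0, 1.0]))    # orthogonal to the first
```

Note that this greedy scheme depends on arrival order: which image of a near-duplicate group is kept is simply whichever arrived first.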

Is this method efficient and correct?
Also, how should I set the threshold? Should I take my existing 100 images as a reference dataset and derive the threshold from their distance statistics?

I would appreciate it if you could answer these questions. Have a great day!

yhmo · 2025-01-15T10:46:51Z

Collaborator

It is difficult to define the threshold. The score/distance values do not follow a linear curve, so equal steps in distance do not correspond to equal steps in similarity. If you use a different embedding model, the threshold will be different. You can run some tests, observe the results, and determine a "good threshold", but that threshold may not work well in other cases.
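One way to run such a test is to look at how nearest-neighbor distances are distributed in a small sample before picking a value. The sketch below assumes cosine distance and uses random vectors as a stand-in for real image embeddings; the function name `nearest_neighbor_distances` and the chosen percentiles are hypothetical. The printed percentiles only suggest candidate thresholds. You would still need to inspect image pairs near each candidate by eye, and repeat the exercise whenever the embedding model changes.

```python
import numpy as np


def nearest_neighbor_distances(embeddings):
    """For each vector, return the cosine distance to its closest other vector."""
    x = np.asarray(embeddings, dtype=np.float64)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    distances = 1.0 - x @ x.T
    # Ignore each vector's distance to itself.
    np.fill_diagonal(distances, np.inf)
    return distances.min(axis=1)


# Placeholder data: 100 random 64-dimensional vectors standing in for
# embeddings of the 100 existing images.
rng = np.random.default_rng(0)
sample = rng.normal(size=(100, 64))

nn = nearest_neighbor_distances(sample)
for q in (5, 25, 50, 75, 95):
    print(f"p{q}: {np.percentile(nn, q):.4f}")
```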
