De-duplicate between image data collections #39292
Unanswered
Go-MinSeong
asked this question in
Q&A and General discussion
Replies: 1 comment
-
It is difficult to define the threshold. The score/distance values do not follow a linear curve, and a different embedding model will give a different threshold. You can run some tests, observe the results, and determine a "good threshold", but that threshold may not work well in other cases.
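One way to run such a test is to look at how nearest-neighbor distances are distributed over a small sample of your own embeddings. The snippet below is only a sketch under assumptions: it uses plain NumPy rather than any vector database API, it uses cosine distance, the random `sample` array is a placeholder for your real embeddings, and the 5th percentile is an arbitrary starting point, not a recommended value.

```python
import numpy as np

def nearest_neighbor_distances(embeddings):
    """Cosine distance from each embedding to its closest other embedding."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed.T       # pairwise cosine distances
    np.fill_diagonal(dist, np.inf)       # ignore each vector's distance to itself
    return dist.min(axis=1)

# Placeholder data: replace with the embeddings of your ~100 existing images.
rng = np.random.default_rng(0)
sample = rng.normal(size=(100, 64))

nn_dist = nearest_neighbor_distances(sample)

# Inspect several percentiles instead of trusting a single number.
for p in (5, 25, 50):
    print(f"{p}th percentile of nearest-neighbor distance: {np.percentile(nn_dist, p):.4f}")

# Hypothetical starting point; check it by eye against image pairs near this distance.
candidate_threshold = float(np.percentile(nn_dist, 5))
```

Whichever value you pick, look at the actual image pairs just above and just below it, and repeat the check if you change the embedding model or the distance metric.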
-
I'm collecting data from cameras to build a training dataset. I want to exclude similar or duplicate images, since keeping them would be wasteful, and I want to detect them by checking how close their vector embeddings are.
I am trying the following approach to achieve this: when an image comes in, compute the distance between its embedding and the embeddings already in the dataset (collection). If the closest vector is farther away than a threshold a, add the image to the dataset (collection); otherwise skip it.
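For illustration, the approach can be sketched as below. This is a minimal sketch, assuming cosine distance and an in-memory NumPy array in place of a real collection. The function names, the toy 2-D embeddings, and the threshold of 0.05 are all hypothetical.

```python
import numpy as np

def cosine_distance_to_all(query, stored):
    """Cosine distance from one vector to each row of `stored`."""
    q = query / np.linalg.norm(query)
    s = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    return 1.0 - s @ q

def filter_new_images(embeddings, threshold):
    """Greedy de-duplication: return the indices of the embeddings that are kept."""
    kept = []
    for i, emb in enumerate(embeddings):
        if not kept:
            kept.append(i)               # the first image is always kept
            continue
        dists = cosine_distance_to_all(emb, embeddings[kept])
        if dists.min() > threshold:      # closest stored vector is far enough away
            kept.append(i)
    return kept

# Toy embeddings: rows 1 and 3 are near-duplicates of rows 0 and 2.
embeddings = np.array([
    [1.0, 0.0],
    [0.99, 0.01],
    [0.0, 1.0],
    [0.01, 1.0],
])

kept = filter_new_images(embeddings, threshold=0.05)
print(kept)  # [0, 2]
```

The brute-force scan here costs one full pass over the kept set per incoming image. With a vector database, the same check would be a top-1 nearest-neighbor search followed by the threshold comparison.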
Is the above methodology efficient or correct?
Also, how should I set the threshold? Should I just take the existing 100 images as a dataset and use their statistics?
I would appreciate it if you could answer these questions. Have a great day!