Compile 5 suitable datasets to test out the Copypasta test #36

Open
1 of 5 tasks
andi-halim opened this issue Dec 6, 2024 · 2 comments
Labels: priority: 1 Highest priority assignment

andi-halim (Collaborator) commented Dec 6, 2024

Background

We want to gather 5 datasets that we are confident in and store them in this folder: https://drive.google.com/drive/u/0/folders/1OL7toRas_JbCw5nSMbbWb7j6yZciO5r5. We will use them to limit-test the copypasta test and take notes on how it behaves.

Data Requirements:

  • Unique Post Username
  • Unique Post Number
  • Post Content
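
A quick way to screen a candidate dataset against these requirements is sketched below. This is only a rough check, not part of the tool: the role names (`username`, `post_id`, `content`), the `check_dataset` helper, and the sample column names are all hypothetical, and it assumes the dataset is a CSV and that "unique" applies to the post number.

```python
import csv
import io

# Hypothetical role names; each dataset maps them to its own column names.
REQUIRED_ROLES = ("username", "post_id", "content")


def check_dataset(csv_text, column_map):
    """Return a list of problems; an empty list means the dataset looks usable.

    column_map maps each required role to the column name used in the CSV.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []

    # First make sure every required role points at a real column.
    for role in REQUIRED_ROLES:
        column = column_map.get(role)
        if column is None or column not in header:
            problems.append(f"missing column for role '{role}'")
    if problems:
        return problems

    # Then scan rows for duplicate post numbers and empty post content.
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        post_id = row[column_map["post_id"]]
        if post_id in seen_ids:
            problems.append(f"line {line_no}: duplicate post id {post_id!r}")
        seen_ids.add(post_id)
        if not (row[column_map["content"]] or "").strip():
            problems.append(f"line {line_no}: empty post content")
    return problems
```

Calling `check_dataset(text, {"username": "user", "post_id": "id", "content": "text"})` on the contents of a CSV file returns an empty list when all three columns are present, post ids do not repeat, and no post is empty.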

Problems

  • Find a group of datasets that is diverse enough to pose different challenges to the software

Desired Outcome

Be ready to test each of these datasets and fill out a questionnaire about it.

Tasks

  • Collect dataset 1: 24006 Ukraine Russia Tweets
  • Collect dataset 2
  • Collect dataset 3
  • Collect dataset 4
  • Collect dataset 5

DeanEby (Collaborator) commented Dec 17, 2024

I am currently unable to load the English and Turkish Misinformation Detection Dataset because the loader requires a message post time, which is not in the dataset. I'll generate some filler time data, but I wanted to mention it since I thought this column was not supposed to be required.
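
For reference, filler time data could be generated along these lines. This is a minimal sketch, assuming the rows are already loaded as dictionaries; the helper name `add_filler_timestamps`, the `filler_post_time` column name, and the start date and step are all made up for illustration and carry no meaning.

```python
from datetime import datetime, timedelta


def add_filler_timestamps(
    rows,
    column="filler_post_time",   # hypothetical column name
    start=datetime(2000, 1, 1),  # arbitrary placeholder start time
    step=timedelta(minutes=1),   # arbitrary spacing between posts
):
    """Return copies of rows with an evenly spaced placeholder timestamp added."""
    filled = []
    for index, row in enumerate(rows):
        new_row = dict(row)  # copy so the original rows are left untouched
        new_row[column] = (start + index * step).isoformat()
        filled.append(new_row)
    return filled
```

Naming the column so that it is obviously filler should help avoid anyone later mistaking it for real post times.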

andi-halim added the "priority: 1 Highest priority assignment" label Jan 9, 2025

andi-halim (Collaborator, Author) commented Jan 9, 2025

@DeanEby responding to your comment from 3 weeks ago! I ran into the same issue when working through the CLI. I resolved it by selecting an arbitrary column and moving on, since I don't believe the test uses that column at all.

Image

Above Doc

@soul-codes Is this a bug that you see as well? The copypasta test asks for a message post time column even though it doesn't appear to need one.

  • This relates to what you brought up before about running the inference and column matching on a dataset before choosing a test, and maybe saving those matching settings to a JSON file kept in the project directory. This is probably more of a stretch fix, but it is something we could prioritize more.
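
To make the JSON idea concrete, here is a rough sketch of saving and reloading column matching settings next to a project. Everything in it is an assumption for discussion: the `column_mapping.json` file name, the `save_column_mapping` and `load_column_mapping` helpers, and the shape of the mapping are hypothetical and do not reflect how the tool stores anything today.

```python
import json
from pathlib import Path

SETTINGS_FILE = "column_mapping.json"  # hypothetical file name


def save_column_mapping(project_dir, mapping):
    """Write the role-to-column mapping as JSON inside the project directory."""
    path = Path(project_dir) / SETTINGS_FILE
    path.write_text(json.dumps(mapping, indent=2), encoding="utf-8")
    return path


def load_column_mapping(project_dir):
    """Return the saved mapping, or None if nothing has been saved yet."""
    path = Path(project_dir) / SETTINGS_FILE
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

With something like this, the matching would be done once per dataset, and each test could read only the columns it actually needs from the saved mapping instead of prompting for all of them.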
