-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataset requests #8
Comments
This is more on Vasily's side, but a way to turn the id we use in the dataloader into the id in the capnp dataset (again, if not already existing). |
Already existed.
Done
Done
Done. Feel free to make further improvements.
This can of course be written, but for me these kind of things are usually throwaway scripts. You can write these kinds of things in +/- 15 lines of python code. Everyone has slightly different needs from such scripts, and I don't want to write scripts with a million options for this. So I don't think I'll be writing these. But if someone else wants to do it, that is fine with me.
This is something that is possible, but not trivial. I'm also interested in this myself, but not really for any machine learning purposes, but rather for proof engineering purposes. It seems to me that our dataset could also be very useful to folks that are engaged with that (see also https://proofengineering.org/). However, I don't immediately see any machine learning application for this, and as such I'm not regarding it as a high priority. What kind of usecase do you have in mind? What kind of precision do you need? Do you only need location information for definitions or do you also need the locations of tactics? |
The Coq file reference isn't a top priority. The reason I suggested it is that often I'm looking at coq data, and I wonder to myself, what does the proof or definition actually look like (as well as some of the surronding stuff). Right now I just google |
As for scripts, that is fine. I'll write my own. (Jelle and I were talking recently about some data it would be nice to have extracted.) Where is a good place for me to save the scripts? I can keep them to myself of course, or share them. If I share them, is there a place in your repo, or should I put them in graph2tac? |
I guess it is a good idea to put the scripts in this repo because then they are synchronized with the API. Maybe just put them in a sub-directory of pytact: https://github.com/coq-tactician/coq-tactician-reinforce/tree/coq8.11/pytact We can always reorganize this later. But please keep the scripts simple and general. The idea is not to put an entire neural network in this repo. If you want me to update your scripts when the API changes I'll ask you to either (1) use a strongly typed language or (2) provide easy-to-run tests. If not, that is fine too, but then I'd like you to update them yourself. |
The source for the current datasets is here https://github.com/coq/coq/tree/V8.11.2/theories and here https://github.com/coq/coq/tree/V8.11.2/plugins by the way (mind the tag). I can distribute the source files with the dataset I guess. And for this dataset that may not be a problem. But when we grow the dataset, we'll have to figure out how to deal with licenses. How can we license our dataset? Is the dataset transformative? Is it fair use? Which copyright jurisdiction even applies here? Certainly we cannot re-license the source files so those we will have to license as they were originally, unless that license allows re-licensing... It's a mess. The free software foundation has a whole call-for-opinions on this topic w.r.t Copilot: https://www.fsf.org/blogs/licensing/fsf-funded-call-for-white-papers-on-philosophical-and-legal-questions-around-copilot |
If you'd like any information to be added to the dataset that is not already there please list it here.
The text was updated successfully, but these errors were encountered: