[Spike] - Explore how to validate dataset classes in catalog using LSP #5
Good question! The schema was originally a local setting where you could point to a file (and thus edit the file to include new datasets). This is no longer possible. Ideally I would like it to version against kedro-datasets; I would love to access the API of kedro-datasets for this. The current solution relies on RedHat's YAML extension. There are two other options:
Thanks @michal-mmm, I can reproduce. Entries starting with an underscore aren't treated as datasets in general (couldn't find it documented, @noklam?)
@michal-mmm Sorry for the late reply; indeed, the schema doesn't understand YAML anchors/templates.
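For context, a minimal illustration of the pattern (the entry names here are made up): PyYAML resolves the anchor and merge key at load time, but a validator that only looks at the raw keys sees `_csv` as just another top-level entry and flags it.

```python
# Sketch of the anchor/template pattern the schema currently misreads.
import yaml

catalog = yaml.safe_load(
    """
_csv: &csv                      # template entry, not a dataset
  type: pandas.CSVDataset

companies:
  <<: *csv                      # merge key expands to the template's keys
  filepath: data/companies.csv
"""
)

assert "_csv" in catalog  # the template is still a top-level key after loading
assert catalog["companies"]["type"] == "pandas.CSVDataset"
```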
Some more evidence suggests that this may need to be handled sooner rather than later.
I anticipated this would become a problem, mainly because of two issues:
I thought we could at least disable this via the experimental flag, but I was wrong. Because the extension is limited by the RedHat YAML extension, right now it works by registering a special JSON schema file. If a user wants to change the schema, they need to get a local copy of the schema and add the new dataset they want. This is obviously not ideal, so better approaches are discussed in this comment:
Using 1. can give us more flexibility, but it mostly validates tokens, so I don't think it can solve all the problems. For example:

```yaml
my_dataset:
  type: my_dataset.MyDataset
```

It makes more sense to have the dataset act almost like a Pydantic validator and check whether the arguments are valid. This is also easier to maintain because we don't have to hardcode any rules; it should just use the class arguments directly (we may need some additional work to throw nice error messages, etc.). This will also support different dataset versions and custom datasets, because the validation will use the user's environment. (This assumes the dependencies are installed, of course.) Going for 2. seems most reasonable to me; it will take some more effort and design up front, but I think it's time to do a spike.
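A minimal sketch of what option 2 could look like, assuming catalog entries are plain dicts and `type` holds a fully qualified class path (the helper name `validate_entry` is hypothetical, not the extension's actual API):

```python
# Hedged sketch: validate a catalog entry against the dataset class's own
# __init__ signature, instead of hardcoding rules per dataset.
import importlib
import inspect


def validate_entry(name: str, entry: dict) -> list[str]:
    """Return a list of human-readable errors for one catalog entry."""
    module_name, _, class_name = entry.get("type", "").rpartition(".")
    try:
        cls = getattr(importlib.import_module(module_name), class_name)
    except (ImportError, AttributeError, ValueError):
        return [f"{name}: cannot import dataset class '{entry.get('type')}'"]

    kwargs = {k: v for k, v in entry.items() if k != "type"}
    try:
        # Bind the catalog keys against __init__ without instantiating the
        # class; unknown or missing arguments raise TypeError.
        inspect.signature(cls.__init__).bind(None, **kwargs)
    except TypeError as exc:
        return [f"{name}: {exc}"]
    return []
```

For an entry with a misspelled or missing argument, `bind()` raises a `TypeError` whose message (e.g. "got an unexpected keyword argument") can be surfaced directly as the diagnostic, so no per-dataset rules need to be maintained.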
Related discussion: kedro-org/kedro#4196
Even if we update our schema, that will never fix the red underline for custom datasets. On the other hand, even if the YAML file contains a dataset that's defined in the schema, the pipeline might still fail to run if said dataset is not installed. We agreed to try to check for "dataset can be used" rather than just "dataset is defined in the schema". For this, most likely we'll need the LSP.
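A hedged sketch of what that LSP check could look like, using pygls (the server wiring and names are illustrative; the diagnostic range is pinned to the first line for brevity, whereas a real implementation would map each entry back to its position in the file):

```python
# Sketch: surface "dataset can be used" checks as LSP diagnostics.
import importlib

import yaml
from lsprotocol.types import (
    TEXT_DOCUMENT_DID_OPEN,
    Diagnostic,
    DiagnosticSeverity,
    DidOpenTextDocumentParams,
    Position,
    Range,
)
from pygls.server import LanguageServer

server = LanguageServer("kedro-catalog-sketch", "v0.1")


def type_is_usable(type_path: str) -> bool:
    """True if the dataset class can be imported in the user's environment."""
    module_name, _, class_name = type_path.rpartition(".")
    try:
        return hasattr(importlib.import_module(module_name), class_name)
    except (ImportError, ValueError):
        return False


@server.feature(TEXT_DOCUMENT_DID_OPEN)
def did_open(ls: LanguageServer, params: DidOpenTextDocumentParams) -> None:
    doc = ls.workspace.get_document(params.text_document.uri)
    diagnostics = []
    catalog = yaml.safe_load(doc.source) or {}
    for name, entry in catalog.items():
        # Skip anchors/templates; they are not datasets (see above).
        if name.startswith("_") or not isinstance(entry, dict):
            continue
        type_path = str(entry.get("type", ""))
        if not type_is_usable(type_path):
            diagnostics.append(
                Diagnostic(
                    range=Range(
                        start=Position(line=0, character=0),
                        end=Position(line=0, character=1),
                    ),
                    message=f"Dataset '{name}': type '{type_path}' cannot be imported",
                    severity=DiagnosticSeverity.Warning,
                )
            )
    ls.publish_diagnostics(doc.uri, diagnostics)
```

Because the check runs in the user's environment, it covers custom datasets and third-party plugins that a static schema can never enumerate.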
#159 is already looking good!
Spotted this with https://github.com/Galileo-Galilei/kedro-mlflow:
`kedro_mlflow.io.models.MlflowModelTrackingDataset`
Not sure if this should be fixed on the extension or on the schema.
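For what it's worth, the environment-based check sketched above would cover this case without touching the schema; a quick hedged illustration:

```python
# Hypothetical usage of the import check on the kedro-mlflow type.
import importlib

type_path = "kedro_mlflow.io.models.MlflowModelTrackingDataset"
module_name, _, class_name = type_path.rpartition(".")
try:
    usable = hasattr(importlib.import_module(module_name), class_name)
except ImportError:
    usable = False
# True when kedro-mlflow is installed in the user's environment,
# False otherwise, so no schema update is needed either way.
print(usable)
```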