
[Spike] - Explore how to validate dataset classes in catalog using LSP #5

Open
Tracked by #58
astrojuanlu opened this issue May 9, 2024 · 10 comments · May be fixed by #159

@astrojuanlu (Member)

Spotted this with kedro_mlflow.io.models.MlflowModelTrackingDataset from https://github.com/Galileo-Galilei/kedro-mlflow

[screenshot of the validation error]

Not sure if this should be fixed on the extension or on the schema.

@noklam (Contributor) commented May 9, 2024

Good question! The schema was originally a local setting where you could point to a file (and thus edit the file to include new datasets). This is no longer possible. Ideally I would like to version it against kedro-datasets, but it's not possible to ship different versions of the schema with the extension.

I would love to access the API of https://github.com/redhat-developer/vscode-yaml so I could potentially support these in a flexible way; unfortunately it's not well documented, so I didn't pursue it.

The current solution relies on Red Hat's YAML extension. There are two other options:

  1. Use VS Code's native grammar-based validation - https://github.com/github/vscode-github-actions/blob/main/language/syntaxes/expressions.tmGrammar.json . An example is the GitHub Actions extension, which seems to give a much better editing experience when I tried it. This requires understanding how the grammar works and writing lots of regex. Maybe most of it can be copied and pasted, but I am not sure at this point.
  2. LSP - we can explore options where the schema validation happens dynamically and the LSP gives a response directly. We may need a "validation" method on the datasets, or some kind of Pydantic-like model validation, and rely on its error messages (a rough sketch follows below).
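
As a very rough illustration of option 2 (not the actual design): a hand-written Pydantic model for a single dataset type could surface the kind of error message the LSP would relay. In practice such models would have to be derived from the dataset classes rather than maintained by hand, and CSVDatasetConfig below is purely hypothetical:

from typing import Optional

from pydantic import BaseModel, ValidationError


class CSVDatasetConfig(BaseModel):
    # Hypothetical model mirroring pandas.CSVDataset's main arguments.
    type: str
    filepath: str
    load_args: Optional[dict] = None
    save_args: Optional[dict] = None


try:
    CSVDatasetConfig(type="pandas.CSVDataset")  # filepath is missing
except ValidationError as err:
    print(err)  # an LSP could surface this message as a diagnostic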

@khub41 commented May 13, 2024

Having this problem too when using custom datasets from another package.
[screenshot]

@michal-mmm

The problem also applies to YAML anchors/aliases:

_spark_parquet: &spark_parquet
  type: spark.SparkDataset
  save_args:
    mode: overwrite
[screenshot of the validation warning on the anchor entry]
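
For context, such an anchor is typically consumed elsewhere in the catalog via a merge key, along the lines of the hypothetical entry below; Kedro resolves the alias when loading the YAML, but the schema only sees the raw text:

intermediate_data:
  <<: *spark_parquet
  filepath: data/02_intermediate/intermediate_data.parquet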

@astrojuanlu (Member, Author)

Thanks @michal-mmm, I can reproduce.

[screenshot]

Entries starting with an underscore aren't treated as datasets in general (I couldn't find this documented, @noklam?)

@noklam (Contributor) commented Sep 6, 2024

@michal-mmm Sorry for the late reply; indeed, the schema doesn't understand YAML anchors/templates.

@noklam (Contributor) commented Sep 6, 2024

I just got off a call with a client using Kedro and Databricks, and he gave some feedback regarding the Kedro extension in VS Code 😄
He mentioned they have a custom dataset that does not get picked up by the VS Code extension, which throws an error saying the dataset class does not exist.

Some more evidence suggests that this may need to be handled sooner rather than later.

@noklam (Contributor) commented Sep 6, 2024

I anticipated this would become a problem, mainly because of two issues:

  • The JSON schema is fixed; it's not up to date and does not understand user custom datasets (or plugin datasets).
    • In theory users can update the schema themselves, but it's not trivial and the UX is awful.

I thought we could at least disable this via the experimental flag, but I was wrong. Because the extension is limited by the Red Hat YAML extension, right now it works by registering a special JSON file. If a user wants to change the schema, they need to get a local copy of the schema and add the new datasets they want.

This is obviously not ideal, so there are better approaches discussed in this comment:

  1. Using TextMate grammars, which are supported natively by VS Code - https://code.visualstudio.com/api/language-extensions/syntax-highlight-guide#tokenization
  2. LSP (needs to execute code)

Using 1. can give us more flexibility, but it mostly validates tokens, so I don't think it can solve all the problems. For example, type: pandas.CSVDataset and type: dummy.Dataset both look valid from a syntax point of view.
This suggests that if we want to solve most of the problems, we may need to rely on some runtime checking. Consider a dataset as follows:

my_dataset:
  type: my_dataset.MyDataset

It makes more sense to have the dataset act almost like a Pydantic validator that checks whether the arguments are valid. This is also easier to maintain because we don't have to hardcode any rules; it should just use the class arguments directly (we may need additional work to throw some nice error messages, etc.). This will also support different dataset versions & custom datasets, because the validation will use the user's environment. (This assumes the dependencies are installed, of course.)
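
A minimal sketch of that runtime check, assuming a hypothetical validate_dataset_entry helper; the type resolution below only roughly mirrors how Kedro loads dataset classes, and the real spike would need to reuse Kedro's own logic:

import importlib
import inspect


def validate_dataset_entry(name: str, config: dict) -> list[str]:
    """Return human-readable problems for a single catalog entry."""
    type_path = config.get("type", "")
    module_name, _, class_name = type_path.rpartition(".")
    if not module_name or not class_name:
        return [f"{name}: 'type' should look like 'module.ClassName', got '{type_path}'"]
    try:
        # Roughly mirrors resolving short names such as "pandas.CSVDataset".
        module = importlib.import_module(f"kedro_datasets.{module_name}")
    except ImportError:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            return [f"{name}: cannot import a module for type '{type_path}'"]
    cls = getattr(module, class_name, None)
    if cls is None:
        return [f"{name}: class '{class_name}' not found in '{module.__name__}'"]
    # Compare the remaining keys against the dataset's __init__ signature.
    params = inspect.signature(cls.__init__).parameters
    accepts_kwargs = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())
    return [
        f"{name}: unexpected argument '{key}' for {type_path}"
        for key in config
        if key != "type" and key not in params and not accepts_kwargs
    ]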

Going for 2. seems most reasonable to me; it will take some more effort and design up front, but I think it's time to do a spike.

@noklam noklam added the bug Something isn't working label Sep 23, 2024
@noklam noklam moved this to Todo in Kedro VS Code Sep 25, 2024
@noklam (Contributor) commented Nov 4, 2024

Related discussion: kedro-org/kedro#4196
Related ticket: kedro-org/kedro#4258

@noklam noklam changed the title Extension complains about dataset not from kedro-datasets [Spike] - Extension complains about dataset not from kedro-datasets Nov 4, 2024
@astrojuanlu astrojuanlu changed the title [Spike] - Extension complains about dataset not from kedro-datasets [Spike] - Explore how to validate dataset classes in catalog using LSP Nov 4, 2024
@astrojuanlu (Member, Author)

Even if we update our schema, that will never fix the red underline with custom datasets.

On the other hand, even if the YAML file contains a dataset that's defined in the schema, the pipeline might still not be able to run if said dataset is not installed.

We agreed to try to check for "dataset can be used" rather than just "catalog.yml syntax is correct".

For this, most likely we'll need the LSP.
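
A minimal sketch of how that check could be surfaced through the LSP, assuming a pygls/lsprotocol-based server; the diagnostics_for_type helper and the hard-coded range are illustrative only, not the actual extension code:

from lsprotocol.types import Diagnostic, DiagnosticSeverity, Position, Range


def diagnostics_for_type(line: int, type_path: str, importable: bool) -> list[Diagnostic]:
    """Build a warning for a `type:` line whose class cannot be used in this environment."""
    if importable:
        return []
    return [
        Diagnostic(
            range=Range(
                start=Position(line=line, character=0),
                end=Position(line=line, character=len("type: " + type_path)),
            ),
            message=f"Dataset class '{type_path}' cannot be imported in the current environment",
            severity=DiagnosticSeverity.Warning,
        )
    ]


# A pygls LanguageServer would publish these per open catalog file, e.g.:
# server.publish_diagnostics(document_uri, diagnostics_for_type(3, "dummy.Dataset", False))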

@noklam (Contributor) commented Nov 27, 2024

#159 is already looking good!

@astrojuanlu astrojuanlu moved this from In Progress to In Review in Kedro-Viz Nov 29, 2024
@SajidAlamQB SajidAlamQB moved this from In Review to Todo in Kedro-Viz Jan 9, 2025
@rashidakanchwala rashidakanchwala moved this from Todo to In Review in Kedro-Viz Jan 13, 2025