
[Spike] - Explore how to validate dataset classes in catalog using LSP #5

Open
Tracked by #58
astrojuanlu opened this issue May 9, 2024 · 10 comments · May be fixed by #159

@astrojuanlu (Member)

Spotted this with kedro_mlflow.io.models.MlflowModelTrackingDataset from https://github.com/Galileo-Galilei/kedro-mlflow

[screenshot of the validation error]

Not sure if this should be fixed on the extension or on the schema.

@noklam (Contributor) commented May 9, 2024

Good question! The schema was originally a local setting where you could point to a file (and thus edit the file to include new datasets). This is no longer possible. Ideally I would like to version it against kedro-datasets, but it's not possible to ship different versions of the schema with the extension.

I would love to access the API of https://github.com/redhat-developer/vscode-yaml so I could potentially support these in a flexible way; unfortunately it's not well documented, so I didn't pursue it.

The current solution relies on Red Hat's YAML extension. There are two other options:

  1. Use VS Code's native grammar-based validation - https://github.com/github/vscode-github-actions/blob/main/language/syntaxes/expressions.tmGrammar.json . An example is the GitHub Actions extension, which seems to give a much better editing experience when I tried it. This requires understanding how the grammar works and writing lots of regex. Maybe most of it can be copied and pasted, but I am not sure at this point.
  2. LSP - we can explore options where the schema validation happens dynamically and the LSP gives a response directly. We may need a "validation" method on the datasets, or some kind of Pydantic-like model validation, and rely on its error messages (a rough sketch follows below).
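
As a very rough illustration of option 2 (not the actual design): a hand-written Pydantic model for a single dataset type could surface the kind of error message the LSP would relay. In practice such models would have to be derived from the dataset classes rather than maintained by hand, and CSVDatasetConfig below is purely hypothetical:

from typing import Optional

from pydantic import BaseModel, ValidationError


class CSVDatasetConfig(BaseModel):
    # Hypothetical model mirroring pandas.CSVDataset's main arguments.
    type: str
    filepath: str
    load_args: Optional[dict] = None
    save_args: Optional[dict] = None


try:
    CSVDatasetConfig(type="pandas.CSVDataset")  # filepath is missing
except ValidationError as err:
    print(err)  # an LSP could surface this message as a diagnostic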

@khub41 commented May 13, 2024

Having this problem too when using custom datasets from another package.
[screenshot]

@michal-mmm

The problem also applies to YAML anchors/aliases:

_spark_parquet: &spark_parquet
  type: spark.SparkDataset
  save_args:
    mode: overwrite
[screenshot of the validation warning on the anchor entry]
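
For context, such an anchor is typically consumed elsewhere in the catalog via a merge key, along the lines of the hypothetical entry below; Kedro resolves the alias when loading the YAML, but the schema only sees the raw text:

intermediate_data:
  <<: *spark_parquet
  filepath: data/02_intermediate/intermediate_data.parquet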

@astrojuanlu (Member, Author)

Thanks @michal-mmm, I can reproduce.

[screenshot]

Entries starting with an underscore aren't treated as datasets in general (I couldn't find this documented, @noklam?)

@noklam (Contributor) commented Sep 6, 2024

@michal-mmm Sorry for the late reply; indeed, the schema doesn't understand YAML anchors/templates.

@noklam (Contributor) commented Sep 6, 2024

I just got off a call with a client using Kedro and Databricks, and he gave some feedback regarding the Kedro extension in VS Code 😄
He mentioned they have a custom dataset that does not get picked up by the VS Code extension, which throws an error saying the dataset class does not exist.

Some more evidence suggests that this may need to be handled sooner rather than later.

@noklam (Contributor) commented Sep 6, 2024

I anticipated this would become a problem, mainly because of two issues:

  • The JSON schema is fixed; it's not up to date and does not understand user custom datasets (or plugin datasets).
    • In theory users can update the schema themselves, but it's not trivial and the UX is awful.

I thought we could at least disable this via the experimental flag, but I was wrong. Because the extension is limited by the Red Hat YAML extension, right now it works by registering a special JSON file. If a user wants to change the schema, they need to get a local copy of the schema and add the new datasets they want.

This is obviously not ideal, so there are better approaches discussed in this comment:

  1. Using TextMate grammars, which are supported natively by VS Code - https://code.visualstudio.com/api/language-extensions/syntax-highlight-guide#tokenization
  2. LSP (needs to execute code)

Using 1. can give us more flexibility, but it mostly validates tokens, so I don't think it can solve all the problems. For example, type: pandas.CSVDataset and type: dummy.Dataset both look valid from a syntax point of view.
This suggests that if we want to solve most of the problems, we may need to rely on some runtime checking. Consider a dataset as follows:

my_dataset:
  type: my_dataset.MyDataset

It makes more sense to have the dataset act almost like a Pydantic validator that checks whether the arguments are valid. This is also easier to maintain because we don't have to hardcode any rules; it should just use the class arguments directly (we may need additional work to throw some nice error messages, etc.). This will also support different dataset versions & custom datasets, because the validation will use the user's environment. (This assumes the dependencies are installed, of course.)
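
A minimal sketch of that runtime check, assuming a hypothetical validate_dataset_entry helper; the type resolution below only roughly mirrors how Kedro loads dataset classes, and the real spike would need to reuse Kedro's own logic:

import importlib
import inspect


def validate_dataset_entry(name: str, config: dict) -> list[str]:
    """Return human-readable problems for a single catalog entry."""
    type_path = config.get("type", "")
    module_name, _, class_name = type_path.rpartition(".")
    if not module_name or not class_name:
        return [f"{name}: 'type' should look like 'module.ClassName', got '{type_path}'"]
    try:
        # Roughly mirrors resolving short names such as "pandas.CSVDataset".
        module = importlib.import_module(f"kedro_datasets.{module_name}")
    except ImportError:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            return [f"{name}: cannot import a module for type '{type_path}'"]
    cls = getattr(module, class_name, None)
    if cls is None:
        return [f"{name}: class '{class_name}' not found in '{module.__name__}'"]
    # Compare the remaining keys against the dataset's __init__ signature.
    params = inspect.signature(cls.__init__).parameters
    accepts_kwargs = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())
    return [
        f"{name}: unexpected argument '{key}' for {type_path}"
        for key in config
        if key != "type" and key not in params and not accepts_kwargs
    ]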

Going for 2. seems most reasonable to me; it will take some more effort and design up front, but I think it's time to do a spike.

@noklam noklam added the bug Something isn't working label Sep 23, 2024
@noklam noklam moved this to Todo in Kedro VS Code Sep 25, 2024
@noklam (Contributor) commented Nov 4, 2024

Related discussion: kedro-org/kedro#4196
Related ticket: kedro-org/kedro#4258

@noklam noklam changed the title Extension complains about dataset not from kedro-datasets [Spike] - Extension complains about dataset not from kedro-datasets Nov 4, 2024
@astrojuanlu astrojuanlu changed the title [Spike] - Extension complains about dataset not from kedro-datasets [Spike] - Explore how to validate dataset classes in catalog using LSP Nov 4, 2024
@astrojuanlu (Member, Author)

Even if we update our schema, that will never fix the red underline with custom datasets.

On the other hand, even if the YAML file contains a dataset that's defined in the schema, the pipeline might still not be able to run if said dataset is not installed.

We agreed to try to check for "dataset can be used" rather than just "catalog.yml syntax is correct".

For this, most likely we'll need the LSP.
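
A minimal sketch of how that check could be surfaced through the LSP, assuming a pygls/lsprotocol-based server; the diagnostics_for_type helper and the hard-coded range are illustrative only, not the actual extension code:

from lsprotocol.types import Diagnostic, DiagnosticSeverity, Position, Range


def diagnostics_for_type(line: int, type_path: str, importable: bool) -> list[Diagnostic]:
    """Build a warning for a `type:` line whose class cannot be used in this environment."""
    if importable:
        return []
    return [
        Diagnostic(
            range=Range(
                start=Position(line=line, character=0),
                end=Position(line=line, character=len("type: " + type_path)),
            ),
            message=f"Dataset class '{type_path}' cannot be imported in the current environment",
            severity=DiagnosticSeverity.Warning,
        )
    ]


# A pygls LanguageServer would publish these per open catalog file, e.g.:
# server.publish_diagnostics(document_uri, diagnostics_for_type(3, "dummy.Dataset", False))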

@noklam (Contributor) commented Nov 27, 2024

#159 is already looking good!

@astrojuanlu astrojuanlu moved this from In Progress to In Review in Kedro-Viz Nov 29, 2024
@SajidAlamQB SajidAlamQB moved this from In Review to Todo in Kedro-Viz Jan 9, 2025
@rashidakanchwala rashidakanchwala moved this from Todo to In Review in Kedro-Viz Jan 13, 2025