-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unpack dictionary parameters #3905
base: develop
Are you sure you want to change the base?
Conversation
Thanks for this PR @bpmeek! We'll review it shortly 🙏🏼 |
kedro/pipeline/node.py
Outdated
if _input.startswith("**"): | ||
use_new = True | ||
dict_root = _input.split(":")[-1] | ||
_func_arguments = [arg for arg in inspect.signature(func).parameters] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there is a soft assumption that the dictionary keys match the function arguments. Do we want to throw a warning or error if:
(a) No arguments match
(b) Partial arguments match
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that is an assumption, I suppose we could also implement tuple unpacking if a user wanted to use index instead of key-value pairs.
The current behavior of no match/partial match is consistent I think with what users would expect. If you pass the dictionary model_options
and the function expects a key features
and that key is missing you would get the following error before the run.
ValueError: Pipeline input(s) {'params:model_options.features'} not found in the DataCatalog
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true! interested to see if others have an opinion though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would vote for adding at least basic validation because there are definitely cases in which this method will not work correctly, for example, when having a **kwargs
—only node function. Then, we could also extend tests with negative samples so it's clear which cases we cover and which we do not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused about how we would implement any validation at this point in the process, this function is called when the Node object is being initialized and I don't believe the catalog is available to verify against is it?
there are definitely cases in which this method will not work correctly, for example, when having a
**kwargs
—only node function.
Can you help me understand this? if your function was something like def foo(float_1, integer_2, string_3):
you could pass **params:dictionary
and it would work. If your function was def foo(**kwargs)
then this implementation wouldn't work but I wasn't under the impression that was what was being asked for, because you could just change it to def foo(kwargs)
and then access the values normally from there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I don't mean that this is the expected usage example and we aim to make it work. But it doesn't mean that someone cannot make that (or something else that breaks the current unpacking method) by mistake if syntax allows. There might be some other cases as well. To simplify the validation, we can follow the approach proposed above - check no/partial argument match and add some possible incorrect test cases to ensure the validation catches them. At the same time extend the docs with the expected usage example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the main outstanding discussion right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR!
Left some comments/suggestions regarding the implementation. My main push is to clarify which cases we cover by adding basic validation, negative tests and small usage example here
kedro/pipeline/node.py
Outdated
if _input.startswith("**"): | ||
use_new = True | ||
dict_root = _input.split(":")[-1] | ||
_func_arguments = [arg for arg in inspect.signature(func).parameters] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be outside of the loop since func
remains the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense and went ahead and made the change, but I'm not sure which would be more efficient, creating a list of every node's function or making the list multiple times if the input is a dictionary.
Returns: | ||
Either original inputs if no input was unpacked or a list of inputs if an input was unpacked. | ||
""" | ||
use_new = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest renaming it to params_unpacked
or similar for clarity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we please rename it?
use_new = True | ||
dict_root = _input.split(":")[-1] | ||
_func_arguments = [arg for arg in inspect.signature(func).parameters] | ||
for param in _func_arguments[idx:]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of slicing the list, we can just use idx
to start looping from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what you're recommending.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for param_idx in range(idx, len(_func_arguments)):
new_inputs.append(f"params:{dict_root}.{_func_arguments[param_idx]}")
kedro/pipeline/node.py
Outdated
if _input.startswith("**"): | ||
use_new = True | ||
dict_root = _input.split(":")[-1] | ||
_func_arguments = [arg for arg in inspect.signature(func).parameters] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would vote for adding at least basic validation because there are definitely cases in which this method will not work correctly, for example, when having a **kwargs
—only node function. Then, we could also extend tests with negative samples so it's clear which cases we cover and which we do not.
@bpmeek Do you want to continue this PR? I am eager to see this being merged finally. |
@noklam, I can continue it, I think I'm just stuck right now on the best way to implement checking for partial matches, since the current process doesn't have access to the catalog when it's populating the required datasets and the verification doesn't know what dictionaries have been unpacked. |
Just quick thought this can be done at the node level. There are some node function argument parsing logic done in the %load_node feature already. The key here seems to be when should this validation step happen. I will put some more thoughts on this tmr. |
Changed name of function, Added checks for dictionary unpacking in function Added negative test for dictionary unpacking in function Signed-off-by: bpmeek <[email protected]>
Are the additional checks/tests here sufficient or is there still more to be done? |
Thanks for your patience @bpmeek! Let us look into this again. |
I have a long list of PRs so I will only have time to look at this next week earliest. |
Originally posted by @astrojuanlu in kedro-org/kedro-viz#1933 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry some comments on the validation required were not clear. I think to move forward with this PR we need to:
- Address comments regarding the implementation;
- Update the implementation description and clarify what cases are supported and what are not, so it was clear for the reviewers whether something is skipped intentionally;
- Add a validation based on 2. and make sure the rest cases are failing gracefully;
- Extend documentation with the example of added interface: https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#nodes
Thank you @bpmeek for the solution! Please let us know if you want to continue with it or need some help from our side.
@@ -251,6 +255,11 @@ def _map_transcode_base(name: str) -> str: | |||
base_name, transcode_suffix = _transcode_split(name) | |||
return TRANSCODING_SEPARATOR.join((mapping[base_name], transcode_suffix)) | |||
|
|||
def _matches_unpackable(name: str) -> bool: | |||
param_base = name.split(".")[0] | |||
matches = [True for key, value in mapping.items() if f"**{param_base}" in key] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
matches = [True for key, value in mapping.items() if f"**{param_base}" in key] | |
matches = [True for arg_name in mapping if f"**{param_base}" in arg_name] |
use_new = True | ||
dict_root = _input.split(":")[-1] | ||
_func_arguments = [arg for arg in inspect.signature(func).parameters] | ||
for param in _func_arguments[idx:]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for param_idx in range(idx, len(_func_arguments)):
new_inputs.append(f"params:{dict_root}.{_func_arguments[param_idx]}")
Description
In response to this ticket
Development notes
Can now unpack dictionaries in the
inputs
argument of a node, example below._unpack_params added to
kedro/kedro/pipeline/node.py
, this function iterates over node inputs and updatesnode.inputs
if needed.note: It is assumed that the dictionary entries and kwarg have the same name
Changes:
kedro/kedro/pipeline/modular_pipeline.py
was updated to not throw errors when mapping dataset names._is_single_parameter
returns true if name starts with**params:
_normalize_param_name
does not appendparams:
if the name already begins with**params:
_validate_datasets_exist
removes datasets that begin with**
fromnon_existent
, I have confirmed thatkedro/runner/runner.py
will catch these missing datasets.Undesirable behavior
Currently modular pipelines will still namespace the unpacked parameters, I'll need assistance with either updating the rules list or adjusting therename
function in a way that makes sense.EDIT: This is no longer applicable after commit
Examples:
nodes.py
pipeline.py
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance.If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file