Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AppWrapper integration #3953

Merged
merged 16 commits into from
Jan 15, 2025
Merged

Conversation

dgrove-oss
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds an AppWrapper integration to Kueue.

During the initial development of a Kueue-compatible AppWrapper, providing Kueue support via the external framework mechanism provided maximal flexibility. AppWrappers have reached a stable 1.0 release and we are interested in pursuing multi-Kueue support for AppWrappers (which is not supported for external frameworks). Additionally, the fault tolerance capabilities of AppWrapper and its ability to (a) group multiple resources into a single workload and (b) swiftly Kueue-enable other Kinds that are built on PodSpecTemplate may be of interest to the Kueue community. Therefore, we would like to contribute and maintain an AppWrapper integration for Kueue as a full-fledged built-in integration.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

I think TAS support is straightforward; will look at adding that (and related integration tests) next.

I think it makes sense to add AppWrapper to the e2e tests, especially as it gives us a way to test multi-level parent/child relationships. I didn't do that in this PR to keep it from getting too large. I would plan to do that in a subsequent PR.

Multi-Kueue support would also be a follow up PR. I think it is doable before 0.11, but we don't have much direct experience with multi-kueue since it hasn't been possible to do as an external framework.

Does this PR introduce a user-facing change?

Add an integration for AppWrappers to Kueue.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 9, 2025
Copy link

netlify bot commented Jan 9, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit a9a1d51
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/6786c0ab88ef1c00085dd562

@kannon92
Copy link
Contributor

I think it makes sense to add AppWrapper to the e2e tests, especially as it gives us a way to test multi-level parent/child relationships. I didn't do that in this PR to keep it from getting too large. I would plan to do that in a subsequent PR.

I lean towards adding at least a simple e2e test as part of this PR to make sure it works. But I'll leave that to the maintainers if it is required.

@mimowo
Copy link
Contributor

mimowo commented Jan 13, 2025

Thank you for the PR! I'm ok with including the integration as a built-in for Kueue as it nicely fits into our needs to support mixing training and inference.

Another good thing is that it provides us a way of deploying Deployments and StatefulSets via MultiKueue before the Pod Integration is supported for MultiKueue (#3802). Otherwise the support would be unlikely for 0.11.

My only concern is the potentially growing pool of integrations moved into Kueue due to the lack of support for external Jobs in MultiKueue (#2349). To prevent that I would suggest to prioritize it @mwielgus @mwysokin. In any case, it is unlikely to happen for 0.11.

I lean towards adding at least a simple e2e test as part of this PR to make sure it works. But I'll leave that to the maintainers if it is required.

I would also like to have an e2e test - could be this PR or a follow up if this one was manually tested.

BTW, it TAS supported? I believe not, because I see that SetPodSetInfos does not copy the SchedulingGates which needed. Do you plan to fill in the gap?

func (aw *AppWrapper) PodSets() []kueue.PodSet {
podSets, err := awutils.GetPodSets((*awv1beta2.AppWrapper)(aw))
if err != nil {
// Kueue will raise an error on zero length PodSet; the Kueue GenericJob API prevents propagating the actual error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you indicate where the actual error will be raised?

I would prefer to avoid silencing errors, can we update the PodSets interface, could be a follow up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can definitely do a followup to add an error return to PodSets().

The theoretical possibility of an error return from awutils.GetPodSets() arises because if there isn't a cached version of the PodSets in the AppWrapper's status field, GetPodSets has to call unstructured.UnstructuredJSONScheme.Decode to process theruntime.RawExtension that represents the wrapped objects to let it build the PodSets. The validating web hook does this exact same decoding during AppWrapper creation and the update web hook enforces that the field holding the runtime.RawExtension is immutable. So an error happening here seems extremely unlikely, but totally agree that if it does happen that obscuring it makes it even harder to figure out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, in that case I would suggest: to add a log in this PR and open a TODO(#....) issue for the follow up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Opened #3969 and added a TODO and log message for now.

@@ -52,6 +53,7 @@ require (
github.com/coreos/go-semver v0.3.1 // indirect
github.com/coreos/go-systemd/v22 v22.5.0 // indirect
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
github.com/distribution/reference v0.5.0 // indirect
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you briefly summarize what the dependecy is used for (same for the one below). I would suggest you to consider using dependabot so that the dependencies are not getting outdated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it constructs the PodSetInfos from the runtime.RawExtension the AppWrapper controller has to replicate the processing of default values that the apiServer would apply to PodSpecsTemplates if they were created "normally.". The github.com/distribution/reference is used here to figure out if the image has a latest tag, which in turn is used to default the imagePullPolicy.

github.com/opencontainers/go-digest is pulled in as dependency by github.com/distribution/reference.

If we changed the equality comparison for PodSpecs to be less stringent (#3103), then we probably could avoid the need to do this defaulting, which would then eliminate the dependencies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining. I'm ok with the dependency for now.

@mimowo
Copy link
Contributor

mimowo commented Jan 13, 2025

I went through the code and lgtm. Comments which can be addressed in follow ups:

  • e2e test for the integration
  • avoid silencing errors
  • consider using dependabot for the project to make sure the transitive dependencies are not outdated

@dgrove-oss
Copy link
Contributor Author

  • consider using dependabot for the project to make sure the transitive dependencies are not outdated

Thanks for pointing this out; I'd missed that dependbot hadn't been fully enabled by the settings for https://github.com/project-codeflare/appwrapper. It is now.

@dgrove-oss
Copy link
Contributor Author

For the e2e tests, I'm missing something that's going to be obvious in retrospect..

where is the logic that installs the desired version of the external controllers (training-operator, jobset controller, etc) on the test kind clusters? I'd need to extend to install the appwrapper controller.

@dgrove-oss
Copy link
Contributor Author

dgrove-oss commented Jan 13, 2025

BTW, it TAS supported? I believe not, because I see that SetPodSetInfos does not copy the SchedulingGates which needed. Do you plan to fill in the gap?

AppWrapper had been sitting at Kueue 0.8 (which didn't have the SchedulingGates) due to our need to run on OpenShift. But #3908 enables Kueue 0.10 to run on OpenShift, so we were able to move AppWrapper forward. I'll be looking at adding in SchedulingGate support this week, with a plan of making an AppWrapper point release once we have it. Then we can update the AppWrapper version dependency for Kueue and enable TAS support in the AppWrapper integration.

@dgrove-oss
Copy link
Contributor Author

dgrove-oss commented Jan 13, 2025

For the e2e tests, I'm missing something that's going to be obvious in retrospect..

where is the logic that installs the desired version of the external controllers (training-operator, jobset controller, etc) on the test kind clusters? I'd need to extend to install the appwrapper controller.

Found it.... and it was obvious in retrospect (makefile-test + e2e-common.sh)

@dgrove-oss
Copy link
Contributor Author

I added a simple e2e test of an AppWrapper around a batch.Job that checks the basic functionality of the integration (AppWrapepr is admitted, Job is created and runs successfully, AppWrapper enters a completed state). Test runs in 15ish seconds, so that seems like a reasonable start.

Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Parallelism(int32(numPods)).
Completions(int32(numPods)).
Suspend(false).
Image(util.E2eTestSleepImage, []string{"10s"}).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to make it end immediately like after 10ms. Is the only issue that you want to assert that Appswrapper is running? I think from the integration PoV he most important thing is that it gets scheduled and completes successfully.

I would expect the creation of pods etc to be tested outside of Kueue.

Maybe we can add that asserts when proven to be important based on specific issues.

Also, if we want to assert on the running Appswrapper and the pods we could have another test with longer timeout but after the asserts just kill the pods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I'll revise as suggested later today.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test simplified as requested. Runs a 10ms batch.Job. Only checks for AppWrapper (a) being unsuspended and (b) completing successfully.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm

@dgrove-oss
Copy link
Contributor Author

/test pull-kueue-verify-main

@mimowo
Copy link
Contributor

mimowo commented Jan 15, 2025

/lgtm
/approve

Awesome, thanks for the contribution! I think it will de-facto enable running custom Jobs on MultiKueue wrapped with the AppWrapper. As a follow up I would like yet to see the integration of AppWrapper with MultiKueue, based on managedBy with e2e tests for that.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 444f0c64bd562541dc5542f0fb25175de1d2c0cc

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrove-oss, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 15, 2025
@k8s-ci-robot k8s-ci-robot merged commit d71331b into kubernetes-sigs:main Jan 15, 2025
17 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.11 milestone Jan 15, 2025
@dgrove-oss dgrove-oss deleted the appwrapper branch January 15, 2025 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants