-
Notifications
You must be signed in to change notification settings - Fork 279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AppWrapper integration #3953
AppWrapper integration #3953
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
I lean towards adding at least a simple e2e test as part of this PR to make sure it works. But I'll leave that to the maintainers if it is required. |
Thank you for the PR! I'm ok with including the integration as a built-in for Kueue as it nicely fits into our needs to support mixing training and inference. Another good thing is that it provides us a way of deploying Deployments and StatefulSets via MultiKueue before the Pod Integration is supported for MultiKueue (#3802). Otherwise the support would be unlikely for 0.11. My only concern is the potentially growing pool of integrations moved into Kueue due to the lack of support for external Jobs in MultiKueue (#2349). To prevent that I would suggest to prioritize it @mwielgus @mwysokin. In any case, it is unlikely to happen for 0.11.
I would also like to have an e2e test - could be this PR or a follow up if this one was manually tested. BTW, it TAS supported? I believe not, because I see that |
func (aw *AppWrapper) PodSets() []kueue.PodSet { | ||
podSets, err := awutils.GetPodSets((*awv1beta2.AppWrapper)(aw)) | ||
if err != nil { | ||
// Kueue will raise an error on zero length PodSet; the Kueue GenericJob API prevents propagating the actual error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you indicate where the actual error will be raised?
I would prefer to avoid silencing errors, can we update the PodSets
interface, could be a follow up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can definitely do a followup to add an error return to PodSets()
.
The theoretical possibility of an error return from awutils.GetPodSets()
arises because if there isn't a cached version of the PodSets in the AppWrapper's status field, GetPodSets
has to call unstructured.UnstructuredJSONScheme.Decode
to process theruntime.RawExtension
that represents the wrapped objects to let it build the PodSets. The validating web hook does this exact same decoding during AppWrapper creation and the update web hook enforces that the field holding the runtime.RawExtension
is immutable. So an error happening here seems extremely unlikely, but totally agree that if it does happen that obscuring it makes it even harder to figure out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, in that case I would suggest: to add a log in this PR and open a TODO(#....) issue for the follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Opened #3969 and added a TODO and log message for now.
@@ -52,6 +53,7 @@ require ( | |||
github.com/coreos/go-semver v0.3.1 // indirect | |||
github.com/coreos/go-systemd/v22 v22.5.0 // indirect | |||
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect | |||
github.com/distribution/reference v0.5.0 // indirect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you briefly summarize what the dependecy is used for (same for the one below). I would suggest you to consider using dependabot so that the dependencies are not getting outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When it constructs the PodSetInfos from the runtime.RawExtension
the AppWrapper controller has to replicate the processing of default values that the apiServer would apply to PodSpecsTemplates if they were created "normally.". The github.com/distribution/reference
is used here to figure out if the image has a latest
tag, which in turn is used to default the imagePullPolicy
.
github.com/opencontainers/go-digest
is pulled in as dependency by github.com/distribution/reference
.
If we changed the equality comparison for PodSpecs to be less stringent (#3103), then we probably could avoid the need to do this defaulting, which would then eliminate the dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining. I'm ok with the dependency for now.
I went through the code and lgtm. Comments which can be addressed in follow ups:
|
Thanks for pointing this out; I'd missed that dependbot hadn't been fully enabled by the settings for https://github.com/project-codeflare/appwrapper. It is now. |
For the e2e tests, I'm missing something that's going to be obvious in retrospect.. where is the logic that installs the desired version of the external controllers (training-operator, jobset controller, etc) on the test kind clusters? I'd need to extend to install the appwrapper controller. |
AppWrapper had been sitting at Kueue 0.8 (which didn't have the SchedulingGates) due to our need to run on OpenShift. But #3908 enables Kueue 0.10 to run on OpenShift, so we were able to move AppWrapper forward. I'll be looking at adding in SchedulingGate support this week, with a plan of making an AppWrapper point release once we have it. Then we can update the AppWrapper version dependency for Kueue and enable TAS support in the AppWrapper integration. |
b0e7fae
to
88bb3a2
Compare
Found it.... and it was obvious in retrospect (makefile-test + e2e-common.sh) |
977f758
to
fd21974
Compare
fd21974
to
8a7e500
Compare
I added a simple e2e test of an AppWrapper around a batch.Job that checks the basic functionality of the integration (AppWrapepr is admitted, Job is created and runs successfully, AppWrapper enters a completed state). Test runs in 15ish seconds, so that seems like a reasonable start. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Parallelism(int32(numPods)). | ||
Completions(int32(numPods)). | ||
Suspend(false). | ||
Image(util.E2eTestSleepImage, []string{"10s"}). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to make it end immediately like after 10ms. Is the only issue that you want to assert that Appswrapper is running? I think from the integration PoV he most important thing is that it gets scheduled and completes successfully.
I would expect the creation of pods etc to be tested outside of Kueue.
Maybe we can add that asserts when proven to be important based on specific issues.
Also, if we want to assert on the running Appswrapper and the pods we could have another test with longer timeout but after the asserts just kill the pods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I'll revise as suggested later today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test simplified as requested. Runs a 10ms batch.Job. Only checks for AppWrapper (a) being unsuspended and (b) completing successfully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
/test pull-kueue-verify-main |
/lgtm Awesome, thanks for the contribution! I think it will de-facto enable running custom Jobs on MultiKueue wrapped with the AppWrapper. As a follow up I would like yet to see the integration of AppWrapper with MultiKueue, based on managedBy with e2e tests for that. |
LGTM label has been added. Git tree hash: 444f0c64bd562541dc5542f0fb25175de1d2c0cc
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgrove-oss, mimowo The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds an AppWrapper integration to Kueue.
During the initial development of a Kueue-compatible AppWrapper, providing Kueue support via the external framework mechanism provided maximal flexibility. AppWrappers have reached a stable 1.0 release and we are interested in pursuing multi-Kueue support for AppWrappers (which is not supported for external frameworks). Additionally, the fault tolerance capabilities of AppWrapper and its ability to (a) group multiple resources into a single workload and (b) swiftly Kueue-enable other Kinds that are built on PodSpecTemplate may be of interest to the Kueue community. Therefore, we would like to contribute and maintain an AppWrapper integration for Kueue as a full-fledged built-in integration.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
I think TAS support is straightforward; will look at adding that (and related integration tests) next.
I think it makes sense to add AppWrapper to the e2e tests, especially as it gives us a way to test multi-level parent/child relationships. I didn't do that in this PR to keep it from getting too large. I would plan to do that in a subsequent PR.
Multi-Kueue support would also be a follow up PR. I think it is doable before 0.11, but we don't have much direct experience with multi-kueue since it hasn't been possible to do as an external framework.
Does this PR introduce a user-facing change?