
KEP-5027: DRA: admin-controlled device attributes #5034

Open
wants to merge 1 commit into base: master

Conversation

@pohly pohly commented Jan 10, 2025

/cc @johnbelamaric

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes label Jan 10, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the kind/kep, sig/node, and size/XL labels Jan 10, 2025
@k8s-ci-robot

@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-enhancements-verify | 531a905 | link | true | /test pull-enhancements-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@pohly pohly commented Jan 12, 2025

/cc @KobayashiD27

For the "device priority" use case.

/cc @byako

For device health.

@k8s-ci-robot k8s-ci-robot requested a review from byako January 12, 2025 12:31
@k8s-ci-robot
Copy link
Contributor

@pohly: GitHub didn't allow me to request PR reviews from the following users: KobayashiD27.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @KobayashiD27

For the "device priority" use case.

/cc @byako

For device health.


Capacity map[QualifiedName]DeviceCapacity
}

// AttributeNamePriority is a standardized attribute name. Its value must be an integer.
@pohly pohly Jan 12, 2025

Or this?

Suggested change:
- // AttributeNamePriority is a standardized attribute name. Its value must be an integer.
+ // AttributeNamePriority is an attribute name defined by Kubernetes. Its value must be an integer.

/cc @johnbelamaric

@pohly pohly commented Jan 13, 2025

/wg device-management
/sig node

@k8s-ci-robot k8s-ci-robot added the wg/device-management label Jan 13, 2025
The scheduler must merge these additional attributes with the ones provided by
the DRA drivers. The "kubernetes.io/offline" string attribute contains a
free-form explanation why the device is not currently available. Such a device
must be ignored by the scheduler. The "kubernetes.io/priority" integer defines
@pohly (Contributor Author)
@eero-t asked in #5027 (comment):

How should an admin run test workload(s) on a device for which scheduling has been disabled (e.g. for a firmware upgrade), to know whether it can be enabled again (for production workloads)?

With node taints, one would use a taint toleration for this, but I don't see from the KEP description how a similar thing would be achieved for DRA devices.

This is indeed not possible as described here. How about making it configurable whether an offline device is used?

The "normal" DeviceClass that users should pick for production workloads could have a selector which excludes offline devices.

Then there is a second DeviceClass which doesn't exclude them. There's nothing that would prevent users from using that, but if they do, they do at their own risk. This is on par with node taints.
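
A rough sketch of what such a pair of classes could look like, assuming the DeviceClass selector CEL environment from the DRA API; the class names, the driver name gpu.example.com, the API version, and the exact "attribute is absent" expression are illustrative, not part of this KEP:

apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu-production
spec:
  selectors:
  - cel:
      # Only devices of this driver that are not marked offline.
      expression: >-
        device.driver == "gpu.example.com" &&
        !("offline" in device.attributes["kubernetes.io"])
---
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu-maintenance
spec:
  selectors:
  - cel:
      # Also matches offline devices; workloads using this class do so at their own risk.
      expression: device.driver == "gpu.example.com"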

Copy link

@eero-t eero-t Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matching all offline devices is not enough, as there can be multiple reasons for being offline, e.g. health and administration => the selection would need to be specific to a given offline reason, and not match if there are other reasons.

(With taints, one could use e.g. a fw-upgrade taint and its toleration. While one could still taint whole nodes, that could be rather disruptive, whereas by offlining devices one by one, upgrades would cause only slight service degradation while they are being performed / tested / verified.)

@pohly pohly Jan 13, 2025

The admin can create a custom DeviceClass with a selector which matches exactly the reason they chose when taking the device offline. The ResourceSliceOverride then has kubernetes.io/offline: fw-upgrade and the workload's DeviceClass has device.attributes["kubernetes.io"].offline == "fw-upgrade".

But that doesn't cover the case where a manually created ResourceSliceOverride contains such a kubernetes.io/offline: fw-upgrade and another, automatically created one has kubernetes.io/offline: unhealthy. The admin can make sure that "its" value wins via resourcesliceoverride.spec.rank, but then the kubernetes.io/offline: unhealthy gets lost.

We could specify a different merging strategy for this well-known attribute: instead of keeping exactly one entry, the different instances could be numbered, leading to kubernetes.io/offline: fw-upgrade; kubernetes.io/offline-1: unhealthy. The CEL expressions become a bit more complex, but it would work.

Yet another alternative is to extend the CEL environment so that device.attributes["kubernetes.io"].offline is a list of strings. This might be better than the attribute-name numbering "hack".
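
A rough sketch of that scenario, assuming a ResourceSliceOverride shape along the lines of this KEP; apart from spec.rank, which is mentioned above, the field names, API versions, and the device selection mechanism are guesses for illustration only:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSliceOverride
metadata:
  name: fw-upgrade-gpu-0
spec:
  # Which device(s) this override applies to is omitted here.
  # Higher rank wins when several overrides set the same attribute.
  rank: 100
  attributes:
    kubernetes.io/offline: fw-upgrade
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSliceOverride
metadata:
  name: health-monitor-gpu-0
spec:
  rank: 10
  attributes:
    kubernetes.io/offline: unhealthy   # gets lost because the manual override ranks higher
---
# Custom DeviceClass for the admin's test workloads: only devices taken
# offline specifically for the firmware upgrade match.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu-fw-upgrade-test
spec:
  selectors:
  - cel:
      expression: device.attributes["kubernetes.io"].offline == "fw-upgrade"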

@everpeace everpeace Jan 14, 2025

Thanks for the KEP, and sorry to cut in on this discussion. I'm curious about the use case of exposing device health via ResourceSliceOverride.

For this use case, are there any ideas about how online/offline status changes could affect running workloads? It might be useful for users to have a device-level toleration for more flexible control.

Below is an imaginary spec; I know this is just a rough suggestion (it definitely needs deeper consideration):

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      # New field
      #   - It might be better to introduce taint information on the device side, too?
      #   - Should the toleration be defined on the ResourceClass side?
      tolerations:
      - cel:
          expression: 'device.attributes["kubernetes.io"].offline != ""'
        effect: NoExecute | NoSchedule
        tolerationSeconds: 30s # effective only with 'NoExecute'

@pohly (Contributor Author)

Reacting to offline on the node for a running workload is currently out-of-scope, but I can see how it would be useful to do something even if that means making the ResourceClaim API more complex. It also means that the kubelet needs to become aware of this because a controller cannot force containers to stop, can it?

We could start without it in 1.33, then add such an API in 1.34 (still as alpha!).

@pohly (Contributor Author)

Shall we state this in the KEP??

Something needs to be in the KEP about this: either that it is currently out-of-scope or how something works that is in scope. I'm still trying to find out what's possible 😅

node taints are controlled by "taint-eviction-controller" in "controller-manager"

I need to look into that. Such a solution would be much better than having the kubelet do it. We want to support new features with old kubelet versions (version skew) and in general keep the kubelet as dumb as possible.

@pohly (Contributor Author)

The relevant code in the taint-eviction controller is https://github.com/kubernetes/kubernetes/blob/2d0a4f75560154454682b193b42813159b20f284/pkg/controller/tainteviction/taint_eviction.go#L129-L147

As I suspected, there is no separate API which causes a pod to stop. It has to be deleted, which then causes the kubelet to stop containers and call the driver's NodeUnprepareResources. The downside (?) is that once it has stopped, the pod object is also going to get removed. But perhaps that's the behavior that we want?

@pohly (Contributor Author)

API design question: should apps explicitly have to opt into eviction?

In other words, if this new tolerations field is empty, should the default behavior be this:

      tolerations:
      - cel:
          expression: 'device.attributes["kubernetes.io"].offline != ""'
        effect: NoSchedule

My preference is to not actually make that a default in the apiserver, i.e. if the field is empty, it remains empty when stored. The benefit is that if the default changes, the new implicit default will be used for existing objects instead of being forced to use the old default because it is in the stored object and might have been chosen by the user explicitly.

Another API design aspect: being able to format a message with CEL in case that the toleration triggers seems useful.
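
For illustration only, such a toleration could look roughly like the imaginary spec above with one extra field; the message field and its semantics are purely hypothetical and not proposed anywhere in this KEP:

      tolerations:
      - cel:
          expression: 'device.attributes["kubernetes.io"].offline != ""'
        effect: NoExecute
        tolerationSeconds: 30s
        # Hypothetical: a CEL expression that formats a human-readable explanation
        # in case the toleration triggers.
        message: '"device went offline: " + device.attributes["kubernetes.io"].offline'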

@everpeace everpeace Jan 15, 2025

It has to be deleted, which then causes kubelet to stop containers and call the driver's NodeUnprepareResources. The downside (?) is that once it has stopped, the pod object is also going to get removed. But perhaps that's the behavior that we want?

Although I'm not that familiar with the ResourceClaim lifecycle, I imagined that when a scheduled pod with some ResourceClaims is deleted, the devices bound to the ResourceClaims are also unbound. If so, that behavior feels natural to me.

API design question: should apps explicitly have to opt into eviction?

I originally supposed:

  • device.attributes["kubernetes.io"].offline defines the offline status, i.e. a device with device.attributes["kubernetes.io"].offline = "some-reason" is offline for that reason.
  • And then, the default behavior with offline devices (i.e. in case no tolerations are defined in the ResourceClaim) might be:
    • no schedule: a device request cannot allocate offline devices
    • no execution: pods with offline devices will be evicted (deleted) when detected
  • Tolerations in a ResourceClaim might work like this:
    • cel:
        expression: device.attributes["kubernetes.io"].offline != 'some-reason'
      effect: NoSchedule
      --> a device request with this toleration can allocate an offline device only when the reason was 'some-reason'.
    • cel:
        expression: device.attributes["kubernetes.io"].offline != 'some-reason'
      effect: NoExecute
      --> a device allocation with this toleration can keep running only when the offline reason was 'some-reason'.

These supposed behaviors depend on the device.attributes["kubernetes.io"].offline convention (i.e. all DRA drivers need to follow the convention of declaring unhealthiness and its reason in that attribute).

If we'd like to give DRA driver designers more flexible ways to define their devices' unusual (or abnormal) conditions, we might want to introduce standard taint/toleration semantics in ResourceSlice/ResourceClaim, as we have for Node/Pod, like below. Yeah, I know this might be too complicated 😿...

kind: ResourceSlice
spec:
  driver: gpu.nvidia.com
  devices:
  - name: xxx-xxx
    attributes:
      ...
    # These taints are expected to be reported by the DRA driver.
    # Perhaps taints could also be declared in a ResourceSliceOverride, too.
    taints:
    - key: offline
      value: maintenance
      effect: NoSchedule
    - key: offline
      value: maintenance
      effect: NoExecute
---
kind: ResourceClaim
spec:
  requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      # This device request and the resultant allocation
      # can only tolerate the 'offline=maintenance:NoExecute' taint.
      # This means:
      # - a device with the 'offline=maintenance:NoSchedule' taint won't be allocated to this request (it can't tolerate that taint)
      # - but this allocation can tolerate (keep running) when the device has the 'offline=maintenance:NoExecute' taint
      tolerations:
      - key: offline
        value: maintenance
        effect: NoExecute

@pohly (Contributor Author)

Although I'm not that familiar with the ResourceClaim lifecycle, I imagined that when a scheduled pod with some ResourceClaims is deleted, the devices bound to the ResourceClaims are also unbound.

That's correct. But devices also get deallocated when their consuming pod(s) are in a known final state where they won't run any containers anymore, so it's not necessary to fully remove a pod to reuse devices. The advantage would be that one can still retrieve logs or inspect the pod object to determine what it did (exit code, termination message). I just don't see how an external controller can force a pod into that state, so we would have to go the same route as node tainting.

API design question: should apps explicitly have to opt into eviction?
I originally supposed: [...]

I think we can define device.attributes["kubernetes.io"].offline != "" as a check that, if true, means that the device cannot and/or should not be used. With that definition, not scheduling and evicting running pods seem like the right default behavior if a ResourceClaim doesn't list tolerations.
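
To make that concrete with the firmware-upgrade example from above: under such a default, an admin's test workload would have to opt in explicitly. A rough sketch, reusing the imaginary tolerations shape from the comments above; the exact semantics of a CEL expression in a toleration are still open, this just assumes "the expression matches the condition that is tolerated":

kind: ResourceClaim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu-maintenance   # e.g. a class that does not exclude offline devices
      # Without tolerations: offline devices are neither allocated nor kept in use (proposed default).
      # With this toleration, the claim accepts devices taken offline for the firmware upgrade.
      tolerations:
      - cel:
          expression: 'device.attributes["kubernetes.io"].offline == "fw-upgrade"'
        effect: NoSchedule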

Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory.)
sig/node (Categorizes an issue or PR as relevant to SIG Node.)
size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)
wg/device-management (Categorizes an issue or PR as relevant to WG Device Management.)
4 participants