CA DRA: support DaemonSet/static pods using DRA #7684

Open
towca opened this issue Jan 9, 2025 · 0 comments
Labels
area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. kind/feature Categorizes issue or PR as related to a new feature. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Comments

@towca
Collaborator

towca commented Jan 9, 2025

Which component are you using?:

/area cluster-autoscaler
/area core-autoscaler
/wg device-management

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

To ensure correct scheduling simulations, Cluster Autoscaler has to predict the exact set of DaemonSet/static pods that will be present on a new Node created from a given NodeGroup. This can be done in two major ways:

  • If there are existing Nodes in the NodeGroup, they can be used as a starting point for figuring out what a new Node will look like. One such existing Node is picked randomly, copied, and sanitized into an empty, fresh Node. During the sanitization, all non-DS/static Pods are removed, and the remaining Pods are modified so they aren't identical to the originals (the name is changed, the UID is randomized, the new nodeName is set, etc.). A sketch of this step follows the list.
  • If there are no existing Nodes in the NodeGroup, the CloudProvider is responsible for providing a template NodeInfo with the expected Node and Pods. That NodeInfo is then sanitized in the same way.
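
As an illustration, here is a minimal sketch of the copy-and-sanitize step in Go. The helper names and the suffix-based renaming are assumptions made for this sketch, not the actual Cluster Autoscaler code:

```go
// Sketch only: the helpers below are hypothetical, not CA's real functions.
package sketch

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/uuid"
)

// sanitizeNode copies an existing Node into a fresh template Node with a new
// name and a random UID, so simulations never mutate the original object.
func sanitizeNode(node *apiv1.Node, suffix string) *apiv1.Node {
	fresh := node.DeepCopy()
	fresh.Name = fmt.Sprintf("%s-template-%s", node.Name, suffix)
	fresh.UID = uuid.NewUUID()
	return fresh
}

// sanitizePod copies a DS/static Pod and rewires the copy onto the fresh Node.
func sanitizePod(pod *apiv1.Pod, freshNodeName, suffix string) *apiv1.Pod {
	fresh := pod.DeepCopy()
	fresh.Name = fmt.Sprintf("%s-%s", pod.Name, suffix)
	fresh.UID = uuid.NewUUID()
	fresh.Spec.NodeName = freshNodeName
	return fresh
}
```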

When the sanitized Node/Pods use DRA, we have to sanitize the relevant DRA objects as well.

Sanitizing/duplicating ResourceSlices is easy. When we duplicate a Node, we also duplicate all ResourceSlices local to that Node (on the assumption that creating a new Node will create new Node-local ResourceSlices). When sanitizing the slices, in addition to changing the usual Name, UID, and NodeName, we change the names of the listed Device Pools, since Pool names have to be unique within a DRA driver and each Node has dedicated Pools for its local Devices.
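
A rough sketch of that slice duplication, assuming the resource.k8s.io/v1beta1 API and a simple suffix-based Pool renaming scheme (the helper name and the returned rename map are assumptions of this sketch):

```go
// Sketch only: assumes resource.k8s.io/v1beta1 and suffix-based renaming.
package sketch

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/util/uuid"
)

// sanitizeSlices duplicates the ResourceSlices local to oldNodeName and points
// the copies at freshNodeName. Each Device Pool is renamed so the copies don't
// collide with the originals, and the old->new Pool names are returned so that
// ResourceClaim allocations can be remapped later.
func sanitizeSlices(slices []*resourceapi.ResourceSlice, oldNodeName, freshNodeName, suffix string) ([]*resourceapi.ResourceSlice, map[string]string) {
	var fresh []*resourceapi.ResourceSlice
	poolRename := map[string]string{}
	for _, slice := range slices {
		if slice.Spec.NodeName != oldNodeName {
			continue // only Node-local slices get duplicated
		}
		sanitized := slice.DeepCopy()
		sanitized.Name = fmt.Sprintf("%s-%s", slice.Name, suffix)
		sanitized.UID = uuid.NewUUID()
		sanitized.Spec.NodeName = freshNodeName
		sanitized.Spec.Pool.Name = fmt.Sprintf("%s-%s", slice.Spec.Pool.Name, suffix)
		poolRename[slice.Spec.Pool.Name] = sanitized.Spec.Pool.Name
		fresh = append(fresh, sanitized)
	}
	return fresh, poolRename
}
```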

It's not immediately clear how to sanitize DS/static Pods that reference ResourceClaims. For the DRA autoscaling MVP, we went with the following logic (the owned-claim branch is sketched after the list):

  • ResourceClaims not owned by the sanitized Pod (i.e. "shared") are not sanitized/duplicated at all. We only add the new Pod to the ReservedFor field of the shared claim when the fresh NodeInfo is added to the ClusterSnapshot (the claim is stored in dynamicresources.Snapshot inside ClusterSnapshot).
  • ResourceClaims owned by the sanitized Pod are sanitized and duplicated (on the assumption that new claims will appear in the cluster together with the new Pod). When the fresh NodeInfo is added to the ClusterSnapshot, the new ResourceClaims are added to its dynamicresources.Snapshot.
    • If the claim is not allocated, the sanitization is simple - just change the Name, UID, and OwnerReferences.
    • If the claim is allocated, we can only sanitize/duplicate it if all the allocated Devices are Node-local on the Node that is being sanitized/duplicated. This is because creating a new Node in the cluster will only create new Node-local Devices. If we duplicated an allocated claim with a non-Node-local Device, we'd have the same Device allocated in multiple claims - which won't happen in reality. If the claim is indeed fully Node-local, we sanitize the allocations by changing the Device Pool names in the allocations to match the new Pool names in the sanitized ResourceSlices.
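
A sketch of the owned-claim branch described above, assuming resource.k8s.io/v1beta1 and a poolRename map (old Pool name to new Pool name) built while sanitizing the Node's ResourceSlices; the helper is hypothetical:

```go
// Sketch only: assumes v1beta1 and a poolRename map from slice sanitization.
package sketch

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/util/uuid"
)

// sanitizeOwnedClaim duplicates a claim owned by the sanitized Pod. It fails
// if the claim's allocation isn't fully Node-local, because a new Node only
// brings new Node-local Devices with it.
func sanitizeOwnedClaim(claim *resourceapi.ResourceClaim, poolRename map[string]string, suffix string) (*resourceapi.ResourceClaim, error) {
	sanitized := claim.DeepCopy()
	sanitized.Name = fmt.Sprintf("%s-%s", claim.Name, suffix)
	sanitized.UID = uuid.NewUUID()
	// OwnerReferences would also be rewritten here to point at the fresh Pod.
	if sanitized.Status.Allocation == nil {
		return sanitized, nil // unallocated claims need nothing more
	}
	for i, result := range sanitized.Status.Allocation.Devices.Results {
		newPool, ok := poolRename[result.Pool]
		if !ok {
			// The Device comes from a Pool that isn't local to the duplicated
			// Node; duplicating would allocate the same Device in two claims.
			return nil, fmt.Errorf("claim %s allocates non-Node-local device %s", claim.Name, result.Device)
		}
		sanitized.Status.Allocation.Devices.Results[i].Pool = newPool
	}
	return sanitized, nil
}
```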

The logic described above has at least the following caveats:

  • Autoscaling only works for NodeGroups where no DS/static pods use DRA, and for NodeGroups where all DS/static pods that do use DRA are guaranteed to only have Node-local Devices allocated.
  • The sanitization logic has a concept of "forcing missing DS pods" during sanitization. CA goes through all DS pods in the cluster, checks which ones should be running on the Node, and force-adds any that aren't present (e.g. because the Node is too small and some other Pod got scheduled first) to the sanitized Node. This logic doesn't take DRA into account in the MVP, so if the force-added Pods reference ResourceClaims, the simulation will break because the necessary claims aren't added as well.
  • In the case where there are no existing Nodes in the NodeGroup and CloudProvider.TemplateNodeInfo() has to be called, all ResourceClaim allocations in the returned NodeInfo have to be provided by CloudProvider. The CloudProvider has to run scheduler predicates internally to obtain the allocations, or have them precomputed somehow. Precomputing will probably be impossible/cumbersome for more complex cases.

We should figure out whether these limitations matter for DRA use-cases in practice. If they do, we need to remove them. If they don't, we could at least start validating against them.

Describe the solution you'd like.:

  • For each of the limitations listed above, figure out if removing it is important for DRA use-cases in practice.
  • If a given limitation is important for DRA autoscaling production readiness, eliminate it somehow.
    • Solving DS pods with non-Node-local claim allocations will be difficult.
    • We could fix forcing missing DS pods by adding the missing claims as well. We should be able to create the claims in-memory from the relevant ResourceClaimTemplate (see the sketch after this list), or perhaps sanitize them from a DS pod running on another Node.
    • We could make CloudProvider.TemplateNodeInfo() easier to implement by running scheduler predicates for Pods referencing claims after obtaining the NodeInfo from the CloudProvider. Then CloudProvider.TemplateNodeInfo() could just return unallocated claims, which should be straightforward.
  • If a given limitation is not important, ideally we'd detect it and reject it during validation at some stage. If that's not possible, at minimum we need to have it clearly documented.
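
For the "create the claims in-memory" idea above, a rough sketch of materializing an unallocated claim from a ResourceClaimTemplate (resource.k8s.io/v1beta1 assumed; the naming scheme is made up for this sketch, and the real claim controller does more bookkeeping):

```go
// Sketch only: loosely mirrors how a claim is stamped out from its template.
package sketch

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/uuid"
)

// claimFromTemplate builds an in-memory ResourceClaim for a force-added DS pod
// from the ResourceClaimTemplate referenced in the pod spec.
func claimFromTemplate(tmpl *resourceapi.ResourceClaimTemplate, pod *apiv1.Pod, claimRefName string) *resourceapi.ResourceClaim {
	return &resourceapi.ResourceClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-%s", pod.Name, claimRefName),
			Namespace: pod.Namespace,
			UID:       uuid.NewUUID(),
			Labels:    tmpl.Spec.ObjectMeta.Labels,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Pod",
				Name:       pod.Name,
				UID:        pod.UID,
			}},
		},
		// The template carries the full claim spec; the claim starts out
		// unallocated.
		Spec: tmpl.Spec.Spec,
	}
}
```

The resulting claim is unallocated, so the simulation would still need to run the scheduler's DRA allocation logic before considering the force-added Pod schedulable.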

Additional context.:

This is part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.

@towca towca added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Jan 9, 2025