Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot #7681

Open
towca opened this issue Jan 9, 2025 · 0 comments
Open

CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot #7681

towca opened this issue Jan 9, 2025 · 0 comments
Labels
area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. kind/feature Categorizes issue or PR as related to a new feature. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Comments

@towca
Copy link
Collaborator

towca commented Jan 9, 2025

Which component are you using?:

/area cluster-autoscaler
/area core-autoscaler
/wg device-management

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

There are 2 ClusterSnapshotStore implementations with very different performance characteristics:

  • BasicSnapshotStore is a very simple, reference implementation that clones the whole state during Fork(). It's easy to understand and can be used e.g. in tests, but the complexity of operations is not optimized for the typical usage patterns during a Cluster Autoscaler loop. Not really intended for production use because of this.
  • DeltaSnapshotStore is a much more complex implementation, that branches and keeps deltas separately for every Fork(). The complexity of operations is optimized for typical Cluster Autoscaler usage patterns. This is the de-facto production implementation.

In order for DRA autoscaling to work, a ClusterSnapshotStore implementation has to integrate with dynamicresources.Snapshot. This means correctly handling the DRA snapshot during Fork()/Commit()/Revert().

For DRA autoscaling MVP, only BasicSnapshotStore was integrated with dynamicresources.Snapshot. It's pretty trivial in this case - we just need dynamicresources.Snapshot.Clone(). This was enough to test the MVP, but for production use we need to integrate DeltaSnapshotStore as well.

Describe the solution you'd like.:

  • Add the ability to chain multiple dynamicresources.Snapshot objects, where each object represents a delta from the previous one.
  • The chain can be queried/modified in the same way as a single dynamicresources.Snapshot. Queries fall back the chain and are expensive, modifications are applied to the top object in the chain and are cheap.
  • The chain can be consolidated into a single dynamicresources.Snapshot.
  • DeltaSnapshotStore keeps a chain of dynamicresources.Snapshot objects, and modifies the chain in the same way as the NodeInfo storage chain during Fork()/Commit()/Revert().
  • The chain is probably easiest to implement by turning dynamicresources.Snapshot into an interface and introducing another "DeltaChain" implementation that uses the current one internally.

Describe any alternative solutions you've considered.:

IMO we should avoid having two completely separate Basic/Delta implementations for dynamicresources.Snapshot like we do for ClusterSnapshotStore. This pattern leads to duplicating large portions of non-trivial code and extending ClusterSnapshotStore is more painful than it should because of it. The Delta/Chain implementation should use the Basic one internally instead.

Additional context.:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.

@towca towca added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. kind/feature Categorizes issue or PR as related to a new feature. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
None yet
Development

No branches or pull requests

2 participants