CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot #7681

towca · 2025-01-09T14:45:33Z

Which component are you using?:

/area cluster-autoscaler
/area core-autoscaler
/wg device-management

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

There are 2 ClusterSnapshotStore implementations with very different performance characteristics:

BasicSnapshotStore is a very simple, reference implementation that clones the whole state during Fork(). It's easy to understand and can be used e.g. in tests, but the complexity of operations is not optimized for the typical usage patterns during a Cluster Autoscaler loop. Not really intended for production use because of this.
DeltaSnapshotStore is a much more complex implementation, that branches and keeps deltas separately for every Fork(). The complexity of operations is optimized for typical Cluster Autoscaler usage patterns. This is the de-facto production implementation.

In order for DRA autoscaling to work, a ClusterSnapshotStore implementation has to integrate with dynamicresources.Snapshot. This means correctly handling the DRA snapshot during Fork()/Commit()/Revert().

For DRA autoscaling MVP, only BasicSnapshotStore was integrated with dynamicresources.Snapshot. It's pretty trivial in this case - we just need dynamicresources.Snapshot.Clone(). This was enough to test the MVP, but for production use we need to integrate DeltaSnapshotStore as well.

Describe the solution you'd like.:

Add the ability to chain multiple dynamicresources.Snapshot objects, where each object represents a delta from the previous one.
The chain can be queried/modified in the same way as a single dynamicresources.Snapshot. Queries fall back the chain and are expensive, modifications are applied to the top object in the chain and are cheap.
The chain can be consolidated into a single dynamicresources.Snapshot.
DeltaSnapshotStore keeps a chain of dynamicresources.Snapshot objects, and modifies the chain in the same way as the NodeInfo storage chain during Fork()/Commit()/Revert().
The chain is probably easiest to implement by turning dynamicresources.Snapshot into an interface and introducing another "DeltaChain" implementation that uses the current one internally.

Describe any alternative solutions you've considered.:

IMO we should avoid having two completely separate Basic/Delta implementations for dynamicresources.Snapshot like we do for ClusterSnapshotStore. This pattern leads to duplicating large portions of non-trivial code and extending ClusterSnapshotStore is more painful than it should because of it. The Delta/Chain implementation should use the Basic one internally instead.

Additional context.:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.

The text was updated successfully, but these errors were encountered:

towca added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 9, 2025

k8s-ci-robot added area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Jan 9, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot #7681

CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot #7681

towca commented Jan 9, 2025

CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot #7681

CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot #7681

Comments

towca commented Jan 9, 2025