CA DRA: integrate DeltaSnapshotStore with dynamicresources.Snapshot #7681
Which component are you using?:
/area cluster-autoscaler
/area core-autoscaler
/wg device-management
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
There are 2 `ClusterSnapshotStore` implementations with very different performance characteristics:

- `BasicSnapshotStore` is a very simple, reference implementation that clones the whole state during `Fork()`. It's easy to understand and can be used e.g. in tests, but the complexity of operations is not optimized for the typical usage patterns during a Cluster Autoscaler loop. Not really intended for production use because of this.
- `DeltaSnapshotStore` is a much more complex implementation that branches and keeps deltas separately for every `Fork()`. The complexity of operations is optimized for typical Cluster Autoscaler usage patterns. This is the de-facto production implementation.

In order for DRA autoscaling to work, a `ClusterSnapshotStore` implementation has to integrate with `dynamicresources.Snapshot`. This means correctly handling the DRA snapshot during `Fork()/Commit()/Revert()`.

For DRA autoscaling MVP, only `BasicSnapshotStore` was integrated with `dynamicresources.Snapshot`. It's pretty trivial in this case - we just need `dynamicresources.Snapshot.Clone()`. This was enough to test the MVP, but for production use we need to integrate `DeltaSnapshotStore` as well.

Describe the solution you'd like.:
- [...] `dynamicresources.Snapshot` objects, where each object represents a delta from the previous one.
- [...] `dynamicresources.Snapshot`. Queries fall back through the chain and are expensive, modifications are applied to the top object in the chain and are cheap.
- [...] `dynamicresources.Snapshot`.
- `DeltaSnapshotStore` keeps a chain of `dynamicresources.Snapshot` objects, and modifies the chain in the same way as the NodeInfo storage chain during `Fork()/Commit()/Revert()`.
- [...] `dynamicresources.Snapshot` into an interface and introducing another "DeltaChain" implementation that uses the current one internally.

Describe any alternative solutions you've considered.:
IMO we should avoid having two completely separate Basic/Delta implementations for `dynamicresources.Snapshot` like we do for `ClusterSnapshotStore`. This pattern leads to duplicating large portions of non-trivial code, and extending `ClusterSnapshotStore` is more painful than it should be because of it. The Delta/Chain implementation should use the Basic one internally instead.

Additional context.:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.
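
To make the chain idea from the proposed solution more concrete, here is a minimal Go sketch of the mechanics: queries fall back through the chain, modifications touch only the top layer, and `Fork()/Commit()/Revert()` push, merge, or drop a layer. Everything in it is hypothetical and for illustration only: the `layer` and `deltaChain` types, their methods, and the claim-name-to-node map are assumptions, not the real `dynamicresources.Snapshot` API, which tracks much more state.

```go
package main

import "fmt"

// layer is a hypothetical stand-in for one dynamicresources.Snapshot in the chain.
// It records only the changes made since the layer below it.
type layer struct {
	set     map[string]string // claim name -> node, added or overwritten in this layer
	deleted map[string]bool   // claim names removed in this layer
}

func newLayer() *layer {
	return &layer{set: map[string]string{}, deleted: map[string]bool{}}
}

// deltaChain is a hypothetical chain of layers; the last element is the top.
type deltaChain struct {
	layers []*layer
}

func newDeltaChain() *deltaChain {
	return &deltaChain{layers: []*layer{newLayer()}}
}

func (c *deltaChain) top() *layer { return c.layers[len(c.layers)-1] }

// Get walks from the top layer down to the base, so its cost grows with chain length.
func (c *deltaChain) Get(name string) (string, bool) {
	for i := len(c.layers) - 1; i >= 0; i-- {
		l := c.layers[i]
		if l.deleted[name] {
			return "", false
		}
		if v, ok := l.set[name]; ok {
			return v, true
		}
	}
	return "", false
}

// Set modifies only the top layer.
func (c *deltaChain) Set(name, node string) {
	t := c.top()
	delete(t.deleted, name)
	t.set[name] = node
}

// Delete records a removal in the top layer without touching lower layers.
func (c *deltaChain) Delete(name string) {
	t := c.top()
	delete(t.set, name)
	t.deleted[name] = true
}

// Fork pushes an empty layer instead of cloning the whole state.
func (c *deltaChain) Fork() { c.layers = append(c.layers, newLayer()) }

// Revert drops the top layer; it is a no-op when nothing was forked.
func (c *deltaChain) Revert() {
	if len(c.layers) > 1 {
		c.layers = c.layers[:len(c.layers)-1]
	}
}

// Commit merges the top layer into the layer below it; a no-op when nothing was forked.
func (c *deltaChain) Commit() {
	if len(c.layers) < 2 {
		return
	}
	t := c.top()
	c.layers = c.layers[:len(c.layers)-1]
	below := c.top()
	for name := range t.deleted {
		delete(below.set, name)
		below.deleted[name] = true
	}
	for name, node := range t.set {
		delete(below.deleted, name)
		below.set[name] = node
	}
}

func main() {
	c := newDeltaChain()
	c.Set("claim-a", "node-1")

	c.Fork()
	c.Set("claim-b", "node-2")
	c.Revert() // the forked change to claim-b is discarded
	_, ok := c.Get("claim-b")
	fmt.Println("claim-b present after Revert:", ok)

	c.Fork()
	c.Delete("claim-a")
	c.Commit() // the deletion is merged into the base layer
	_, ok = c.Get("claim-a")
	fmt.Println("claim-a present after Commit:", ok)
}
```

In this sketch each layer is a plain map-based structure, which mirrors the suggestion that the Delta/Chain implementation reuse the Basic one internally rather than duplicate its logic.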