Pagination from cache KEP #5017

serathius · 2024-12-30T16:06:32Z

Create first draft of #4988 as provisional.

Draft PR for context kubernetes/kubernetes#128951

/cc @wojtek-t @deads2k @MadhavJivrajani @jpbetz

dims · 2024-12-30T16:18:01Z

MadhavJivrajani · 2025-01-02T19:08:42Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+As still some pagination requests will be delegated to etcd, we will monitor the
+success rate by measuring the pagination cache hit vs miss ratio.
+
+Consideration: Should we start respecting the limit parameter?


Not sure I understand - are we not respecting the limit parameter in the current iteration?

Currently API server doesn't respect limit when serving RV="0". https://github.com/kubernetes/kubernetes/blob/6746df77f2376c6bc1fd0de767d2a94e6bd6cec1/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L806-L818

I think we should consider re-enabling limit for consistency, however we need to better understand consequences. Impact on client/server when we cannot serve pagination from cache, like with L7 LB or pagination taking more than 75s.

I'm not worried about clients, user setting limit should be already prepared to handle pagination as it is required when not setting RV.

I'm actually also worried about clients - I think I've seen cases of people doing that and relying on the baviour of lack of pagination for RV=0.
I'm not saying it's a hard-no, but we need to figure out the story here.

That said, I would put it explicitly out-of-scope for this KEP and add that explicitly as future work.

Sounds good.

MadhavJivrajani · 2025-01-02T20:21:22Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+
+For setups with L4 loadbalancer apiserver can be configured with Goaway, which
+requests client reconnects periodically, however per request probability should
+be configured around 0.1%.


Is there a reason for 0.1% here?

0.1% is the recommended to configuration for GOAWAY kubernetes/kubernetes#88567

MadhavJivrajani · 2025-01-02T20:25:02Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+requests client reconnects periodically, however per request probability should
+be configured around 0.1%.
+
+For L7 loadbalancer the default algorithm usually is round-robin. For most LBs


Trying to better understand this, is the worst case here as follows (assume 3 API Servers A, B, C):

Client hits API Server A: assuming the rv is cached, snapshot is created on receiving a LIST request with a limit parameter set

Client hits API Server B: no snapshot present, we delegate to etcd

Client hits API Server C: no snapshot present, we delegate to etcd

so the performance degenerates to the current situation without cached pagination with slight improvement for (1)? And if (1) also delegates to etcd in case the rv isn't cached, then perf degenerates to the current scenario of no cached pagination?

This ^ is assuming all are on a version that has support for pagination from the cache. If there is one server that is on a minor version which does not have support, my understanding is that that would again be delegated to etcd

Right, there is no regression, assuming that:

We will delegate continue requests, that we don't have cached responses for, to etcd.

We will not change how API server doesn't respect limit for RV="0".

Signed-off-by: Madhav Jivrajani <[email protected]>

kep-4988: flesh out cached pagination procedure

k8s-ci-robot · 2025-01-08T11:04:47Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: serathius
Once this PR has been reviewed and has the lgtm label, please assign apelisse for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

keps/sig-api-machinery/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wojtek-t · 2025-01-10T07:42:07Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+  - Since resourceVersions provide a global logical clock sequencing all events in the cluster, a snapshot
+    of the watchCache for this resourceVersion is retrieved using the resourceVersion as the key.
+  - The corresponding snapshot may not be present in the following 2 scenarios at an API Server:
+    - Snapshot has been cleaned up due to the 75s TTL (see below).


nit: we switched that recently, with 75s still being default, but it now depends on request timeouts

wojtek-t · 2025-01-10T07:44:29Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+As still some pagination requests will be delegated to etcd, we will monitor the
+success rate by measuring the pagination cache hit vs miss ratio.
+
+Consideration: Should we start respecting the limit parameter?


I'm actually also worried about clients - I think I've seen cases of people doing that and relying on the baviour of lack of pagination for RV=0.
I'm not saying it's a hard-no, but we need to figure out the story here.

That said, I would put it explicitly out-of-scope for this KEP and add that explicitly as future work.

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

wojtek-t · 2025-01-10T07:50:38Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+
+#### Memory overhead
+
+No, B-tree only store pointers the actual objects, not the object themselves.


Well - there is some overhead (as you also write below) - you just claim it's small (can we somehow quantify small?)

wojtek-t · 2025-01-10T07:51:36Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+For L7 loadbalancer the default algorithm usually is round-robin. For most LBs
+it should be possible to switch the algorithm to be based on source IP hash.
+Even if that is not possible, stored snapshots will never be used and user will
+not be able to benefit from the feature.


I'm not sure we can expect providers to change their LB configuration...

My main point here was that we expect only minimal overhead, while users of L7 LB can can opt in via a common configuration option.

serathius · 2025-01-14T13:03:52Z

I run the scalability tests to measure overhead of clone. Scalability tests are a good as they don't use pagination nor exact request. I used kubernetes/kubernetes#126855 which clones the storage on each request. The results are good:

Overhead based on profiles collected during scalability tests:

Additional 7GB of object allocations, which accounts for 0.2% of allocations.
Additional 300MB of memory used, which accounts for 1.3% of memory used in scalability test.

The overhead is small enough that is within normal variance of memory usage during the test. The are some noticeable increases in request latency however I they are still far from SLO and could be due to high variance in results.

LIST pods with namespace 99%ile increased from 1s to 1.2s (within variance https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=LoadResponsiveness_Prometheus&Resource=pods&Scope=namespace&Subresource=&Verb=LIST)
DELETE pods 99%ile increased from 170ms to 300ms https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=LoadResponsiveness_Prometheus&Resource=pods&Scope=resource&Subresource=&Verb=DELETE
and some other single object operation have seen latency increase.

If we account for high variance of latency in scalability tests and look at profile differences only, we can estimate the expected overhead of keeping all store snapshots in the watchcache to be below 2% of memory.

wojtek-t · 2025-01-14T13:15:20Z

Are you looking at LoadResponsiveness_Prometheus or LoadResponsiveness_PrometheusSimple for latencies?
If you got 170ms for delete pods in base, it's probably the former, but it also has much higher variance.
What are is the comparison for the later?

https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=LoadResponsiveness_Prometheus&Resource=pods&Scope=resource&Subresource=&Verb=DELETE
https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=LoadResponsiveness_PrometheusSimple&Resource=pods&Scope=resource&Subresource=&Verb=DELETE

serathius · 2025-01-14T13:21:32Z

I looked at the LoadResponsiveness_Prometheus. For PrometheusSimple the latencies match aside of some anomalies like GET services, however they also seem very variadic in PrometheusSimple. https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=LoadResponsiveness_PrometheusSimple&Resource=services&Scope=resource&Subresource=&Verb=GET

wojtek-t · 2025-01-14T13:56:06Z

I would focus on PrometheusSimple as something that is much more predictible/repeatable.
If those match, and the overhead as you wrote is fairly small (I would be interested in observing how it looks also on small scale), then this solution is much preferable to me (even if in the first step we will only support pagination and nothing else).

jpbetz · 2025-01-14T19:09:40Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+B-tree snapshots to serve paginated lists.
+
+Mechanism:
+1. **Snapshot Creation:** When a paginated list request (with a limit parameter


In a HA configuration where there are multiple apiservers and client requests are load balanced across those apiservers, is the idea that each apiserver creates the snapshot on the first paginated request it receives, even if the request is for a subsequent page?

Not really - if I get only a subsequent request for n-th page, I simply forward it to etcd....

... unless we go with the approach I'm proposing instead: #5017 (comment)

jpbetz · 2025-01-14T19:11:20Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+4. **Snapshot Cleanup:** Snapshots will be subject to a Time-To-Live (TTL)
+   mechanism. We will reuse the existing watch event cleanup logic, which has a
+   75s TTL. This ensures that snapshots don't accumulate indefinitely.


If a snapshot is missing for a request (either cleaned up, or otherwise), is it recreated or does the request fail?

Based on the sentence below, perhaps it falls through to etcd and we get whatever etcd does with it?

As David wrote.

deads2k · 2025-01-14T22:34:57Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+
+### Non-Goals
+
+- Serve `resourceVersion="N"` request from watch cache


So I'm clear, this means that a paginated list from RV=N (which is valid I think based on docs: https://kubernetes.io/docs/reference/using-api/api-concepts/#semantics-for-get-and-list ) will not be supported?

Not initially - depending on the path we take here, if we take #5017 (comment) it will be easily extendable to support it later.

We just wanted to reduce the scope initially, but if you prefer to have it supported from the beginning, we can change it.

deads2k · 2025-01-14T22:36:53Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+   arrives, the API server will:
+  - Extract the resourceVersion from the continue token. 
+  - Since resourceVersions provide a global logical clock sequencing all events in the cluster, a snapshot
+    of the watchCache for this resourceVersion is retrieved using the resourceVersion as the key.


if the request with the continue token goes to a different kube-apiserver than the initial list, how will this lookup succeed?

it's forwarded to etcd... unless we go with the approach that I'm proposing: #5017 (comment)

deads2k · 2025-01-14T22:42:21Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+
+- `k8s/apiserver/pkg/storage/cache`: `2024-12-12` - `<test coverage>`
+
+##### Integration tests


let's start a list of must-have integration tests. I definitely want to see the handling of multiple kube-apiservers where the initial list and the continue list go to different kube-apiservers and we correctly fallback to etcd and the result functions.

deads2k · 2025-01-14T22:43:23Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+
+[API call latency SLI](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md)
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?


Given the importance of cache misses for continue tokens, I'd like to have those metrics available in alpha to inform going to beta.

deads2k · 2025-01-14T22:48:05Z

keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md

+
+No
+
+### Troubleshooting


How can we check in the field whether the response from the cache exactly matches the response from etcd?

Pagination from cache KEP

a802612

k8s-ci-robot requested review from deads2k, jpbetz, MadhavJivrajani and wojtek-t December 30, 2024 16:06

MadhavJivrajani reviewed Jan 2, 2025

View reviewed changes

MadhavJivrajani and others added 2 commits January 3, 2025 12:39

kep-4988: flesh out cached pagination procedure

f45eb42

Signed-off-by: Madhav Jivrajani <[email protected]>

Merge pull request #1 from MadhavJivrajani/kep-4988-flesh-out

5921e48

kep-4988: flesh out cached pagination procedure

wojtek-t reviewed Jan 10, 2025

View reviewed changes

wojtek-t self-assigned this Jan 10, 2025

jpbetz reviewed Jan 14, 2025

View reviewed changes

deads2k reviewed Jan 14, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pagination from cache KEP #5017

Pagination from cache KEP #5017

serathius commented Dec 30, 2024 •

edited

Loading

dims commented Dec 30, 2024 •

edited

Loading

MadhavJivrajani Jan 2, 2025

serathius Jan 3, 2025

wojtek-t Jan 10, 2025

serathius Jan 10, 2025

MadhavJivrajani Jan 2, 2025

serathius Jan 3, 2025

MadhavJivrajani Jan 2, 2025 •

edited

Loading

MadhavJivrajani Jan 2, 2025

serathius Jan 3, 2025

k8s-ci-robot commented Jan 8, 2025

wojtek-t Jan 10, 2025

serathius Jan 10, 2025

wojtek-t Jan 10, 2025

wojtek-t Jan 10, 2025

wojtek-t Jan 10, 2025

serathius Jan 10, 2025

serathius commented Jan 14, 2025

wojtek-t commented Jan 14, 2025

serathius commented Jan 14, 2025

wojtek-t commented Jan 14, 2025

jpbetz Jan 14, 2025

wojtek-t Jan 15, 2025

jpbetz Jan 14, 2025

deads2k Jan 14, 2025

wojtek-t Jan 15, 2025

deads2k Jan 14, 2025

wojtek-t Jan 15, 2025

deads2k Jan 14, 2025

wojtek-t Jan 15, 2025

deads2k Jan 14, 2025

deads2k Jan 14, 2025

deads2k Jan 14, 2025


		#### Memory overhead

		No, B-tree only store pointers the actual objects, not the object themselves.


		### Non-Goals

		- Serve `resourceVersion="N"` request from watch cache


		- `k8s/apiserver/pkg/storage/cache`: `2024-12-12` - `<test coverage>`

		##### Integration tests


		[API call latency SLI](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md)

		###### Are there any missing metrics that would be useful to have to improve observability of this feature?

Pagination from cache KEP #5017

Are you sure you want to change the base?

Pagination from cache KEP #5017

Conversation

serathius commented Dec 30, 2024 • edited Loading

dims commented Dec 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MadhavJivrajani Jan 2, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

k8s-ci-robot commented Jan 8, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

serathius commented Jan 14, 2025

wojtek-t commented Jan 14, 2025

serathius commented Jan 14, 2025

wojtek-t commented Jan 14, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

serathius commented Dec 30, 2024 •

edited

Loading

dims commented Dec 30, 2024 •

edited

Loading

MadhavJivrajani Jan 2, 2025 •

edited

Loading