[k8s] Add retries for pod and node fetching to handle transient errors #4543

Open
wants to merge 5 commits into master

Conversation

romilbhardwaj (Collaborator)

K8s API server may have transient issues, in which case we should retry instead of failing on the first attempt.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manually using a bad k8s context and ensuring retries are run.

if attempt < max_retries - 1:
    logger.debug(f'Kubernetes API call {func.__name__} '
                 f'failed with {str(e)}. Retrying...')
    time.sleep(retry_interval)
Collaborator

Do we want to add some kind of backoff? We have observed on AWS that when there are a lot of requests going to the same metadata server, it can keep failing even with retries. cc'ing @cg505

Collaborator

I don't have any unique insight to k8s, but I think backoffs are probably a good idea anywhere we are retrying API calls.

romilbhardwaj (Collaborator, Author)

Ah, good idea - added a short backoff. Since these helpers are called from user-facing commands (e.g., sky show-gpus), we want to keep the backoff short to surface errors quickly.
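For reference, a minimal sketch of what a retry decorator with a short exponential backoff could look like (the names _retry_on_error, DEFAULT_MAX_RETRIES and DEFAULT_RETRY_INTERVAL_SECONDS mirror the diff; the doubling backoff, the broad exception clause and the plain logging setup are illustrative assumptions, not the exact implementation):

import functools
import logging
import time

logger = logging.getLogger(__name__)  # The real module uses sky_logging.init_logger.

DEFAULT_MAX_RETRIES = 3
DEFAULT_RETRY_INTERVAL_SECONDS = 1


def _retry_on_error(max_retries=DEFAULT_MAX_RETRIES,
                    retry_interval=DEFAULT_RETRY_INTERVAL_SECONDS):
    """Retries the wrapped Kubernetes API call on transient errors."""

    def decorator(func):

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            backoff = retry_interval
            last_exception = None
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:  # Assumed; the real code would narrow this.
                    last_exception = e
                    if attempt < max_retries - 1:
                        logger.debug(f'Kubernetes API call {func.__name__} '
                                     f'failed with {str(e)}. Retrying...')
                        time.sleep(backoff)
                        backoff *= 2  # Assumed doubling; total wait stays short.
            raise last_exception

        return wrapper

    return decorator

Applying it would then just be a matter of decorating the pod/node fetching helpers, e.g. @_retry_on_error() above the function that lists nodes.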

Michaelvll (Collaborator) left a comment

Thanks @romilbhardwaj! LGTM.



def _retry_on_error(max_retries=DEFAULT_MAX_RETRIES,
                    retry_interval=DEFAULT_RETRY_INTERVAL_SECONDS,
Collaborator

Suggested change:
-                    retry_interval=DEFAULT_RETRY_INTERVAL_SECONDS,
+                    retry_interval_seconds=DEFAULT_RETRY_INTERVAL_SECONDS,

Comment on lines +527 to +528
nodes = kubernetes.core_api(context).list_node(
    _request_timeout=kubernetes.API_TIMEOUT).items
Collaborator

Seems we already have the request_timeout set here. Does that not cover all the cases that retry_on_error can handle?
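For context on why both may be needed: _request_timeout only bounds how long a single list_node call can block, while the decorator re-issues calls that fail outright (e.g., connection errors or transient API server errors). A minimal sketch of how the two might compose on the node-fetching path (the function name get_kubernetes_nodes and the bare decorator usage are assumptions for illustration):

@_retry_on_error()
def get_kubernetes_nodes(context=None):
    # The timeout caps each individual request; the decorator retries the
    # whole call (with a short backoff) if it raises a transient error.
    return kubernetes.core_api(context).list_node(
        _request_timeout=kubernetes.API_TIMEOUT).items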

@@ -105,6 +106,75 @@

logger = sky_logging.init_logger(__name__)

# Default retry settings for Kubernetes API calls
DEFAULT_MAX_RETRIES = 3
DEFAULT_RETRY_INTERVAL_SECONDS = 1
Collaborator

We can reduce the interval if it needs to be less than 1 second :)
