[k8s] Add retries for pod and node fetching to handle transient errors #4543

Open
wants to merge 5 commits into master

Conversation

romilbhardwaj (Collaborator)

K8s API server may have transient issues, in which case we should retry instead of failing on the first attempt.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manually using a bad k8s context and ensuring retries are run.

if attempt < max_retries - 1:
    logger.debug(f'Kubernetes API call {func.__name__} '
                 f'failed with {str(e)}. Retrying...')
    time.sleep(retry_interval)
Collaborator

Do we want to add some kind of backoff? We have observed on AWS that when there are a lot of requests going to the same metadata server, it can keep failing even with retries. cc'ing @cg505

Collaborator

I don't have any unique insight to k8s, but I think backoffs are probably a good idea anywhere we are retrying API calls.

romilbhardwaj (Collaborator, Author)

Ah, good idea - added a short backoff. Since these helpers are called from user-facing commands (e.g., sky show-gpus), we want to keep the backoff short to surface errors quickly.
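For reference, a minimal sketch of what a retry decorator with a short exponential backoff could look like (the names _retry_on_error, DEFAULT_MAX_RETRIES and DEFAULT_RETRY_INTERVAL_SECONDS mirror the diff; the doubling backoff, the broad exception clause and the plain logging setup are illustrative assumptions, not the exact implementation):

import functools
import logging
import time

logger = logging.getLogger(__name__)  # The real module uses sky_logging.init_logger.

DEFAULT_MAX_RETRIES = 3
DEFAULT_RETRY_INTERVAL_SECONDS = 1


def _retry_on_error(max_retries=DEFAULT_MAX_RETRIES,
                    retry_interval=DEFAULT_RETRY_INTERVAL_SECONDS):
    """Retries the wrapped Kubernetes API call on transient errors."""

    def decorator(func):

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            backoff = retry_interval
            last_exception = None
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:  # Assumed; the real code would narrow this.
                    last_exception = e
                    if attempt < max_retries - 1:
                        logger.debug(f'Kubernetes API call {func.__name__} '
                                     f'failed with {str(e)}. Retrying...')
                        time.sleep(backoff)
                        backoff *= 2  # Assumed doubling; total wait stays short.
            raise last_exception

        return wrapper

    return decorator

Applying it would then just be a matter of decorating the pod/node fetching helpers, e.g. @_retry_on_error() above the function that lists nodes.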

Michaelvll (Collaborator) left a comment

Thanks @romilbhardwaj! LGTM.



def _retry_on_error(max_retries=DEFAULT_MAX_RETRIES,
                    retry_interval=DEFAULT_RETRY_INTERVAL_SECONDS,
Collaborator

Suggested change:
-                    retry_interval=DEFAULT_RETRY_INTERVAL_SECONDS,
+                    retry_interval_seconds=DEFAULT_RETRY_INTERVAL_SECONDS,

Comment on lines +527 to +528
nodes = kubernetes.core_api(context).list_node(
    _request_timeout=kubernetes.API_TIMEOUT).items
Collaborator

Seems we already have the request_timeout set here. Does that not cover all the cases that retry_on_error can handle?
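For context on why both may be needed: _request_timeout only bounds how long a single list_node call can block, while the decorator re-issues calls that fail outright (e.g., connection errors or transient API server errors). A minimal sketch of how the two might compose on the node-fetching path (the function name get_kubernetes_nodes and the bare decorator usage are assumptions for illustration):

@_retry_on_error()
def get_kubernetes_nodes(context=None):
    # The timeout caps each individual request; the decorator retries the
    # whole call (with a short backoff) if it raises a transient error.
    return kubernetes.core_api(context).list_node(
        _request_timeout=kubernetes.API_TIMEOUT).items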

@@ -105,6 +106,75 @@

logger = sky_logging.init_logger(__name__)

# Default retry settings for Kubernetes API calls
DEFAULT_MAX_RETRIES = 3
DEFAULT_RETRY_INTERVAL_SECONDS = 1
Collaborator

We can reduce the interval if it needs to be less than 1 second :)
