[k8s] Add retries for pod and node fetching to handle transient errors #4543
base: master
Conversation
sky/provision/kubernetes/utils.py
Outdated
if attempt < max_retries - 1:
    logger.debug(f'Kubernetes API call {func.__name__} '
                 f'failed with {str(e)}. Retrying...')
    time.sleep(retry_interval)
Do we want to add some kind of backoff? We have observed that on AWS, when there are a lot of requests going to the same metadata server, it may always fail even with retries, cc'ing @cg505
I don't have any unique insight to k8s, but I think backoffs are probably a good idea anywhere we are retrying API calls.
Ah, good idea - added a short backoff. Since these methods are used in user-facing commands (e.g., sky show-gpus), we want to keep the backoff short to surface errors fast.
Thanks @romilbhardwaj! LGTM.
def _retry_on_error(max_retries=DEFAULT_MAX_RETRIES,
                    retry_interval=DEFAULT_RETRY_INTERVAL_SECONDS,
Suggested change:
- retry_interval=DEFAULT_RETRY_INTERVAL_SECONDS,
+ retry_interval_seconds=DEFAULT_RETRY_INTERVAL_SECONDS,
nodes = kubernetes.core_api(context).list_node(
    _request_timeout=kubernetes.API_TIMEOUT).items
Seems we already have the request_timeout set here. Is this not covering all the cases the retry_on_error can handle?
@@ -105,6 +106,75 @@

logger = sky_logging.init_logger(__name__)

# Default retry settings for Kubernetes API calls
DEFAULT_MAX_RETRIES = 3
DEFAULT_RETRY_INTERVAL_SECONDS = 1
We can reduce the interval if it needs to be less than 1 second :)
K8s API server may have transient issues, in which case we should retry instead of failing on the first attempt.
Tested (run the relevant ones):
bash format.sh