Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[loadbalancer] support https healthchecks and join nodes as down #2188

Open
simonostendorf opened this issue Oct 10, 2024 · 3 comments
Open
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@simonostendorf
Copy link
Contributor

/kind feature

Describe the solution you'd like

I have three feature enhancement requests that are related and are about the load balancer for the api-server:

  1. support https healthchecks

I think it would be useful to use HTTPS health checks instead of TCP, because the api-server might have a reachable port but is not healthy to receive requests. This can only be solved with an HTTPS health check (e.g. to the /readyz endpoint).

I would prefer to add configuration options for the loadbalancer healthcheck, so that the user can decide which settings to use (see below).

  1. join nodes as down until api-server is healthy

A known problem with the loadbalancer is that new nodes are joined as "online / up" and after the timeout and retries of the healthcheck the node is marked as down. This could lead to dropped requests to the api server as they may be redirected to the "up" node that is not yet up.

I would like to join nodes as down first (set adminStateUp to false), wait until they are ready and then mark them as up.
To do this, the "is the node ready" question must be solved which was a problem in the past.
Since this proposal has been added to the cluster-api and implemented in the kubeadm control-plane provider, it is now solvable. So currently this idea only works with the kubeadm control-plane, but since the new condition has been proposed to capi in general, I think it could be added to more providers soon.

The loadbalancer controller could evaluate the adminState, which would be used to join the node in this way:

  • If the loadbalancer has no members, add the first node as "up", otherwise the initial node registration may fail.
  • If the loadbalancer has members and the machine (important: this must be checked on the cluster-api machine object) has the APIServerPodHealthy condition, add the node as 'up'.
  • If the load balancer has members and the machine does not have the condition, add the node as "down".

If the current status of an lb member is "down", but the new status should be "up", it needs to be recreated (same logic as changing the ip of an lb member).

Since no breaking changes should be introduced for the user, I would add this feature behind a feature flag / setting.

  1. make healthcheck/settings configurable

I think it would be useful to add something like .spec.apiServerLoadBalancer.healthCheck to the OpenStackCluster after adding more customisation options for the lb, so the users can decide which settings they want to use.

This could be combined with the "no breaking changes" aspect from above, so something like .spec.apiServerLoadBalancer.joinNodesAsDown could also be added to the OpenStackCluster so that the users can decide per cluster if they want to use this feature or not.

When these new fields are added, they could default to the current values so that there is no change for the user.

example:

spec:
  apiServerLoadBalancer:
    joinNodesAsDown: true
    healthCheck:
      type: TCP # TCP or HTTPS
      delay: 10
      maxRetries: 5
      maxRetriesDown: 3
      timeout: 5
      urlPath: /ready
      expectedCodes: 200-299

Anything else you would like to add:

I have roughly tested these changes and they work as expected. If this feature request is discussed, I could add a PR for it.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 10, 2024
@EmilienM
Copy link
Contributor

@mdbooth some ideas^^^ once we work on a load balancer controller.

@simonostendorf we have a strong desire to write a new controller that will handle different load balancer backends (octavia, no LB, bring your own LB, etc), so I suspect this issue is a great input for something that will be implemented at some point. I'm not sure whether we will change the existing API though (because we would have to maintain it since we're v1beta1) if at some point we create a new one for the LB controller.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2025
@simonostendorf
Copy link
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
Status: Inbox
Development

No branches or pull requests

4 participants