[feature] Deploy Omni in a Highly available way #835

Open
rothgar opened this issue Jan 14, 2025 · 10 comments

Comments

@rothgar
Member

rothgar commented Jan 14, 2025

Problem Description

Currently Omni is only deployed as a single container instance. This works for small installations and for situations where availability can rely on fast recovery (e.g. a Kubernetes pod that gets rescheduled).

Omni should also have a way to be deployed in a highly available manner where fast recovery isn't good enough.

Solution

Some things a HA Omni installation would need:

  • External etcd - this is already available
  • Shared secrets (e.g. certificates) - this can be achieved with shared storage
  • Web service and API load balancing - this is already possible
  • Wireguard tunnel/Siderolink load balancing

Alternative Solutions

No response

Notes

There may be other things that need to be implemented, but SideroLink load balancing is the only outstanding issue I'm aware of.

We would need to document how to deploy Omni in HA mode when it is available.

@rothgar rothgar added this to Product Jan 14, 2025
@rothgar rothgar moved this to Ideas in Product Jan 14, 2025
@utkuozdemir
Member

This is probably pretty big, as we keep a lot of state in memory, such as ephemeral resources, caches and so on. I guess it would most likely need a shared in-memory store like Redis, and serious rework in a lot of places.

@smira
Member

smira commented Jan 15, 2025

In fact, as things stand at the moment, this is not an Omni issue but rather a deployment issue.

Omni supports etcd HA, and you can run multiple Omni instances simultaneously. Only one of the instances will be active at any moment in time, meaning that only it will pass readiness checks (it will listen on the API port, for example).

The traffic from the external Wireguard port should be delivered to the active Omni instance by means of Kubernetes (or any other LB/ingress), and if the active instance changes, it should be diverted to the new active one. (The same goes for HTTP ingress, but that should work out of the box, as only the active instance passes readiness checks.)

There's nothing we can do on the Omni side. Wireguard traffic can only be terminated at a single endpoint; there's no concept of HA Wireguard, as a packet can't be attached to any kind of session (the contents are encrypted).
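
As a rough illustration of the routing rule in this comment (send traffic only to the instance that passes readiness checks), here is a minimal Python sketch. It is not Omni code or a real load balancer configuration: the `Instance` type, the `select_active` helper, the instance names, and the addresses (documentation-range IPs with an illustrative port) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Instance:
    name: str         # hypothetical label, e.g. "omni-a"
    wg_endpoint: str  # hypothetical host:port that receives Wireguard UDP traffic


def select_active(
    instances: Sequence[Instance],
    is_ready: Callable[[Instance], bool],
) -> Optional[Instance]:
    """Return the first instance that passes its readiness check, or None."""
    for inst in instances:
        if is_ready(inst):
            return inst
    return None


# Illustrative addresses only (RFC 5737 documentation range).
instances = [
    Instance("omni-a", "203.0.113.10:50180"),
    Instance("omni-b", "203.0.113.20:50180"),
]

# Simulated readiness results: only omni-b is active right now.
ready = {"omni-a": False, "omni-b": True}

active = select_active(instances, lambda inst: ready[inst.name])
print(active.wg_endpoint if active else "no active instance")
```

In a real deployment the `is_ready` callback would be whatever readiness probe the LB or ingress already runs; the point is only that both HTTP and Wireguard traffic follow the same single answer.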

@utkuozdemir
Member

I'm not sure if a single active Omni at any time + standby Omnis would count as HA.

We could maybe move the Wireguard termination to a separate process, as a dedicated singleton service, and aim to get everything else true HA.

@smira
Member

smira commented Jan 15, 2025

HA means that it can recover from failures; it doesn't necessarily mean that it runs in multiple instances. On the other hand, as long as Wireguard termination stays a SPOF, running multiple instances doesn't make it HA either.

@smira
Member

smira commented Jan 15, 2025

In fact, any form of direct communication between Omni instances (as opposed to communication via etcd) would introduce tons of potential hard-to-debug failures. If you want a mesh of Omnis, any broken link between two of them would lead to such failures. So HA also shouldn't mean overcomplication.

@utkuozdemir
Member

Yes, I wouldn't suggest a mesh, but ideally, independent, active instances. Yes, wg would be a SPOF, but when it is down, Omni would be degraded, but functional.

@smira
Member

smira commented Jan 15, 2025

I'm not quite sure what Omni could do if Wireguard is down. It can't talk to the machines, nor can the machines talk to Omni. So it's almost the same as being fully down, except that the UI is still up.

In the same way, you could run two instances as described above and terminate Wireguard at only one of them.

@FedotCompot

It would be nice to be able to initialize Omni by configuring it as in the Talos factory, which would create a Talos ISO with a k8s cluster preconfigured with an omni namespace where load balancing, HTTP and Wireguard routing are already configured, so that you are not required to have any machines external to your Omni (Talos) cluster.

@rothgar
Member Author

rothgar commented Jan 15, 2025

Would a floating Wireguard VIP between Omni instances work for wg load balancing? Something like this article might be possible outside of Omni, but I'm not sure if machines would reconnect gracefully when the VIP switches.

An active/passive configuration is an OK option as long as state handling and failover are automatic. Customers are asking about solving the problem of a single instance and slow recovery of Omni. We can ignore a full active/active architecture for an initial design.

@smira
Member

smira commented Jan 16, 2025

What I meant above was a kind of "floating IP". There are many ways to configure things, with various trade-offs in terms of how fast Wireguard connections would recover, but the easiest thing one could do is the following:

  • deploy HA etcd
  • assume we have 2 machines to run Omni: A and B, with public IPs IP(A) and IP(B)
  • there's an HTTP ingress which delivers HTTP traffic to either Omni(A) or Omni(B) depending on which one passes readiness checks
  • Omni(A) advertises its WG endpoint as IP(A), Omni(B) as IP(B)
  • Talos monitors the SideroLink connection status, so if it was connected to e.g. Omni(A) and that instance went down, it will discover in approximately 3 minutes that the connection is broken. It then re-establishes the connection, going first via the HTTP ingress (which delivers the request to the live Omni(B) instance); Omni(B) then advertises its own IP(B) to Talos, and the Wireguard connection is re-established

The case above has relatively high latency to recover connections with Talos, but it's also the simplest possible I think.
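
The failover flow in the list above can be walked through with a small Python simulation. This is only a sketch under stated assumptions: the `Ingress` class and its `register` method are hypothetical stand-ins for the HTTP ingress and the registration step, not Omni or Talos APIs, and the 180 second timeout simply restates the "approx. 3 minutes" figure from the comment rather than a measured value.

```python
from typing import Dict, Optional

# Restates the "approx. 3 minutes" from the comment; an assumption, not a measurement.
DETECTION_TIMEOUT_S = 180


class Ingress:
    """Hypothetical stand-in for the HTTP ingress in front of Omni(A) and Omni(B)."""

    def __init__(self, endpoints: Dict[str, str]) -> None:
        self.endpoints = endpoints  # instance name -> advertised Wireguard endpoint
        self.ready = {name: False for name in endpoints}

    def register(self) -> Optional[str]:
        """Return the endpoint advertised by the instance that is ready, if any."""
        for name, endpoint in self.endpoints.items():
            if self.ready[name]:
                return endpoint
        return None


ingress = Ingress({"A": "IP(A)", "B": "IP(B)"})

# Initially Omni(A) is active, so Talos is told to connect to IP(A).
ingress.ready["A"] = True
endpoint = ingress.register()

# At t=0 Omni(A) goes down and Omni(B) becomes the active instance.
ingress.ready["A"] = False
ingress.ready["B"] = True

# Talos only notices the broken link after the detection timeout,
# then re-registers through the ingress and learns IP(B).
recovered_at_s = 0 + DETECTION_TIMEOUT_S
new_endpoint = ingress.register()

print(endpoint, "->", new_endpoint, "after about", recovered_at_s, "s")
```

The sketch makes the trade-off visible: the recovery time is dominated by the detection timeout on the Talos side, not by anything the ingress or Omni does.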

A more involved setup would require sending traffic from a single public Wireguard IP to the active instance, and making sure Omni sees the public endpoint of the Talos peer; then it can recover Wireguard connections really fast.
