[feature] Deploy Omni in a Highly available way #835
Comments
This is probably pretty big, as we do a lot of in-memory stuff like ephemeral resources, caches, and so on. I guess it would most likely need an in-memory cache like Redis, and a serious rework of a lot of places.
In fact, this is not an Omni issue as it stands at the moment, but rather a deployment issue. Omni supports etcd HA, and you can run multiple Omni instances simultaneously. Only one of the instances will be active at any given moment; that is, only the active instance will pass readiness checks (it will listen on the API port, for example). The traffic from the external Wireguard port should be delivered to the active Omni instance by means of Kubernetes (or any other LB/ingress), and if the active instance changes, the traffic should be diverted to the new active one. (Same with HTTP ingress, but that should work out of the box, as only the active instance passes readiness checks.) There's nothing we can do on Omni's side. Wireguard traffic can only be terminated at a single endpoint; there's no concept of HA Wireguard, as the packet itself can't be attached to any kind of session (the contents are encrypted).
I'm not sure if a single active Omni at any time plus standby Omnis would count as HA. We could maybe move the Wireguard termination into a separate process, as a dedicated singleton service, and aim to make everything else truly HA.
HA means that it can recover from failures; it doesn't necessarily mean that it runs in multiple instances. On the other hand, as long as Wireguard termination stays a SPOF, that doesn't make it HA either.
In fact, any form of communication between Omni instances (vs. communication via etcd) would introduce tons of potential hard-to-debug failures. If you want a mesh of Omnis, any broken link between two of them would lead to hard-to-debug failures. So HA also shouldn't mean overcomplication.
Yes, I wouldn't suggest a mesh, but ideally independent, active instances. Yes, wg would be a SPOF, but when it is down, Omni would be degraded yet still functional.
I'm not quite sure what Omni could do if Wireguard is down? It can neither talk to the machines, nor can the machines talk to Omni. So it's almost the same as being fully down, except that the UI stays up. In the same way, you could run two instances as I described above and terminate Wireguard at only one of them.
It would be nice to be able to initialize Omni by configuring it like in the Talos factory: it would create a Talos ISO with a Kubernetes cluster preconfigured with an omni namespace where load balancing, HTTP, and Wireguard routing are already configured, so you aren't required to have any machines external to your Omni (Talos) cluster.
Does a Wireguard floating VIP between Omni instances work for wg load balancing? Something like this article might be possible outside of Omni, but I'm not sure whether machines would reconnect gracefully when the VIP switches. An active/passive configuration is an OK option as long as state replication and failover are automatic. Customers are asking about solving the problem of a single instance with slow recovery of Omni. We can ignore a full active/active architecture for an initial design.
What I meant above was a kind of "floating IP". There are many ways to configure things, with various trade-offs in terms of how fast Wireguard connections would recover, but the easiest one could do is the following:
The case above has relatively high latency to recover connections with Talos, but it's also the simplest possible, I think. A more involved setup would require sending traffic from a single public Wireguard IP to the active instance, and making sure Omni sees the public endpoint of the Talos peer; then it can recover Wireguard connections really fast.
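One way the "floating IP" idea above could look on the standby host is a failover loop: keep probing the active instance's readiness endpoint and claim the VIP only after several consecutive failures, so a single dropped probe doesn't make the address flap between hosts. Everything here (the probe URL, interface name, address, and thresholds) is a placeholder, claiming the address requires root, and this is a sketch of the idea rather than a production failover agent:

```go
package main

import (
	"fmt"
	"net/http"
	"os/exec"
	"time"
)

// probeActive reports whether the active instance still answers its
// readiness endpoint (placeholder URL, see the active/passive setup above).
func probeActive(url string) bool {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// shouldTakeOver is the failover decision: claim the VIP only after the
// active peer has failed `threshold` consecutive probes, to avoid
// flapping on a single lost packet.
func shouldTakeOver(consecutiveFailures, threshold int) bool {
	return consecutiveFailures >= threshold
}

// claimVIP attaches the floating Wireguard endpoint address to this host.
// Requires root; iface and cidr are placeholders ("eth0", "10.0.0.50/32").
func claimVIP(iface, cidr string) error {
	return exec.Command("ip", "addr", "add", cidr, "dev", iface).Run()
}

func main() {
	failures := 0
	for {
		if probeActive("http://omni-active.internal:8080/healthz/ready") {
			failures = 0
		} else {
			failures++
		}
		if shouldTakeOver(failures, 3) {
			if err := claimVIP("eth0", "10.0.0.50/32"); err != nil {
				fmt.Println("takeover failed:", err)
			}
			return
		}
		time.Sleep(5 * time.Second)
	}
}
```

The recovery latency this comment mentions comes from the probe interval times the failure threshold, plus however long the Talos machines take to re-handshake Wireguard against the moved endpoint.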
Problem Description
Currently Omni is only deployed as a single container instance. This works for small installations and for situations where availability can rely on fast recovery (e.g., a Kubernetes pod restart).
Omni should also have a way to be deployed in a highly available manner where fast recovery isn't good enough.
Solution
Some things an HA Omni installation would need:
Alternative Solutions
No response
Notes
There may be other things that need to be implemented, but SideroLink load balancing is the only outstanding issue I'm aware of.
We would need to document how to deploy Omni in HA mode when it is available.