[feature] Deploy Omni in a Highly available way #835

rothgar · 2025-01-14T17:51:58Z

Problem Description

Currently, Omni is only deployed as a single container instance. This works for small installations and for situations where availability can rely on fast recovery (e.g. a Kubernetes pod).

Omni should also have a way to be deployed in a highly available manner for cases where fast recovery isn't good enough.

Solution

Some things an HA Omni installation would need:

External etcd - this is already available
Shared secrets (e.g. certificates) - this can be achieved with shared storage
Web service and API load balancing - this is already possible
Wireguard tunnel/SideroLink load balancing

Alternative Solutions

No response

Notes

There may be other things that need to be implemented, but SideroLink load balancing is the only outstanding issue I'm aware of.

We would need to document how to deploy Omni in HA mode when it is available.

utkuozdemir · 2025-01-15T10:26:36Z

This is probably pretty big, as we do a lot of in-memory stuff like ephemeral resources, caches and so on. I guess it would most likely need an in-memory cache like Redis, and a serious rework of a lot of places.

smira · 2025-01-15T10:46:19Z

In fact, as things stand at the moment, this is not an Omni issue but rather a deployment issue.

Omni supports etcd HA, and you can run multiple Omni instances simultaneously. Only one of the instances will be active at any given time; that is, only it will pass readiness checks (it will listen on the API port, for example).

The traffic from the external Wireguard port should be delivered to the active Omni instance by means of Kubernetes (or any other LB/ingress), and if the active instance changes, it should be diverted to the new active one. (Same with HTTP ingress, but it should work out of the box, as only the active instance passes readiness checks.)

There's nothing we can do on the Omni side. Wireguard traffic can only be terminated at a single endpoint; there's no concept of HA Wireguard, as the packet itself can't be attached to any kind of session (the contents are encrypted).
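
To illustrate the routing rule described above, here is a minimal sketch of how an LB/ingress could pick the target. It is not Omni code: the instance names and the `is_ready` probe are hypothetical, and the only assumption taken from the thread is that exactly the active instance passes readiness checks.

```python
from typing import Callable, Iterable, Optional


def pick_active(instances: Iterable[str], is_ready: Callable[[str], bool]) -> Optional[str]:
    """Return the first instance whose readiness probe passes, or None if none does."""
    for instance in instances:
        if is_ready(instance):
            return instance
    return None


# Hypothetical instance names; only omni-b currently passes its readiness check.
ready = {"omni-a": False, "omni-b": True}
active = pick_active(["omni-a", "omni-b"], lambda name: ready[name])
print(active)
```

The same selection would apply to both the HTTP ingress and the forwarding of the external Wireguard port, so that both always point at the same active instance.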

utkuozdemir · 2025-01-15T10:50:26Z

I'm not sure if a single active Omni at any time + standby Omnis would count as HA.

We could maybe move the Wireguard termination to a separate process, as a dedicated singleton service, and aim to get everything else true HA.

smira · 2025-01-15T10:55:37Z

HA means that it can recover from failures; it doesn't necessarily mean that it runs in multiple instances. On the other hand, as long as Wireguard termination stays a SPOF, that doesn't make it HA either.

smira · 2025-01-15T11:30:06Z

In fact, any form of direct communication between Omni instances (as opposed to communication via etcd) would introduce tons of potential hard-to-debug failures. If you want a mesh of Omnis, any broken link between two of them would lead to such failures. So HA also shouldn't mean overcomplication.

utkuozdemir · 2025-01-15T11:32:26Z

Yes, I wouldn't suggest a mesh, but ideally independent, active instances. Yes, wg would be a SPOF, but when it is down, Omni would be degraded yet functional.

smira · 2025-01-15T11:39:40Z

I'm not quite sure what Omni could do if Wireguard is down. Omni can't talk to the machines, nor can the machines talk to Omni. So it's almost the same as being fully down, except that the UI is not down.

In the same way, you could run two instances as described above and terminate Wireguard at only one of them.

FedotCompot · 2025-01-15T13:55:17Z

It would be nice to be able to initialize Omni by configuring it like in the Talos factory: it would create a Talos ISO with a k8s cluster preconfigured with an omni namespace where balancing, HTTP and Wireguard routing are already configured, so you are not required to have any machines external to your Omni (Talos) cluster.

rothgar · 2025-01-15T17:24:42Z

Would a Wireguard floating VIP between Omni instances work for wg load balancing? Something like this article might be possible outside of Omni, but I'm not sure if machines would reconnect gracefully when the VIP switches.

An active/passive configuration is an OK option as long as state handling and failover are automatic. Customers are asking about solving the problem of a single instance and slow recovery of Omni. We can ignore a full active/active architecture for an initial design.

smira · 2025-01-16T09:57:04Z

What I meant above was kind of a "floating IP". There are many ways to configure things, with various trade-offs in terms of how fast Wireguard connections would recover, but the easiest thing one could do is the following:

deploy HA etcd
assume we have 2 machines to run Omni: A and B, with public IPs IP(A) and IP(B)
there's an HTTP ingress which delivers HTTP traffic to either Omni(A) or Omni(B) depending on which one passes readiness checks
Omni(A) advertises its WG endpoint as IP(A), Omni(B) as IP(B)
Talos monitors SideroLink connection status, so if it was connected e.g. to Omni(A), which went down, in approx. 3 minutes it will discover that the connection is broken and re-establish it: first it goes via the HTTP ingress (which will deliver the request to the live Omni(B) instance), then Omni(B) will advertise its own IP(B) to Talos, and the Wireguard connection is re-established
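
The steps above can be sketched as a small simulation. This is only a model of the described flow, not Omni or Talos code: the class, function names, and IP addresses (taken from the documentation range) are hypothetical, and the 180-second delay is just the "approx. 3 minutes" figure from the list, not a documented constant.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: taken from the "approx. 3 minutes" estimate above.
DETECTION_DELAY_SECONDS = 180


@dataclass
class OmniInstance:
    name: str
    wg_endpoint: str  # the endpoint this instance advertises for Wireguard
    ready: bool       # whether it currently passes readiness checks


def ingress_route(instances: List[OmniInstance]) -> Optional[OmniInstance]:
    """HTTP ingress: deliver the request to whichever instance is ready."""
    return next((inst for inst in instances if inst.ready), None)


def reestablish(instances: List[OmniInstance]) -> Optional[str]:
    """Talos side: go via the ingress and learn which WG endpoint to use."""
    active = ingress_route(instances)
    return active.wg_endpoint if active else None


omni_a = OmniInstance("omni-a", "203.0.113.10", ready=True)
omni_b = OmniInstance("omni-b", "203.0.113.20", ready=False)
instances = [omni_a, omni_b]

before = reestablish(instances)            # Talos connects to Omni(A)
omni_a.ready, omni_b.ready = False, True   # Omni(A) goes down, Omni(B) takes over
after = reestablish(instances)             # Talos retries once it notices the break

print(before, "->", after, f"(after ~{DETECTION_DELAY_SECONDS}s)")
```

Note that nothing in this flow needs the two Omni instances to talk to each other; the only shared pieces are etcd and the HTTP ingress.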

The case above has relatively high latency to recover connections with Talos, but I think it's also the simplest possible setup.

A more involved setup would require sending traffic from a single public Wireguard IP to the active instance and making sure Omni sees the public endpoint of the Talos peer; then it can recover Wireguard connections really fast.

rothgar added this to Product Jan 14, 2025

rothgar moved this to Ideas in Product Jan 14, 2025
