k8s controllers not HA

Bug #1849030 reported by Ian Booth
This bug affects 5 people
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

k8s controllers do not support enable-ha

We need a strategy to support HA k8s controller deployments

Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Wishlist → Low
tags: added: expirebugs-bot
Camille Rodriguez (camille.rodriguez) wrote :

This bug should be marked as higher importance. We are using more and more k8s controllers for deploying applications in kubernetes, and the fact that the juju controller in kubernetes is not HA makes deploying applications with juju risky.

Andreea Munteanu (munteanuandreea) wrote :

In general, a juju controller is deployed in HA: there are 3 copies deployed, so if one controller goes down, we can still control the cloud. That's how it's done for maas controllers, openstack controllers, etc. But HA is not available for kubernetes, which means that if the controller pod becomes unresponsive, you lose all access to the cloud via juju. That means no more operations with juju, no control over upgrades or scaling out, etc.

In theory you can recover control over a deployment with a lost controller by restoring from a backup, but in reality, Bootstack has told me that sometimes the only way is to redeploy the whole thing.

Bootstack raised this concern when we started the xperi deployment, and I'm sure they'll want to help put pressure on the juju team to get this item prioritized now that they are in charge of a few application deployments (both kubeflow and kafka).

Andreea Munteanu (munteanuandreea) wrote :

This conversation is also a good reference: https://chat.canonical.com/canonical/pl/hnnq88fsj7notpxc7gyzs75yey

Please bear in mind that XPeri is already running in production and this issue can lead to data loss.

Diko Parvanov (dparv) wrote :

Not having a juju controller in HA presents a single point of failure - backups and restores have worked on normal juju controllers, but I have never seen one done on a controller running in k8s. Having the controller go down means we have to re-deploy everything - and if that is a kubeflow deployment, that means tearing down a production cloud. If it's COS, we can live with losing some data, but not when it is running production workloads.

If we can make sure that when the pod holding the controller data gets destroyed on a worker, it can be safely re-created by kubernetes in another pod, so that the controller is fully operational and all resources deployed with juju on top of k8s remain manageable - then enable-ha will not be required.

Ian Booth (wallyworld) wrote :

We will implement HA when the transition away from mongo to dqlite is done. The dqlite enablement is being done next cycle, so realistically it's at least 6 months away.

The controller database is stored on a PV which will be reattached to any replacement pod if/when the controller pod bounces and is rescheduled. E.g. you can manually delete the controller pod, and the new one comes up and continues operating. There will be a slight glitch during this time, but the transition will be fast and the worker agents will reconnect.

Are there specific production outages that have been observed or is it just a theoretical "there's only one controller pod and so that's a SPOF" type scenario?

Diko Parvanov (dparv) wrote :

For now it's just theoretical because we haven't had production customers running anything with k8s juju controllers. We are currently onboarding customers with COS on microk8s and kubeflow. But yes, I have tested the pod destruction and the controller was still operational once the new pod was spawned. I am just concerned about what will happen if, while the controller is down, a juju agent executes something, e.g. breaks relation/leadership data or corrupts a unit db.
