k8s service types reset after a reboot

Bug #2011814 reported by James Page
This bug affects 4 people
Affects          Status   Importance  Assigned to  Milestone
Canonical Juju   Triaged  Undecided   Unassigned
OpenStack Snap   Triaged  High        Unassigned

Bug Description

After a successful installation of microstack (sunbeam/edge channel) a reboot of the server running the install results in a broken deployment.

AFAICT, the service type patching that traefik, ovn-relay and rabbitmq perform to advertise themselves externally (as type LoadBalancer) gets reverted; traefik clears its gateway address from all relations, and all services consider themselves unconfigured as a result.
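For reference, the "service type patching" here amounts to applying a merge patch against the Service's spec.type. A minimal sketch of what such a patch body looks like (a hypothetical helper for illustration; the actual charms use the k8s service patcher charm lib against the API server):

```python
# Illustrative sketch: the JSON merge patch body a charm's service patcher
# would apply to flip its Kubernetes Service from ClusterIP to LoadBalancer.
# (Hypothetical helper name; the real charms use a charm library for this.)
import json


def service_type_patch(service_type: str = "LoadBalancer") -> dict:
    """Build the merge-patch body for changing a Service's type."""
    return {"spec": {"type": service_type}}


print(json.dumps(service_type_patch()))
```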

Tags: k8s
Revision history for this message
James Page (james-page) wrote :

Services should look like this:

$ microk8s.kubectl get service --namespace openstack
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
modeloperator ClusterIP 10.152.183.82 <none> 17071/TCP 7m47s
certificate-authority ClusterIP 10.152.183.73 <none> 65535/TCP 7m43s
certificate-authority-endpoints ClusterIP None <none> <none> 7m43s
traefik-endpoints ClusterIP None <none> <none> 7m31s
rabbitmq-endpoints ClusterIP None <none> <none> 7m25s
nova-endpoints ClusterIP None <none> <none> 7m20s
horizon-endpoints ClusterIP None <none> <none> 7m18s
placement-endpoints ClusterIP None <none> <none> 7m15s
neutron-endpoints ClusterIP None <none> <none> 7m12s
glance-endpoints ClusterIP None <none> <none> 7m9s
mysql ClusterIP 10.152.183.151 <none> 65535/TCP 7m8s
mysql-endpoints ClusterIP None <none> <none> 7m3s
ovn-relay-endpoints ClusterIP None <none> <none> 6m59s
ovn-central-endpoints ClusterIP None <none> <none> 6m55s
keystone-endpoints ClusterIP None <none> <none> 6m48s
traefik LoadBalancer 10.152.183.57 10.177.200.170 80:30024/TCP,443:32396/TCP 7m35s
rabbitmq LoadBalancer 10.152.183.195 10.177.200.171 5672:31495/TCP,15672:31719/TCP 7m30s
horizon ClusterIP 10.152.183.225 <none> 80/TCP 7m22s
placement ClusterIP 10.152.183.118 <none> 8778/TCP 7m18s
neutron ClusterIP 10.152.183.240 <none> 9696/TCP 7m16s
nova ClusterIP 10.152.183.33 <none> 8774/TCP 7m24s
ovn-relay LoadBalancer 10.152.183.95 10.177.200.172 6642:31729/TCP 7m4s
glance ClusterIP 10.152.183.45 <none> 9292/TCP 7m12s
ovn-central ClusterIP 10.152.183.219 <none> 6641/TCP,6642/TCP 6m59s
keystone ClusterIP 10.152.183.226 <non...


Revision history for this message
James Page (james-page) wrote :

Raising a bug task for Juju

Deployments are using the 3.1 track

tags: added: k8s
Revision history for this message
Juan M. Tirado (tiradojm) wrote :

Could we have the steps to reproduce it? Just to be sure we have the same scenario.

Changed in juju:
status: New → Triaged
Revision history for this message
James Page (james-page) wrote (last edit ):

snap install microk8s --channel 1.25-strict/stable
sudo microk8s enable dns hostpath-storage
sudo microk8s enable metallb <supply range of IP's to use>

bootstrap a Juju controller on microk8s

juju add-model testing
juju deploy --trust --channel 3.11/beta rabbitmq-k8s

the RabbitMQ charm will patch its service definition to LoadBalancer

microk8s.kubectl get service --namespace testing

reboot the machine

microk8s.kubectl get service --namespace testing

the service definition will have reverted to ClusterIP.
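A quick way to confirm the revert (a sketch; the service name and namespace are assumed from the steps above) is to inspect spec.type in the JSON that `microk8s.kubectl get service rabbitmq -n testing -o json` returns:

```python
# Sketch: check whether a Service has reverted from LoadBalancer, given the
# JSON output of `kubectl get service <name> -n <namespace> -o json`.
import json


def has_reverted(service_json: str) -> bool:
    """Return True if the Service is no longer of type LoadBalancer."""
    svc = json.loads(service_json)
    return svc["spec"]["type"] != "LoadBalancer"


# What you would see after the reboot described above:
sample = '{"spec": {"type": "ClusterIP"}}'
print(has_reverted(sample))  # True
```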

James Page (james-page)
affects: snap-sunbeam → snap-openstack
Revision history for this message
James Page (james-page) wrote :

The K8S service patcher lib at v1 now also patches the charm's service definition on an update-status hook execution; this works around the issue.

{22.03,yoga}/edge charms should get this/have this already.
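The mechanics of that workaround can be sketched with a toy model (all names here are hypothetical; the real charms use the ops framework and a Kubernetes client rather than plain objects):

```python
# Toy model of why re-patching on update-status masks the revert.
# FakeService stands in for the Kubernetes Service object; the real
# patcher talks to the API server instead of mutating an attribute.
class FakeService:
    def __init__(self):
        self.type = "ClusterIP"


def charm_patch(svc):
    # What the service patcher does when the charm first sets up.
    svc.type = "LoadBalancer"


def post_reboot_revert(svc):
    # Whatever reverts the patch after the node reboots.
    svc.type = "ClusterIP"


def update_status_hook(svc):
    # v1 of the patcher lib re-applies the patch on every update-status.
    charm_patch(svc)


svc = FakeService()
charm_patch(svc)
post_reboot_revert(svc)   # the service silently reverts
update_status_hook(svc)   # the next update-status fixes it
print(svc.type)           # LoadBalancer
```

This also shows why it is a workaround, not a fix: between the revert and the next update-status, the service is broken.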

James Page (james-page)
Changed in snap-openstack:
status: New → Triaged
importance: Undecided → High
Revision history for this message
James Page (james-page) wrote :

The traefik charm still has this issue, as it won't patch its service definition in update-status calls.

I've also seen this situation occur in two deployments where no restarts had happened.

Revision history for this message
Giuseppe Petralia (peppepetra) wrote :

This bug also affects COS deployment.

From time to time we find traefik blocked waiting for an external IP, resulting in COS not being available.

The workaround is to delete the traefik pod so that it is redeployed with the external IP available.

Revision history for this message
John A Meinel (jameinel) wrote :

Do we have a theory as to why the patching that "traefik, ovn-relay and rabbitmq perform to advertise themselves externally as type LoadBalancer" gets reverted?

Is this Traefik patching its service, and then post-reboot juju rewrites this on startup to remove the patch, but the charm is missing a hook to be able to put it back?

Is this something about how the pod comes back up post reboot?

The fact that the "workaround is to delete the traefik pod" sounds like something where the steps that Juju is doing might be correct (otherwise the Traefik pod would be configured incorrectly.)

I don't have deep insight into what is happening.

As for the reproduction steps:

snap install microk8s --channel 1.25-strict/stable
sudo microk8s enable dns hostpath-storage
sudo microk8s enable metallb <supply range of IP's to use>

^ What range of IPs should be selected? Does it need to be host routable, externally routable, etc.?
Can you test this in LXD containers, do you need VMs, do you need host machines?

I can see in James Page's examples that there are 2 IP ranges:
 traefik LoadBalancer 10.152.183.57 10.177.200.170 80:30024/TCP,443:32396/TCP 7m35s

I'm guessing from everything else that 10.152.183.* is inside the cluster and that 10.177.200.* is meant to be the externally routable range.

Revision history for this message
Simon Aronsson (0x12b) wrote :

It is correct that Traefik was missing a hook (pebble-ready, more specifically); however, this feels like a workaround rather than a solution. The more interesting question, in my eyes, is why Juju reverts the patch in the first place.

Revision history for this message
Nishant Dash (dash3) wrote (last edit ):

In the case of the traefik charm, I was able to reproduce it as well with James' reproducer. The reason recreating the pod works in traefik rev 110 is that it calls the k8s service patch on the install and upgrade-charm hooks after the pod is recreated, both of which do a Kubernetes service patch.

From rev 123 of traefik onwards, traefik started patching the service on every update-status hook.

However, just recently it was patched with [1], which showed that the continuous patching was not needed and only the pebble-ready event was missing. That being said, the fact still remains that Juju reverts the patch.

As for the IPs to supply, you can give it any IPs; it does not matter. I have personally tested this in a single VM.

[1] https://github.com/canonical/traefik-k8s-operator/pull/258

Revision history for this message
John A Meinel (jameinel) wrote :

Patching a service is always going to be racy, as Juju's role is "here is a model; modify the world to make this so". You can get lucky and change things while Juju isn't looking, but whenever it decides to double-check that everything is consistent, it will evaluate the state of the world and how that differs from its internal model.

The real answer is to change the juju<->charm interaction so that the charm updates Juju's model, so that Juju enforces what you need. In the short term, you need to exercise diligence, because something like 'add-unit' could easily cause Juju to re-evaluate the state of the world and see that it doesn't match expectations.

Juju *doesn't* follow the terraform model of tracking a 'these are the things that I applied relative to the current plan', and ignoring any other changes.

I don't know the specifics of why we are looking at the service definition and changing it, but in the general sense we are always trying to make the world match our model.
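The contrast between the two models can be sketched with purely illustrative dictionaries (this is not Juju or terraform code, just the shape of the two reconciliation styles):

```python
# Illustrative contrast between full-model reconciliation (Juju-style:
# make the world match the model) and applied-set tracking (terraform-style:
# only enforce the fields this tool previously applied).
def full_reconcile(model: dict, world: dict) -> None:
    # Make the world match the model, clobbering out-of-band edits.
    for key, want in model.items():
        world[key] = want


def tracked_reconcile(applied: set, model: dict, world: dict) -> None:
    # Only enforce keys this tool applied earlier; ignore everything else.
    for key in applied & model.keys():
        world[key] = model[key]


model = {"service.type": "ClusterIP"}
world = {"service.type": "LoadBalancer"}  # a charm patched this out-of-band
full_reconcile(model, world)
print(world["service.type"])              # ClusterIP: the patch is reverted
```

Under the tracked style, a field the tool never applied would be left alone, which is why a charm's out-of-band patch survives terraform-like tools but not a full-model reconciler.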
