Non-leader units get stuck with idle/waiting for leadership status
Bug #1903313 reported by Camille Rodriguez
This bug affects 4 people
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
CoreDNS Charm | Fix Released | High | Unassigned | 1.24
MetalLB Operator | Fix Released | High | Unassigned | 1.25
Bug Description
When deploying metallb in a multi-node setup, juju status reports the "leader" speaker as active/idle, but the non-leader units as waiting for leadership.
This seems to be only a juju status bug; it does not affect how the metallb speakers are deployed in Kubernetes. Since the speaker is a daemonset, the Kubernetes status shows the speaker pods deploying on each node.
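For anyone reproducing this, here is a minimal Python sketch of that cross-check, reading the DaemonSet from the Kubernetes API directly rather than trusting juju status. The namespace and resource names below (a model named "metallb-system", an application named "metallb-speaker", and the juju-app pod label) are assumptions; adjust them to your deployment.

```python
# Minimal sketch: check the speaker DaemonSet in Kubernetes directly,
# independently of what `juju status` reports.
# ASSUMPTIONS: the model/namespace is "metallb-system", the application
# is "metallb-speaker", and Juju labels the workload pods with "juju-app".
from kubernetes import client, config

config.load_kube_config()

apps = client.AppsV1Api()
ds = apps.read_namespaced_daemon_set("metallb-speaker", "metallb-system")
print(f"desired: {ds.status.desired_number_scheduled}, "
      f"ready: {ds.status.number_ready}")

# The individual speaker pods and the nodes they were scheduled on.
core = client.CoreV1Api()
pods = core.list_namespaced_pod(
    "metallb-system", label_selector="juju-app=metallb-speaker")
for pod in pods.items:
    print(pod.metadata.name, pod.spec.node_name, pod.status.phase)
```

If the daemonset reports every node ready here while juju status still shows units waiting for leadership, that supports the description above: only the reported status is wrong.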
However, this bug shows that multi-node setups, such as a microk8s cluster or a Charmed Kubernetes deployment, should be tested more thoroughly. I am planning to run more tests in a multi-node scenario to find out whether anything else is broken.
Cheers!
Changed in operator-metallb:
importance: Undecided → Medium
status: New → Triaged
tags: added: sts
Changed in charm-coredns:
status: Incomplete → New
Changed in charm-coredns:
status: Triaged → Fix Committed
Changed in operator-metallb:
status: Triaged → Fix Committed
Changed in charm-coredns:
assignee: nobody → Peter De Sousa (pjds)
Changed in operator-metallb:
status: Fix Committed → Triaged
Changed in charm-coredns:
milestone: none → 1.23+ck1
tags: added: backport-needed
Changed in charm-coredns:
assignee: Peter De Sousa (pjds) → nobody
Changed in charm-coredns:
milestone: 1.23+ck1 → 1.24
Changed in charm-coredns:
status: Fix Committed → Fix Released
Changed in charm-coredns:
status: Triaged → Fix Committed
tags: removed: backport-needed
Changed in charm-coredns:
status: Fix Committed → Fix Released
Changed in operator-metallb:
milestone: none → 1.25
status: Triaged → Fix Committed
Changed in operator-metallb:
status: Fix Committed → Fix Released
This is an issue with the current approach Juju takes for associating units with workload pods. In theory, a "unit" in Juju represents one of the pods of the workload. In practice, these "units" are only logical: the charm runs in a single, separate operator pod, which is then invoked in the context of each logical unit. The relationship between the charm and the workload via Juju is one-way: in the "leader" context the charm can set the pod spec, but no unit context has any way (via Juju) to check the status of the workload pod that the unit is associated with.
So, we could change the charm to report "active" for all unit contexts, but that still might not match the reality of the workload pod. But perhaps there are some non-Juju ways for the charm to query the status of the workload pod?
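One such non-Juju way is sketched below: each unit context asks the Kubernetes API for its own workload pod and sets unit status from the pod phase, instead of waiting on leadership. This is illustrative only, not the actual fix in either charm; it assumes the ops framework, in-cluster API access from the operator pod, and the usual `<app>-<n>` pod naming that Juju uses for podspec charms.

```python
# Hypothetical sketch of the "non-Juju way" suggested above: report each
# unit's status from its workload pod rather than from leadership.
# ASSUMPTIONS: ops framework, `kubernetes` client available in the operator
# pod, and workload pods named like "metallb-speaker-1" for unit
# "metallb-speaker/1" in a namespace matching the model name.
from kubernetes import client, config
from ops.charm import CharmBase
from ops.main import main
from ops.model import ActiveStatus, WaitingStatus


class SpeakerCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.update_status, self._on_update_status)

    def _on_update_status(self, event):
        config.load_incluster_config()  # the operator pod runs in-cluster
        core = client.CoreV1Api()
        # e.g. unit "metallb-speaker/1" -> pod "metallb-speaker-1"
        pod_name = self.unit.name.replace("/", "-")
        pod = core.read_namespaced_pod(pod_name, self.model.name)
        if pod.status.phase == "Running":
            self.unit.status = ActiveStatus()
        else:
            self.unit.status = WaitingStatus(f"pod phase: {pod.status.phase}")


if __name__ == "__main__":
    main(SpeakerCharm)
```

The trade-off is that the charm now talks to the Kubernetes API behind Juju's back, so its reported status can drift from what Juju itself knows about the application.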
This won't apply to the upcoming sidecar approach, but I don't know much about how that will work.