ERROR cannot write leadership settings: cannot write settings: not the leader

Bug #1845353 reported by Joshua Genet
This bug affects 1 person
Affects: Kubernetes Control Plane Charm
Status: Won't Fix
Importance: High
Assigned to: Eric Chen

Bug Description

We've seen this across several other charms. We think something changed in Juju that requires an update in the charm. Here are some of the related bugs:

https://bugs.launchpad.net/charm-ceph-mon/+bug/1760138
https://bugs.launchpad.net/charms/+source/rabbitmq-server/+bug/1654116
https://bugs.launchpad.net/juju/+bug/1833089

2019-09-25 08:16:50 DEBUG etcd-relation-changed ERROR cannot write leadership settings: cannot write settings: not the leader
2019-09-25 08:16:50 DEBUG etcd-relation-changed Traceback (most recent call last):
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/charm/hooks/etcd-relation-changed", line 22, in <module>
2019-09-25 08:16:50 DEBUG etcd-relation-changed main()
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 83, in main
2019-09-25 08:16:50 DEBUG etcd-relation-changed hookenv._run_atexit()
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/.venv/lib/python3.6/site-packages/charmhelpers/core/hookenv.py", line 1220, in _run_atexit
2019-09-25 08:16:50 DEBUG etcd-relation-changed callback(*args, **kwargs)
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/charm/reactive/kubernetes_master.py", line 516, in set_final_status
2019-09-25 08:16:50 DEBUG etcd-relation-changed get_dns_provider()
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/charm/reactive/kubernetes_master.py", line 2427, in get_dns_provider
2019-09-25 08:16:50 DEBUG etcd-relation-changed leader_set(auto_dns_provider=dns_provider)
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/.venv/lib/python3.6/site-packages/charms/reactive/decorators.py", line 219, in _wrapped
2019-09-25 08:16:50 DEBUG etcd-relation-changed return func(*args, **kwargs)
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "lib/charms/leadership.py", line 62, in leader_set
2019-09-25 08:16:50 DEBUG etcd-relation-changed hookenv.leader_set(settings)
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/.venv/lib/python3.6/site-packages/charmhelpers/core/hookenv.py", line 1043, in inner_translate_exc2
2019-09-25 08:16:50 DEBUG etcd-relation-changed return f(*args, **kwargs)
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/.venv/lib/python3.6/site-packages/charmhelpers/core/hookenv.py", line 1104, in leader_set
2019-09-25 08:16:50 DEBUG etcd-relation-changed subprocess.check_call(cmd)
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
2019-09-25 08:16:50 DEBUG etcd-relation-changed raise CalledProcessError(retcode, cmd)
2019-09-25 08:16:50 DEBUG etcd-relation-changed subprocess.CalledProcessError: Command '['leader-set', 'auto_dns_provider=core-dns']' returned non-zero exit status 1.
2019-09-25 08:16:50 ERROR juju.worker.uniter.operation runhook.go:132 hook "etcd-relation-changed" failed: exit status 1
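
For orientation, here is a minimal sketch of the call chain in the traceback (the function names come from the traceback; the bodies are simplified illustrations, not the charm's actual code):

    import subprocess

    def leader_set(**settings):
        # charmhelpers ultimately shells out to the `leader-set` hook tool; if the
        # unit is not the Juju leader, the tool exits non-zero and check_call
        # raises CalledProcessError, which fails the whole hook.
        cmd = ['leader-set'] + ['{}={}'.format(k, v) for k, v in settings.items()]
        subprocess.check_call(cmd)

    def get_dns_provider():
        # Simplified stand-in for reactive/kubernetes_master.py; the real function
        # derives dns_provider from charm config and existing leader data.
        dns_provider = 'core-dns'
        leader_set(auto_dns_provider=dns_provider)  # raises if leadership was lost
        return dns_provider

    def set_final_status():
        # Runs at the end of every hook via hookenv._run_atexit(), so a mid-hook
        # leadership change surfaces as a failure of whatever hook is running,
        # here etcd-relation-changed.
        get_dns_provider()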

Revision history for this message
Joshua Genet (genet022) wrote :
Revision history for this message
Joshua Genet (genet022) wrote :

Probably not needed, but here's the juju-crashdump-controller as well.

Revision history for this message
Joshua Genet (genet022) wrote :

And to be explicit, since I'm currently testing 1.16: this was in a 1.15 run, so it's not a 1.16 release-blocker or anything.

Revision history for this message
Tim Van Steenburgh (tvansteenburgh) wrote :

What rev of the kubernetes-master charm was this? This looks a lot like https://bugs.launchpad.net/charm-kubernetes-master/+bug/1833089, which was fixed in rev 695. See also https://discourse.jujucharms.com/t/charmed-kubernetes-users-todo-before-upgrading-models-to-2-6-4/1594.

Changed in charm-kubernetes-master:
status: New → Incomplete
Revision history for this message
George Kraft (cynerva) wrote :

This looks very similar to https://bugs.launchpad.net/charm-kubernetes-master/+bug/1833089 but the kubernetes-master charm revision you're testing with (724) includes the fix for that. This is something new.

It's weird, but it looks like kubernetes-master-1 temporarily became confused about its state as a follower. Leading up to the failure, it thought it was a follower:

unit-kubernetes-master-1: 08:10:03 INFO unit.kubernetes-master/1.juju-log kube-control:5: Initializing Leadership Layer (is follower)
unit-kubernetes-master-1: 08:11:30 INFO unit.kubernetes-master/1.juju-log Initializing Leadership Layer (is follower)
unit-kubernetes-master-1: 08:13:32 INFO unit.kubernetes-master/1.juju-log etcd:8: Initializing Leadership Layer (is follower)

However, on the hook that failed, it thought it was a leader:

unit-kubernetes-master-1: 08:14:56 INFO unit.kubernetes-master/1.juju-log etcd:8: Initializing Leadership Layer (is leader)
...
unit-kubernetes-master-1: 08:16:50 DEBUG unit.kubernetes-master/1.etcd-relation-changed ERROR cannot write leadership settings: cannot write settings: not the leader

After that, it correctly returned to seeing itself as a follower:

unit-kubernetes-master-1: 08:17:20 INFO unit.kubernetes-master/1.juju-log etcd:8: Initializing Leadership Layer (is follower)

So this looks like it may be a bug in layer-leadership or in juju. Needs further investigation.
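
For context, a rough sketch of the window described above, assuming layer-leadership snapshots leadership state once at hook start (the flag name below is an approximation, not verified against the layer):

    from charmhelpers.core import hookenv
    from charms.reactive import set_flag, clear_flag

    def initialize_leadership_state():
        # Runs at hook start; the result is effectively cached for the rest of
        # the hook, matching the "Initializing Leadership Layer" log lines above.
        if hookenv.is_leader():
            hookenv.log('Initializing Leadership Layer (is leader)')
            set_flag('leadership.is_leader')   # approximate flag name
        else:
            hookenv.log('Initializing Leadership Layer (is follower)')
            clear_flag('leadership.is_leader')

    # If the controller moves the lease after this snapshot but before a handler
    # calls leader_set(), the handler still believes it is the leader and the
    # `leader-set` hook tool fails with "not the leader".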

Revision history for this message
George Kraft (cynerva) wrote :

Given that the unit recovered on its own, I would not consider this a high priority issue.

Changed in charm-kubernetes-master:
status: Incomplete → Confirmed
Revision history for this message
Ashley Lai (alai) wrote :

I am seeing a similar failure, but it is on kubernetes-master/1. Crash dump is attached.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Sub'd to field-high. Solutions-QA runs all tests with retries disabled, so this will not recover on its own and it blocks our testing. We need charms to run without raising errors when they're working properly. When we retry, we can mask configuration and charm bugs, and it also makes it difficult to isolate failures and the logs associated with them. OpenStack Engineering also follows this policy, and we'd like to see the Kubernetes team follow it too.

Changed in charm-kubernetes-master:
importance: Undecided → High
Changed in charm-kubernetes-master:
assignee: nobody → Kevin W Monroe (kwmonroe)
George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: Confirmed → Triaged
Changed in charm-kubernetes-master:
assignee: Kevin W Monroe (kwmonroe) → nobody
Eric Chen (eric-chen)
Changed in charm-kubernetes-master:
assignee: nobody → YenHung Chen (eric-chen)
Revision history for this message
Eric Chen (eric-chen) wrote (last edit):

More information provided by Joshua:
https://solutions.qa.canonical.com/bugs/bugs/bug/1845353

* Cannot find a way to reproduce it manually.
* The last time this issue happened was in January 2020, in the QA environment.
* From the logs, it looks like a race condition: master_0 is the leader most of the time, but at one point master_1 acquires leadership and tries to do leader-set; in the meantime master_0 takes leadership back, which causes the exception (see the sketch after the code link below).

related code:
https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/fe4499dfb6a0996ee97228ae75ee5f70ae2c944f/reactive/kubernetes_master.py#L2458
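
To illustrate the race (a hedged sketch only; this is not the charm's actual code, which is at the link above): any pattern that checks leadership and then writes leader settings later in the same hook has a window in which the lease can move.

    from charmhelpers.core import hookenv

    def get_dns_provider():
        dns_provider = 'core-dns'       # derived from config in the real charm
        if hookenv.is_leader():         # time of check: this unit holds the lease
            # ... other work runs here; the lease can expire or move to another unit ...
            hookenv.leader_set(auto_dns_provider=dns_provider)  # time of use: may now fail
        return dns_provider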

Revision history for this message
Eric Chen (eric-chen) wrote (last edit):

According to this discussion thread, we cannot avoid the race condition.
https://bugs.launchpad.net/juju/+bug/1898966

"Leadership lease expiry is managed separately in the controller, so there will always be a window where a hook may start and leadership could be lost part way through."

Therefore, I have decided not to solve the race condition here.
We should keep watching for any bug that triggers unexpected leadership changes.
However, I cannot find the root cause in the logs, so I will stop here and wait for more information.

Revision history for this message
Alvaro Uria (aluria) wrote :

In addition to comment #10, all "leader-set" calls would need to be wrapped in try-except clauses to make sure a sudden change of leadership (after is-leader has been checked) is handled by the charm. However, this approach would reduce charm readability and has been discouraged by the charm maintainers.
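
For completeness, a sketch of the kind of wrapper being discussed and discouraged here (a hypothetical helper, not charm code):

    import subprocess
    from charmhelpers.core import hookenv

    def try_leader_set(**settings):
        """Attempt leader-set, tolerating a leadership change mid-hook."""
        try:
            hookenv.leader_set(settings)
            return True
        except subprocess.CalledProcessError:
            # Leadership was lost between the is-leader check and this call;
            # log and carry on instead of failing the hook.
            hookenv.log('leader-set failed: no longer the leader', hookenv.WARNING)
            return False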

I would suggest the following:
1) CI env: allow unit errors, followed by juju resolve commands, for a limited number of retries (e.g. 3)
2) Mark this bug as "won't fix", given Juju's design, charm simplicity (no try-excepts for this), and the inability to reproduce the issue consistently

Revision history for this message
George Kraft (cynerva) wrote :

Marking this as "won't fix" for the reasons given above. Thanks for looking into this, folks.

Changed in charm-kubernetes-master:
status: Triaged → Won't Fix