ERROR cannot write leadership settings: cannot write settings: not the leader

Bug #1845353 reported by Joshua Genet
This bug affects 1 person
Affects: Kubernetes Control Plane Charm
Status: Won't Fix
Importance: High
Assigned to: Eric Chen

Bug Description

We've seen this across several other charms. We think something changed in Juju that requires an update in the charm. Here are some of the related bugs:

https://bugs.launchpad.net/charm-ceph-mon/+bug/1760138
https://bugs.launchpad.net/charms/+source/rabbitmq-server/+bug/1654116
https://bugs.launchpad.net/juju/+bug/1833089

2019-09-25 08:16:50 DEBUG etcd-relation-changed ERROR cannot write leadership settings: cannot write settings: not the leader
2019-09-25 08:16:50 DEBUG etcd-relation-changed Traceback (most recent call last):
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/charm/hooks/etcd-relation-changed", line 22, in <module>
2019-09-25 08:16:50 DEBUG etcd-relation-changed main()
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 83, in main
2019-09-25 08:16:50 DEBUG etcd-relation-changed hookenv._run_atexit()
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/.venv/lib/python3.6/site-packages/charmhelpers/core/hookenv.py", line 1220, in _run_atexit
2019-09-25 08:16:50 DEBUG etcd-relation-changed callback(*args, **kwargs)
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/charm/reactive/kubernetes_master.py", line 516, in set_final_status
2019-09-25 08:16:50 DEBUG etcd-relation-changed get_dns_provider()
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/charm/reactive/kubernetes_master.py", line 2427, in get_dns_provider
2019-09-25 08:16:50 DEBUG etcd-relation-changed leader_set(auto_dns_provider=dns_provider)
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/.venv/lib/python3.6/site-packages/charms/reactive/decorators.py", line 219, in _wrapped
2019-09-25 08:16:50 DEBUG etcd-relation-changed return func(*args, **kwargs)
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "lib/charms/leadership.py", line 62, in leader_set
2019-09-25 08:16:50 DEBUG etcd-relation-changed hookenv.leader_set(settings)
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/.venv/lib/python3.6/site-packages/charmhelpers/core/hookenv.py", line 1043, in inner_translate_exc2
2019-09-25 08:16:50 DEBUG etcd-relation-changed return f(*args, **kwargs)
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/var/lib/juju/agents/unit-kubernetes-master-1/.venv/lib/python3.6/site-packages/charmhelpers/core/hookenv.py", line 1104, in leader_set
2019-09-25 08:16:50 DEBUG etcd-relation-changed subprocess.check_call(cmd)
2019-09-25 08:16:50 DEBUG etcd-relation-changed File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
2019-09-25 08:16:50 DEBUG etcd-relation-changed raise CalledProcessError(retcode, cmd)
2019-09-25 08:16:50 DEBUG etcd-relation-changed subprocess.CalledProcessError: Command '['leader-set', 'auto_dns_provider=core-dns']' returned non-zero exit status 1.
2019-09-25 08:16:50 ERROR juju.worker.uniter.operation runhook.go:132 hook "etcd-relation-changed" failed: exit status 1
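
For orientation, here is a minimal sketch of the call chain in the traceback (the function names come from the traceback; the bodies are simplified illustrations, not the charm's actual code):

    import subprocess

    def leader_set(**settings):
        # charmhelpers ultimately shells out to the `leader-set` hook tool; if the
        # unit is not the Juju leader, the tool exits non-zero and check_call
        # raises CalledProcessError, which fails the whole hook.
        cmd = ['leader-set'] + ['{}={}'.format(k, v) for k, v in settings.items()]
        subprocess.check_call(cmd)

    def get_dns_provider():
        # Simplified stand-in for reactive/kubernetes_master.py; the real function
        # derives dns_provider from charm config and existing leader data.
        dns_provider = 'core-dns'
        leader_set(auto_dns_provider=dns_provider)  # raises if leadership was lost
        return dns_provider

    def set_final_status():
        # Runs at the end of every hook via hookenv._run_atexit(), so a mid-hook
        # leadership change surfaces as a failure of whatever hook is running,
        # here etcd-relation-changed.
        get_dns_provider()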

Revision history for this message
Joshua Genet (genet022) wrote :
Revision history for this message
Joshua Genet (genet022) wrote :

Probably not needed, but here's the juju-crashdump-controller as well.

Revision history for this message
Joshua Genet (genet022) wrote :

And to be explicit, since I'm currently testing 1.16: this was in a 1.15 run, so it's not a 1.16 release-blocker or anything.

Revision history for this message
Tim Van Steenburgh (tvansteenburgh) wrote :

What rev of the kubernetes-master charm was this? This looks a lot like https://bugs.launchpad.net/charm-kubernetes-master/+bug/1833089, which was fixed in rev 695. See also https://discourse.jujucharms.com/t/charmed-kubernetes-users-todo-before-upgrading-models-to-2-6-4/1594.

Changed in charm-kubernetes-master:
status: New → Incomplete
Revision history for this message
George Kraft (cynerva) wrote :

This looks very similar to https://bugs.launchpad.net/charm-kubernetes-master/+bug/1833089 but the kubernetes-master charm revision you're testing with (724) includes the fix for that. This is something new.

It's weird, but it looks like kubernetes-master-1 temporarily became confused about its state as a follower. Leading up to the failure, it thought it was a follower:

unit-kubernetes-master-1: 08:10:03 INFO unit.kubernetes-master/1.juju-log kube-control:5: Initializing Leadership Layer (is follower)
unit-kubernetes-master-1: 08:11:30 INFO unit.kubernetes-master/1.juju-log Initializing Leadership Layer (is follower)
unit-kubernetes-master-1: 08:13:32 INFO unit.kubernetes-master/1.juju-log etcd:8: Initializing Leadership Layer (is follower)

However, on the hook that failed, it thought it was a leader:

unit-kubernetes-master-1: 08:14:56 INFO unit.kubernetes-master/1.juju-log etcd:8: Initializing Leadership Layer (is leader)
...
unit-kubernetes-master-1: 08:16:50 DEBUG unit.kubernetes-master/1.etcd-relation-changed ERROR cannot write leadership settings: cannot write settings: not the leader

After that, it correctly returned to seeing itself as a follower:

unit-kubernetes-master-1: 08:17:20 INFO unit.kubernetes-master/1.juju-log etcd:8: Initializing Leadership Layer (is follower)

So this looks like it may be a bug in layer-leadership or in juju. Needs further investigation.
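
For context, a rough sketch of the window described above, assuming layer-leadership snapshots leadership state once at hook start (the flag name below is an approximation, not verified against the layer):

    from charmhelpers.core import hookenv
    from charms.reactive import set_flag, clear_flag

    def initialize_leadership_state():
        # Runs at hook start; the result is effectively cached for the rest of
        # the hook, matching the "Initializing Leadership Layer" log lines above.
        if hookenv.is_leader():
            hookenv.log('Initializing Leadership Layer (is leader)')
            set_flag('leadership.is_leader')   # approximate flag name
        else:
            hookenv.log('Initializing Leadership Layer (is follower)')
            clear_flag('leadership.is_leader')

    # If the controller moves the lease after this snapshot but before a handler
    # calls leader_set(), the handler still believes it is the leader and the
    # `leader-set` hook tool fails with "not the leader".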

Revision history for this message
George Kraft (cynerva) wrote :

Given that the unit recovered on its own, I would not consider this a high priority issue.

Changed in charm-kubernetes-master:
status: Incomplete → Confirmed
Revision history for this message
Ashley Lai (alai) wrote :

I am seeing a similar failure, but it is on kubernetes-master/1. Crash dump is attached.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Sub'd to field-high. Solutions-QA runs all tests with retries disabled, so this will not recover on its own and it blocks our testing. We need charms to run without raising errors when they're working properly. When we retry, we can mask configuration and charm bugs, and it also makes it difficult to isolate failures and the logs associated with them. OpenStack Engineering also follows this policy, and we'd like to see the Kubernetes team follow it too.

Changed in charm-kubernetes-master:
importance: Undecided → High
Changed in charm-kubernetes-master:
assignee: nobody → Kevin W Monroe (kwmonroe)
George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: Confirmed → Triaged
Changed in charm-kubernetes-master:
assignee: Kevin W Monroe (kwmonroe) → nobody
Eric Chen (eric-chen)
Changed in charm-kubernetes-master:
assignee: nobody → YenHung Chen (eric-chen)
Revision history for this message
Eric Chen (eric-chen) wrote (last edit):

More information provided by Joshua:
https://solutions.qa.canonical.com/bugs/bugs/bug/1845353

* Cannot find a way to reproduce it manually.
* The last time this issue happened was in January 2020, in the QA environment.
* From the logs, it looks like a race condition: master_0 is the leader most of the time, but at one point master_1 acquires leadership and tries to do leader-set; in the meantime master_0 takes leadership back, which causes the exception (see the sketch after the code link below).

related code:
https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/fe4499dfb6a0996ee97228ae75ee5f70ae2c944f/reactive/kubernetes_master.py#L2458
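
To illustrate the race (a hedged sketch only; this is not the charm's actual code, which is at the link above): any pattern that checks leadership and then writes leader settings later in the same hook has a window in which the lease can move.

    from charmhelpers.core import hookenv

    def get_dns_provider():
        dns_provider = 'core-dns'       # derived from config in the real charm
        if hookenv.is_leader():         # time of check: this unit holds the lease
            # ... other work runs here; the lease can expire or move to another unit ...
            hookenv.leader_set(auto_dns_provider=dns_provider)  # time of use: may now fail
        return dns_provider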

Revision history for this message
Eric Chen (eric-chen) wrote (last edit):

According to this discussion thread, we cannot avoid the race condition.
https://bugs.launchpad.net/juju/+bug/1898966

"Leadership lease expiry is managed separately in the controller, so there will always be a window where a hook may start and leadership could be lost part way through."

Therefore, I have decided not to solve the race condition here.
We should keep watching for any bug that triggers unexpected leadership changes.
However, I cannot find the root cause in the logs, so I will stop here and wait for more information.

Revision history for this message
Alvaro Uria (aluria) wrote :

In addition to comment #10, all "leader-set" calls would need to be wrapped in try-except clauses to make sure a sudden change of leadership (after is-leader has been checked) is handled by the charm. However, this approach would reduce charm readability and has been discouraged by the charm maintainers.
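
For completeness, a sketch of the kind of wrapper being discussed and discouraged here (a hypothetical helper, not charm code):

    import subprocess
    from charmhelpers.core import hookenv

    def try_leader_set(**settings):
        """Attempt leader-set, tolerating a leadership change mid-hook."""
        try:
            hookenv.leader_set(settings)
            return True
        except subprocess.CalledProcessError:
            # Leadership was lost between the is-leader check and this call;
            # log and carry on instead of failing the hook.
            hookenv.log('leader-set failed: no longer the leader', hookenv.WARNING)
            return False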

I would suggest the following:
1) CI env: allow unit errors, followed by juju resolve commands, for a limited number of retries (e.g. 3)
2) Mark this bug as "won't fix", given Juju's design, charm simplicity (no try-excepts for this), and the inability to reproduce the issue consistently

Revision history for this message
George Kraft (cynerva) wrote :

Marking this as "won't fix" for the reasons given above. Thanks for looking into this, folks.

Changed in charm-kubernetes-master:
status: Triaged → Won't Fix