unit leadership gets confused

Bug #1656275 reported by John A Meinel
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Tim Penhey

Bug Description

I've been testing the logging output of unit leadership, but it appears we can get into a case where nothing thinks it is the leader.

Specifically, suppose I have 2 units, where unit/1 is the leader and unit/2 is not. If I then bounce all of the agents (for example by running juju upgrade-juju), I see the following when the agents come back up:

unit-ul-2: 14:44:57 DEBUG juju.worker.leadership ul/2 making initial claim for ul leadership
unit-ul-2: 14:44:57 INFO juju.worker.leadership ul leadership for ul/2 denied
unit-ul-2: 14:44:57 DEBUG juju.worker.leadership ul/2 waiting for ul leadership release
unit-ul-1: 14:44:57 DEBUG juju.worker.leadership ul/1 making initial claim for ul leadership

Note that ul/2 clearly got the "I'm not the leader" message, but there is *no* entry for ul/1 saying "I'm the leader". It just makes its initial claim, and that call never seems to return.

If I kill the existing unit/1, I can see in status that unit/2 becomes the leader. However, there is also *no* log entry (from juju debug-log) that shows that the current ul/2 agent becomes aware of that fact. If I add another unit to the application it doesn't seem to notice. But I also see:

unit-ul-4: 14:51:12 DEBUG juju.worker.dependency "leadership-tracker" manifold worker stopped: "migration-inactive-flag" not running: dependency not available

Maybe it's a different bug. I've certainly never touched anything about migration for this model.

Christian Muirhead (2-xtian) wrote :

I *think* this behaviour is caused by the same underlying bug as https://bugs.launchpad.net/juju/+bug/1815397

The claim hangs because the lease manager is trying to shut down, but it can't because the claim handler is trying to send on the errors channel which the main loop is no longer listening on.

It should be fixed by https://github.com/juju/juju/pull/9730

Changed in juju:
status: Triaged → Fix Committed
milestone: none → 2.5.1
assignee: nobody → Tim Penhey (thumper)
John A Meinel (jameinel) wrote :

I think the actual issue here was that the Lease code could fail to update Mongo with who the current leader was, and Status only ever read Mongo.

I believe we have a proposal to have Status interrogate the Raft engine instead of the database, which would fix this.

Changed in juju:
status: Fix Committed → Fix Released