juju controllers couldn't elect raft leader

Bug #1942250 reported by Adam Dyess
This bug affects 1 person
Affects: Canonical Juju
Status: Won't Fix
Importance: High
Assigned to: Joseph Phillips
Milestone: none

Bug Description

Version: juju 2.8.6
Story:

Prior to an openstack upgrade, the juju controllers did not agree on the leadership of a mysql application. 2 of the 3 controllers believed one unit was the app leader, and the other controller believed another unit was the app leader.
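
To make the disagreement concrete, here is a minimal sketch that tallies which unit each controller reports as leader and flags the dissenting controller. The controller and unit names are hypothetical and are not taken from the affected deployment; this is not how juju itself resolves leadership.

```python
from collections import Counter


def leader_views(views):
    """Return (majority_leader, dissenting_controllers).

    `views` maps a controller name to the unit that controller
    believes is the application leader.
    """
    counts = Counter(views.values())
    majority_leader, votes = counts.most_common(1)[0]
    # A majority needs strictly more than half of the controllers.
    if votes <= len(views) // 2:
        return None, sorted(views)
    dissenters = sorted(c for c, unit in views.items() if unit != majority_leader)
    return majority_leader, dissenters


# Hypothetical names, for illustration only.
views = {
    "controller-0": "mysql/0",
    "controller-1": "mysql/0",
    "controller-2": "mysql/1",
}
leader, dissenters = leader_views(views)
print(leader, dissenters)
```

In the situation described here, the dissenting controller happened to be the one whose output we acted on.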

Relying on the output of the faulty controller (we did not know about the fault at the time), we paused mysql on the "non-leader" mysql units and ran a prepare series-upgrade to bionic on the leader. The application tried to 'leader-set' something in one of its hooks and failed because, according to two of the controllers, that unit was not the leader.

The juju team asked us to check mongodb, where the replication status on all three units agreed on which unit was the app leader (the one that 2 of the 3 controllers marked with an asterisk in juju status). The juju team first believed leadership was flipping, but `juju status --debug` indicated that the faulty leadership information consistently came from the same controller.

The juju team had us stop the two agreeing controllers, leave running the controller that indicated the leader was the unit being upgraded, cancel the prepare series-upgrade, reset the db to remove the upgradeSeriesLock on that unit, and update the machine status so that the machines appeared not to have started the series upgrade.

Next, the juju team had us try to remove the unit that was stuck trying to finish the prepare series-upgrade and deploy a replacement unit. At this point juju would no longer elect a leader for this application.

Ultimately, the juju team had us stop all three controllers, load a specially prepared raft-log binary at /var/lib/juju/raft/, start the controllers such that one was elected the raft leader first, and then start the other two. The mysql application then had a leader elected and we could continue the prepare series-upgrade.

John A Meinel (jameinel)
Changed in juju:
importance: Undecided → High
milestone: none → 2.9-next
status: New → Triaged
assignee: nobody → Simon Richardson (simonrichardson)
Simon Richardson (simonrichardson) wrote :
Simon Richardson (simonrichardson) wrote :

The previous message details step 1 of a number of steps to allow raft to recover from leadership flapping. The PR in question attempts to implement the pre-vote optimisation described in the raft paper.
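
For readers unfamiliar with pre-vote, the following is a minimal sketch of the idea, not the code in the PR and not any raft library's API. A would-be candidate first asks its peers whether they would vote for it, without incrementing its own term; it only starts a real election if a majority would. All names here (`Node`, `would_grant_pre_vote`, `try_start_election`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    term: int
    last_log_term: int
    last_log_index: int
    heard_from_leader_recently: bool = False

    def would_grant_pre_vote(self, cand_term, cand_last_log_term, cand_last_log_index):
        # Refuse while this node still believes a live leader exists.
        if self.heard_from_leader_recently:
            return False
        # Refuse a candidate proposing a stale term.
        if cand_term < self.term:
            return False
        # The candidate's log must be at least as up to date as ours:
        # compare the last entry's term first, then the log length.
        return (cand_last_log_term, cand_last_log_index) >= (
            self.last_log_term,
            self.last_log_index,
        )


def try_start_election(candidate, peers):
    """Run the pre-vote round; bump the term only if a majority agrees."""
    proposed_term = candidate.term + 1
    grants = 1  # the candidate counts itself
    for peer in peers:
        if peer.would_grant_pre_vote(
            proposed_term, candidate.last_log_term, candidate.last_log_index
        ):
            grants += 1
    cluster_size = len(peers) + 1
    if grants > cluster_size // 2:
        candidate.term = proposed_term  # now a real election may begin
        return True
    return False  # term unchanged, so the cluster is not disrupted


# A node cut off from a healthy cluster keeps timing out, but its peers
# still hear from the leader, so its term never inflates.
isolated = Node("c", term=5, last_log_term=5, last_log_index=10)
healthy_peers = [
    Node("a", 5, 5, 12, heard_from_leader_recently=True),
    Node("b", 5, 5, 12, heard_from_leader_recently=True),
]
started = try_start_election(isolated, healthy_peers)
```

Because the term is only incremented after a successful pre-vote round, a node that rejoins after a partition does not force the current leader to step down, which is one source of leadership flapping.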

The second optimisation, which checks for quorum of a given cluster, will still need to be worked on to ensure that forward progress can be made in certain situations.
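
One common form of a quorum check, sketched below as an assumption about what is meant here, has the leader confirm that it has heard from a majority of the cluster within a timeout, and step down otherwise. The function name and parameters are hypothetical and do not correspond to juju or to any raft library.

```python
def leader_has_quorum(now, last_contact, cluster_size, timeout):
    """Return True if the leader can still reach a majority.

    `last_contact` maps a follower name to the timestamp (in seconds)
    of its last successful reply to the leader.
    """
    # The leader always counts itself as reachable.
    reachable = 1 + sum(1 for t in last_contact.values() if now - t <= timeout)
    return reachable > cluster_size // 2


# Three-node cluster: one follower replied recently, one did not.
ok = leader_has_quorum(
    now=10.0, last_contact={"b": 9.5, "c": 2.0}, cluster_size=3, timeout=1.0
)
# Both followers silent: the leader should step down.
lost = leader_has_quorum(
    now=10.0, last_contact={"b": 2.0, "c": 2.0}, cluster_size=3, timeout=1.0
)
```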

Changed in juju:
status: Triaged → In Progress
Harry Pidcock (hpidcock)
Changed in juju:
assignee: Simon Richardson (simonrichardson) → Joseph Phillips (manadart)
Joseph Phillips (manadart) wrote :

I poked over on the patch, which is still unmerged.

Joseph Phillips (manadart) wrote :

I've marked this as won't fix because we've removed Raft-backed leases for the 3.2 branch.

Changed in juju:
status: In Progress → Won't Fix
milestone: 2.9-next → none