Juju HA fails due to demotion of Machine 0

Bug #1748275 reported by Chris Gregan on 2018-02-08
This bug affects 1 person
Affects    Status    Importance    Assigned to        Milestone
juju                 Critical      Heather Lanigan
2.3                  Critical      Heather Lanigan

Bug Description

Finish of bootstrap:
00:15:56 INFO cmd controller.go:88 Controller machines are in the "controller" model
00:15:56 INFO cmd controller.go:89 Initial model "default" added
00:15:56 INFO cmd supercommand.go:465 command finished

We ran "juju enable-ha -c controller" next, and see this:
adding machines: 1, 2, 3
demoting machines: 0

Juju status:
http://paste.ubuntu.com/26542659/

This demotion of machine 0 is something we have not seen before. Typically we see:
maintaining machines: 0
adding machines: 1, 2

Chris Gregan (cgregan) wrote :
tags: added: foundations-engine
description: updated
tags: added: cpe-onsite
Chris Gregan (cgregan) wrote :

Currently running into this issue at a customer site. Priority bumped to Field Critical.

Tim Penhey (thumper) wrote :

How often is it occurring? Are we able to bootstrap with more logging?

Changed in juju:
status: New → Triaged
importance: Undecided → Critical
Heather Lanigan (hmlanigan) wrote :

Reproduce via:
$ juju bootstrap localhost ha ; juju enable-ha
Creating Juju controller "ha" on localhost/localhost
Looking for packaged Juju agent version 2.3.2 for amd64
To configure your system to better support LXD containers, please see: https://github.com/lxc/lxd/blob/master/doc/production-setup.md
Launching controller instance(s) on localhost/localhost...
 - juju-6e7abd-0 (arch=amd64)
Installing Juju agent on bootstrap instance
Fetching Juju GUI 2.11.3
Waiting for address
Attempting to connect to 10.63.22.100:22
Connected to 10.63.22.100
Running machine configuration script...
Bootstrap agent now started
Contacting Juju controller at 10.63.22.100 to verify accessibility...
Bootstrap complete, "ha" controller now available
Controller machines are in the "controller" model
Initial model "default" added
adding machines: 1, 2, 3
demoting machines: 0

However, when this is split into two separate commands, the time it takes to type juju enable-ha after bootstrap completes is enough to avoid the problem.

Tim Penhey (thumper) on 2018-02-13
Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)

It sounds like a workaround is:
juju bootstrap ; sleep 5 ; juju enable-ha
which should at least address the Field Critical nature.

My guess is that enable ha is reaching us before the Pinger for Machine 0
has clarified that the agent is alive and happy and thus shouldn't be
demoted.

We could add heuristics around this (if there is only one controller, then it
must be the one receiving the request and should never be demoted).

Interestingly we also need something similar in the peer grouper, in that
if it wants to demote the current PRIMARY it must first trigger a step down
and reelection.

We could potentially try to seed the presence of machine 0 specially.

I wonder if this has to do with Presence batching. We now wait up to 1s
before we flush the presence to disk, which could account for it being
missed. I believe we also throttle how often it re-reads from the DB, so if
we hadn't flushed yet when it did a scan, the scan would miss the ping and
wouldn't find it until we allowed another refresh from disk.

John
=:->
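
The race described in the comment above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Juju's presence code: the class name, its methods and the read-throttle interval are hypothetical, and only the "up to 1s before flush" figure comes from the comment. Time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class BatchedPresence:
    """Toy presence store with batched writes and throttled reads.

    Hypothetical model for illustration; not Juju's implementation.
    """
    flush_interval: float = 1.0    # pings are buffered this long before being written
    refresh_interval: float = 1.0  # assumed minimum time between re-reads of the store
    _pending: dict = field(default_factory=dict)  # machine id -> time of buffered ping
    _stored: set = field(default_factory=set)     # pings that have reached the database
    _cached: set = field(default_factory=set)     # the reader's last view of the database
    _last_refresh: float = float("-inf")

    def ping(self, machine: str, now: float) -> None:
        """Record a ping in the write buffer (not yet visible to readers)."""
        self._pending.setdefault(machine, now)

    def _flush(self, now: float) -> None:
        """Write out buffered pings that are at least flush_interval old."""
        for machine, pinged_at in list(self._pending.items()):
            if now - pinged_at >= self.flush_interval:
                self._stored.add(machine)
                del self._pending[machine]

    def alive(self, machine: str, now: float) -> bool:
        """Report liveness from the cached view, re-reading only when allowed."""
        self._flush(now)
        if now - self._last_refresh >= self.refresh_interval:
            self._cached = set(self._stored)
            self._last_refresh = now
        return machine in self._cached
```

In this model, a liveness check that arrives 0.25s after the first ping sees machine 0 as dead, and a second check at 1.0s still does, because the read throttle has not expired even though the ping has been flushed. A first check made 5s after the ping (the "sleep 5" case) sees the machine as alive.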


Changed in juju:
status: Triaged → In Progress
Heather Lanigan (hmlanigan) wrote :

John's guess in #5 is correct. Fixed by not demoting a machine that hasn't pinged recently if it is the machine running the apiserver doing the work and the machine both HasVote() and WantsVote().

https://github.com/juju/juju/pull/8379
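
The rule described in the comment above might be sketched roughly as follows. This is an illustration only, not the code from the pull request: the Controller type, the machines_to_demote function and the serving_id parameter are hypothetical names, and the has_vote and wants_vote fields merely stand in for HasVote() and WantsVote().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Controller:
    machine_id: str
    alive: bool       # presence reports a recent ping
    has_vote: bool    # stands in for HasVote()
    wants_vote: bool  # stands in for WantsVote()


def machines_to_demote(controllers, serving_id):
    """Return ids of voting machines that look dead and may be demoted.

    serving_id is the machine whose apiserver is handling the request
    (hypothetical parameter); that machine is never demoted while it
    both has and wants the vote.
    """
    demote = []
    for c in controllers:
        if c.alive or not c.has_vote:
            continue  # healthy, or not a voter to begin with
        if c.machine_id == serving_id and c.has_vote and c.wants_vote:
            # It is answering this very request, so a missing ping
            # cannot mean the machine is dead.
            continue
        demote.append(c.machine_id)
    return demote
```

With a single controller whose ping has not yet been seen, the function returns an empty list when that controller is the one serving the request, which corresponds to the expected "maintaining machines: 0" outcome.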

Changed in juju:
milestone: none → 2.3.3
Changed in juju:
milestone: 2.3.3 → none
Changed in juju:
milestone: none → 2.4-beta1
status: In Progress → Fix Committed
John A Meinel (jameinel) wrote :

I don't know whether the 8383 PR made it into 2.3.3 but it should at least make it into 2.3.4 for sure, as it has landed in the 2.3 branch.

John A Meinel (jameinel) wrote :

Given that no 2.4 release has shipped *without* this fix, I'm OK saying that, because it is fixed in develop, it can be treated as Fix Released.

Changed in juju:
status: Fix Committed → Fix Released