enable-ha can end up in an unsolvable state when there is an error during deployment

Bug #1685883 reported by Witold Krecicki
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: Medium
Assigned to: John A Meinel
Milestone: 2.4-beta2

Bug Description

Scenario:
$ juju bootstrap maas
$ juju enable-ha -n 3 --to machine1,machine2

While machine1 is allocated but not yet deployed, kill it in MAAS. (This can also be done later, in which case the machine will show as 'DOWN' rather than 'PENDING'.)

Wait for deployment on machine2 to finish, and then:

$ juju enable-ha -n 3 --to machine1

This demotes machine 1@machine1 (still in the pending/down state) and adds machine 3@machine1.

After all the machines are deployed and the HA cluster is working (machines 0, 2, 3 in 'ha-enabled' state), we have the following situation:

Machine  State    DNS          Inst id  Series  AZ       Message
0        started  10.2.15.254  ct3y86   xenial  default  Deployed
1        pending  10.2.0.3     b73t3a   xenial  default  Deploying: ubuntu/amd64/ga-16.04/xenial/daily/boot-initrd
2        started  10.2.0.4     tg34tr   xenial  default  Deployed
3        started  10.2.0.3     b73t3a   xenial  default  Deployed

That is, two Juju machines (1 and 3) are now set up on one MAAS machine (instance id b73t3a).
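The duplicate can be spotted mechanically by grouping the rows of the machine table by instance id. The following is only a rough sketch: it assumes the whitespace-separated table layout pasted above, and the names `parse_machines` and `duplicate_instances` are hypothetical helpers, not part of Juju.

```python
from collections import defaultdict

# Machine table copied from the report above.
STATUS = """\
Machine State DNS Inst id Series AZ Message
0 started 10.2.15.254 ct3y86 xenial default Deployed
1 pending 10.2.0.3 b73t3a xenial default Deploying: ubuntu/amd64/ga-16.04/xenial/daily/boot-initrd
2 started 10.2.0.4 tg34tr xenial default Deployed
3 started 10.2.0.3 b73t3a xenial default Deployed
"""


def parse_machines(text):
    """Parse the table into dicts (assumes the column order shown above)."""
    rows = []
    for line in text.splitlines():
        # Split into at most 7 fields so the free-form message stays whole.
        fields = line.split(None, 6)
        if len(fields) < 6 or fields[0] == "Machine":
            continue  # skip blank lines and the header row
        machine, state, dns, inst_id, series, az = fields[:6]
        message = fields[6] if len(fields) > 6 else ""
        rows.append({"machine": machine, "state": state, "dns": dns,
                     "inst_id": inst_id, "series": series, "az": az,
                     "message": message})
    return rows


def duplicate_instances(text):
    """Return {inst_id: [machine, ...]} for ids used by more than one machine."""
    by_inst = defaultdict(list)
    for row in parse_machines(text):
        by_inst[row["inst_id"]].append(row["machine"])
    return {inst: ms for inst, ms in by_inst.items() if len(ms) > 1}


print(duplicate_instances(STATUS))  # {'b73t3a': ['1', '3']}
```

For the table above this reports that machines 1 and 3 share instance b73t3a.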

To clean up the HA state:
$ juju enable-ha -n 3
maintaining machines: 0, 2, 3
removing machines: 1

And now we can safely remove machine 1:
$ juju remove-machine 1 --force

According to juju everything is OK:
Machine  State    DNS          Inst id  Series  AZ       Message
0        started  10.2.15.254  ct3y86   xenial  default  Deployed
2        started  10.2.0.4     tg34tr   xenial  default  Deployed
3        started  10.2.0.3     b73t3a   xenial  default  Deployed

In fact, however, b73t3a was decommissioned from MAAS by Juju and is really dead. (Juju doesn't notice this for quite some time; IMHO it should be more robust.)
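The mismatch between what Juju reports and what MAAS actually has can be expressed as a simple cross-check. This is a sketch under stated assumptions: the set of instance ids still alive in MAAS is supplied by hand here (how it would be obtained is left open), and `stale_started` is a hypothetical helper, not a Juju or MAAS API.

```python
# Machine table after "juju remove-machine 1 --force", copied from the report.
STATUS_AFTER_CLEANUP = """\
Machine State DNS Inst id Series AZ Message
0 started 10.2.15.254 ct3y86 xenial default Deployed
2 started 10.2.0.4 tg34tr xenial default Deployed
3 started 10.2.0.3 b73t3a xenial default Deployed
"""

# Assumption: instance ids still deployed in MAAS, entered manually.
# b73t3a is absent because it was released along with machine 1.
LIVE_IN_MAAS = {"ct3y86", "tg34tr"}


def stale_started(status_text, live_instance_ids):
    """List (machine, inst_id) pairs that are 'started' but not alive in MAAS."""
    stale = []
    for line in status_text.splitlines():
        fields = line.split(None, 6)
        if len(fields) < 6 or fields[0] == "Machine":
            continue  # skip blank lines and the header row
        machine, state, _dns, inst_id = fields[:4]
        if state == "started" and inst_id not in live_instance_ids:
            stale.append((machine, inst_id))
    return stale


print(stale_started(STATUS_AFTER_CLEANUP, LIVE_IN_MAAS))  # [('3', 'b73t3a')]
```

With the data above, machine 3 is flagged: Juju lists it as started, but its instance no longer exists.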

I haven't found a way to remove this pending/dead 'doppelganger' machine from the machine list. Maybe a '--no-action' switch for remove-machine is needed?

tags: added: 4010
Witold Krecicki (wpk)
description: updated
Witold Krecicki (wpk) wrote :

This is a general recipe for disaster: add a machine, remove it in MAAS (in the real world that would be, e.g., an HDD failure, where all hardware identifiers stay the same but the machine is 'clean'), and then add it again in Juju. We end up with:

Machine  State    DNS       Inst id  Series  AZ       Message
0        down     10.2.0.7  axp6tp   xenial  default  Deployed
1        started  10.2.0.7  axp6tp   xenial  default  Deployed

Tim Penhey (thumper)
Changed in juju:
status: New → Triaged
importance: Undecided → Medium
tags: added: ha polish remove-machine
Ante Karamatić (ivoks)
tags: added: cpe-onsite
John A Meinel (jameinel) wrote :

This should now be addressed in 2.4-beta1, with "juju remove-machine" being able to directly target a controller.

Changed in juju:
assignee: nobody → John A Meinel (jameinel)
milestone: none → 2.4-beta2
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released