Juju HA - Unable to remove controller machines in 'down' state

Bug #1658033 reported by Junaid Ali
This bug affects 12 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: John A Meinel
Milestone: 2.4-beta2

Bug Description

I'm testing Juju HA and faced issues removing 'controller' machines in a 'down' state. I'm using MAAS as the backend infrastructure provider.

After enabling HA with 3 controllers, I released one controller machine from MAAS. Running 'juju enable-ha' again brought up a new node, but I wasn't able to remove the old machine (whose state shows as 'down') from the controller model. It errors out with 'ERROR no machines were destroyed: machine is required by the model'.
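
Roughly, the sequence was something like this (the cloud/controller names and the machine ID are illustrative, not taken from the actual deployment):

$ juju bootstrap maas ctr-xenial
$ juju enable-ha -n 3
# release one of the controller machines from MAAS, then:
$ juju enable-ha          # brings up a replacement node
$ juju remove-machine 1
ERROR no machines were destroyed: machine is required by the model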

controller-member-status also seems to be quite random. I have to run 'juju enable-ha' again after re-establishing HA so that the correct values are populated.

$ juju status

Model Controller Cloud/Region Version
controller ctr-xenial maas 2.0.2

App Version Status Scale Charm Store Rev OS Notes

Unit Workload Agent Machine Public address Ports Message

Machine State DNS Inst id Series AZ
0 started 172.30.40.39 w666gw xenial default
1 down 172.30.40.46 4dgtc6 xenial default
2 started 172.30.40.251 c7rcwd xenial default
3 started 172.30.40.47 q6wfa3 xenial default

$ juju status --format yaml
http://paste.ubuntu.com/23832571/

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.2.0
Revision history for this message
David (dberardozzi) wrote :

I was facing the same issue on an OpenStack cloud (OVH public cloud), and it seems the cause was that I added a new Neutron network after the initial bootstrap. I assume this caused the controller to be confused about which network it should use for further instance provisioning.

After bootstrapping with --config network=<networkID>, there were no more issues enabling HA.
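
For reference, that bootstrap invocation is something along the lines of (cloud and controller names are placeholders):

$ juju bootstrap --config network=<networkID> <cloud> <controller>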

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@David,

Thank you for the update. I'll mark this bug as Invalid since Juju works as expected with --config network=...

Changed in juju:
status: Triaged → Invalid
milestone: 2.2.0 → none
importance: High → Undecided
Revision history for this message
John A Meinel (jameinel) wrote :

So while the original submitter was able to work around this, it is still true that if machines are added as controllers and then fail to provision, they get stuck: we refuse to let you remove controller machines, but they never come up so you can't take them out of the HA group.
The peergrouper probably needs a way to flag these as no longer important so that we can remove them. (see bug #1648799 or bug #1449633)

Changed in juju:
status: Invalid → Triaged
importance: Undecided → High
tags: added: 4010
tags: added: cpec
Ante Karamatić (ivoks)
tags: added: cpe-onsite
removed: cpec
Revision history for this message
Hans van den Bogert (hbogert) wrote :

Is reproduction still needed? I can get this in a scenario where I have 3 out of 3 machines allocated in MAAS.
Then I run enable-ha. It asks MAAS for two extra nodes, which it will never get. I then run `juju enable-ha --to 1,2`. I get the message that machines 3 and 4 are demoted, but they never get removed.
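
A rough transcript of that sequence (machine numbers as reported in juju status):

$ juju enable-ha            # asks MAAS for two extra nodes that are never allocated
$ juju enable-ha --to 1,2   # reports machines 3 and 4 as demoted, but they are never removed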

Tim Penhey (thumper)
tags: added: ensure-availability
Tim Penhey (thumper)
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
status: Triaged → In Progress
milestone: none → 2.4-beta1
Revision history for this message
John A Meinel (jameinel) wrote :

earlier patch that implemented initial support for "juju remove-machine"
https://github.com/juju/juju/pull/8557

Changed in juju:
milestone: 2.4-beta1 → none
John A Meinel (jameinel)
Changed in juju:
milestone: none → 2.4-beta2
John A Meinel (jameinel)
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
John A Meinel (jameinel) wrote :

In 2.4 beta 2 (and probably beta1), I just tested:
 juju bootstrap lxd
 juju enable-ha
 lxc stop juju-XXXXX-1
 juju status
 juju enable-ha # does nothing
 juju remove-machine 1 # properly removes machine 1 from being listed as a controller
 juju enable-ha # creates a 3rd machine

Just like all machines, if the machine agent isn't running, then the machine will be 'stuck' in a 'down' state:

$ juju status
Model Controller Cloud/Region Version SLA
controller lxd lxd 2.4-beta2.1 unsupported

App Version Status Scale Charm Store Rev OS Notes

Unit Workload Agent Machine Public address Ports Message

Machine State DNS Inst id Series AZ Message
0 started 10.16.17.214 juju-10a670-0 xenial Running
1 down 10.16.17.89 juju-10a670-1 xenial Stopped
2 started 10.16.17.189 juju-10a670-2 xenial Running
3 started 10.16.17.42 juju-10a670-3 xenial Running

But it is removed as a controller:
$ juju show-controller
lxd:
...
  controller-machines:
    "0":
      instance-id: juju-10a670-0
      ha-status: ha-enabled
    "2":
      instance-id: juju-10a670-2
      ha-status: ha-enabled
    "3":
      instance-id: juju-10a670-3
      ha-status: ha-enabled

To actually purge the machine (like you need to do for normal application machines), you can use "juju remove-machine --force":
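For the 'down' machine 1 above, that is roughly:

$ juju remove-machine --force 1

and after the cleanup runs it no longer appears: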
$ juju status
Model Controller Cloud/Region Version SLA
controller lxd lxd 2.4-beta2.1 unsupported

App Version Status Scale Charm Store Rev OS Notes

Unit Workload Agent Machine Public address Ports Message

Machine State DNS Inst id Series AZ Message
0 started 10.16.17.214 juju-10a670-0 xenial Running
2 started 10.16.17.189 juju-10a670-2 xenial Running
3 started 10.16.17.42 juju-10a670-3 xenial Running

(Note that remove-machine --force triggers a 'cleanup' action, so it does not immediately remove the machine; an immediate 'status' will still show it, but after a few seconds it goes away.)

Changed in juju:
status: Fix Committed → Fix Released