Juju HA - Unable to remove controller machines in 'down' state

Bug #1658033 reported by Junaid Ali
This bug affects 12 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: John A Meinel
Milestone: 2.4-beta2

Bug Description

I'm testing Juju HA and faced issues removing 'controller' machines in a 'down' state. I'm using MAAS as the backend infrastructure provider.

After enabling HA with 3 controllers, I released one controller machine from MAAS. Running 'juju enable-ha' again brought up a new node, but I wasn't able to remove the old machine (whose state shows as 'down') from the controller model. It errors out with 'ERROR no machines were destroyed: machine is required by the model'.
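
Roughly, the sequence was something like this (the cloud/controller names and the machine ID are illustrative, not taken from the actual deployment):

$ juju bootstrap maas ctr-xenial
$ juju enable-ha -n 3
# release one of the controller machines from MAAS, then:
$ juju enable-ha          # brings up a replacement node
$ juju remove-machine 1
ERROR no machines were destroyed: machine is required by the model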

controller-member-status also seems to be quite random. I have to run 'juju enable-ha' again after re-establishing HA so that the correct values are populated.

$ juju status

Model Controller Cloud/Region Version
controller ctr-xenial maas 2.0.2

App Version Status Scale Charm Store Rev OS Notes

Unit Workload Agent Machine Public address Ports Message

Machine State DNS Inst id Series AZ
0 started 172.30.40.39 w666gw xenial default
1 down 172.30.40.46 4dgtc6 xenial default
2 started 172.30.40.251 c7rcwd xenial default
3 started 172.30.40.47 q6wfa3 xenial default

$ juju status --format yaml
http://paste.ubuntu.com/23832571/

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.2.0
Revision history for this message
David (dberardozzi) wrote :

I was facing the same issue on an OpenStack cloud (OVH public cloud), and it seems the cause was that I added a new Neutron network after the initial bootstrap. I assume this caused the controller to be confused about which network it should use for further instance provisioning.

After bootstrapping with --config network=<networkID>, there were no more issues enabling HA.
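
For reference, that bootstrap invocation is something along the lines of (cloud and controller names are placeholders):

$ juju bootstrap --config network=<networkID> <cloud> <controller>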

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@David,

Thank you for the update. I'll mark this bug as Invalid since Juju works as expected with --config network=...

Changed in juju:
status: Triaged → Invalid
milestone: 2.2.0 → none
importance: High → Undecided
Revision history for this message
John A Meinel (jameinel) wrote :

So while the original submitter was able to work around this, it is still true that if machines are added as controllers and then fail to provision, they get stuck: we refuse to let you remove controller machines, but they never come up so you can't take them out of the HA group.
The peergrouper probably needs a way to flag these as no longer important so that we can remove them. (see bug #1648799 or bug #1449633)

Changed in juju:
status: Invalid → Triaged
importance: Undecided → High
tags: added: 4010
tags: added: cpec
Ante Karamatić (ivoks)
tags: added: cpe-onsite
removed: cpec
Revision history for this message
Hans van den Bogert (hbogert) wrote :

Is reproduction still needed? I can get this in a scenario where I have 3 out of 3 machines allocated in MAAS.
Then I run enable-ha. It asks MAAS for two extra nodes, which it will never get. I then run `juju enable-ha --to 1,2`. I get the message that machines 3 and 4 are demoted, but they never get removed.
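
A rough transcript of that sequence (machine numbers as reported in juju status):

$ juju enable-ha            # asks MAAS for two extra nodes that are never allocated
$ juju enable-ha --to 1,2   # reports machines 3 and 4 as demoted, but they are never removed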

Tim Penhey (thumper)
tags: added: ensure-availability
Tim Penhey (thumper)
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
status: Triaged → In Progress
milestone: none → 2.4-beta1
Revision history for this message
John A Meinel (jameinel) wrote :

earlier patch that implemented initial support for "juju remove-machine"
https://github.com/juju/juju/pull/8557

Changed in juju:
milestone: 2.4-beta1 → none
John A Meinel (jameinel)
Changed in juju:
milestone: none → 2.4-beta2
John A Meinel (jameinel)
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
John A Meinel (jameinel) wrote :

In 2.4 beta 2 (and probably beta1), I just tested:
 juju bootstrap lxd
 juju enable-ha
 lxc stop juju-XXXXX-1
 juju status
 juju enable-ha # does nothing
 juju remove-machine 1 # properly removes machine 1 from being listed as a controller
 juju enable-ha # creates a 3rd machine

Just like all machines, if the machine agent isn't running, then the machine will be 'stuck' in a 'down' state:

$ juju status
Model Controller Cloud/Region Version SLA
controller lxd lxd 2.4-beta2.1 unsupported

App Version Status Scale Charm Store Rev OS Notes

Unit Workload Agent Machine Public address Ports Message

Machine State DNS Inst id Series AZ Message
0 started 10.16.17.214 juju-10a670-0 xenial Running
1 down 10.16.17.89 juju-10a670-1 xenial Stopped
2 started 10.16.17.189 juju-10a670-2 xenial Running
3 started 10.16.17.42 juju-10a670-3 xenial Running

But it is removed as a controller:
$ juju show-controller
lxd:
...
  controller-machines:
    "0":
      instance-id: juju-10a670-0
      ha-status: ha-enabled
    "2":
      instance-id: juju-10a670-2
      ha-status: ha-enabled
    "3":
      instance-id: juju-10a670-3
      ha-status: ha-enabled

To actually purge the machine (like you need to do for normal application machines), you can use "juju remove-machine --force":
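For the 'down' machine 1 above, that is roughly:

$ juju remove-machine --force 1

and after the cleanup runs it no longer appears: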
$ juju status
Model Controller Cloud/Region Version SLA
controller lxd lxd 2.4-beta2.1 unsupported

App Version Status Scale Charm Store Rev OS Notes

Unit Workload Agent Machine Public address Ports Message

Machine State DNS Inst id Series AZ Message
0 started 10.16.17.214 juju-10a670-0 xenial Running
2 started 10.16.17.189 juju-10a670-2 xenial Running
3 started 10.16.17.42 juju-10a670-3 xenial Running

(Note that remove-machine --force triggers a 'cleanup' action, so it does not immediately remove the machine; an immediate 'status' will still show it, but after a few seconds it goes away.)

Changed in juju:
status: Fix Committed → Fix Released