Running Juju ensure-availability twice in a row adds extra machines
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
juju | Triaged | Medium | Unassigned |
juju-core | Won't Fix | Medium | Unassigned |
Bug Description
There appears to be a race condition of some kind with juju HA bootstrap and MaaS. After a bootstrap, if you run juju ensure-availability and then run it again before the newly added machines have settled, Juju demotes one of the new machines and tries to add an extra one.
It's important to note here that we only have 3 nodes in MaaS tagged with openstack-ha; they are the units we want to use for bootstrap HA.
$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0
adding machines: 1, 2
$ juju status
environment: staging-bootstack
machines:
"0":
agent-state: started
agent-version: 1.20.9.1
dns-name: apollo.maas
instance-id: /MAAS/api/
series: trusty
hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-
state-
"1":
agent-state: started
agent-version: 1.20.9.1
dns-name: ceco.maas
instance-id: /MAAS/api/
series: trusty
hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-
state-
"2":
agent-state: started
agent-version: 1.20.9.1
dns-name: altman.maas
instance-id: /MAAS/api/
series: trusty
hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-
state-
services: {}
$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0, 2
adding machines: 3
demoting machines 1
$ juju status
environment: staging-bootstack
machines:
"0":
agent-state: started
agent-version: 1.20.9.1
dns-name: apollo.maas
instance-id: /MAAS/api/
series: trusty
hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-
state-
"1":
agent-state: started
agent-version: 1.20.9.1
dns-name: ceco.maas
instance-id: /MAAS/api/
series: trusty
hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-
state-
"2":
agent-state: started
agent-version: 1.20.9.1
dns-name: altman.maas
instance-id: /MAAS/api/
series: trusty
hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-
state-
"3":
agent-
409 CONFLICT (No matching node is available.)'
instance-id: pending
series: trusty
state-
services: {}
$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0, 2
promoting machines 1
demoting machines 3
$ juju status
environment: staging-bootstack
machines:
"0":
agent-state: started
agent-version: 1.20.9.1
dns-name: apollo.maas
instance-id: /MAAS/api/
series: trusty
hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-
state-
"1":
agent-state: started
agent-version: 1.20.9.1
dns-name: ceco.maas
instance-id: /MAAS/api/
series: trusty
hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-
state-
"2":
agent-state: started
agent-version: 1.20.9.1
dns-name: altman.maas
instance-id: /MAAS/api/
series: trusty
hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-
state-
"3":
agent-
409 CONFLICT (No matching node is available.)'
instance-id: pending
series: trusty
state-
services: {}
$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0, 1, 2
removing machines 3
$ juju ensure-availability --constraints "tags=openstack-ha"
$ echo $?
Comparing this to a working situation, where we wait for the nodes to settle:
$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0
adding machines: 1, 2
$ juju status
environment: staging-bootstack
machines:
"0":
agent-state: started
agent-version: 1.20.9.1
dns-name: apollo.maas
instance-id: /MAAS/api/
series: trusty
hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-
state-
"1":
agent-state: started
agent-version: 1.20.9.1
dns-name: altman.maas
instance-id: /MAAS/api/
series: trusty
hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-
state-
"2":
agent-state: started
agent-version: 1.20.9.1
dns-name: ceco.maas
instance-id: /MAAS/api/
series: trusty
hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-
state-
services: {}
$ juju ensure-availability --constraints "tags=openstack-ha"
$ juju ensure-availability --constraints "tags=openstack-ha"
$ echo $?
0
This occasionally occurs while doing OpenStack deployments where there are errors, so we end up having to tidy up the extra machines afterwards.
I'm not exactly sure how to fix this: I can imagine a state where a unit is stuck in adding-vote and we do actually want to remove it.
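As the second transcript above shows, waiting for the nodes to settle before re-running the command avoids the problem. A minimal polling sketch of that wait (the grep patterns are assumptions based on the status output shown above, not a documented interface):

$ while juju status --format yaml | grep -qE 'instance-id: pending|agent-state: (pending|down)'; do sleep 30; done
$ juju ensure-availability --constraints "tags=openstack-ha"

If a machine is already stuck in a provisioning error such as the 409 CONFLICT above, it can be cleaned up first, e.g. with juju destroy-machine 3 --force, before retrying.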
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
tags: added: ha
tags: added: canonical-is maas-provider

Changed in juju-core:
milestone: none → next-stable

Changed in juju-core:
milestone: 1.21 → 1.22

Changed in juju-core:
milestone: 1.22-alpha1 → 1.23

Changed in juju-core:
milestone: 1.23 → 1.24-alpha1

Changed in juju-core:
milestone: 1.24-alpha1 → 1.25.0

summary:
- Juju bootstrap HA mode with MaaS occasionally tries to create extra machines
+ Running Juju ensure-availability twice in a row adds extra machines
Nate Finch (natefinch) wrote: #2
This does not appear to have anything specific to do with MAAS. This is a known issue with ensure-availability.
Curtis Hovey (sinzui) wrote: #3
Juju currently lacks the infrastructure to know whether a machine is still coming up or has failed. We need a few weeks to address this issue, which makes a fix for 1.24 risky.
no longer affects: juju-core/1.24
tags: added: improvement

Changed in juju-core:
milestone: 1.25.0 → none
importance: High → Medium

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.1.0

Changed in juju-core:
status: Triaged → Won't Fix
Anastasia (anastasia-macmood) wrote: #4
Since this bug was filed and last commented on, we have become better at knowing when machines are up and whether they have failed. We need to take advantage of this when enabling HA.
However, since a workaround for this exists (wait for the machines to come up before re-running the command), I am lowering the importance to Medium.
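For reference, in Juju 2.x the command was renamed, so the wait-then-retry workaround would look roughly like this (a sketch assuming the environment still uses the openstack-ha tag):

$ juju machines    # repeat until every machine reports started
$ juju enable-ha --constraints "tags=openstack-ha"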
Changed in juju:
importance: High → Medium
milestone: 2.1.0 → none
We won't be able to fix this for 1.23, so leaving it targeted at 1.24.