Running Juju ensure-availability twice in a row adds extra machines

Bug #1384549 reported by Brad Marshall on 2014-10-23
Affects     Status   Importance   Assigned to   Milestone
juju                 Medium       Unassigned
juju-core            Medium       Unassigned

Bug Description

There appears to be a race condition of some kind with juju HA bootstrap and MaaS. After a bootstrap, if you run juju ensure-availability and then re-run it before the new units have settled (i.e. while the units are still in adding-vote), it tries to add another node to use for bootstrap.

It's important to note here that we only have 3 nodes in MaaS tagged with openstack-ha; they're the units we want to use for bootstrap HA.

$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0
adding machines: 1, 2

$ juju status
environment: staging-bootstack
machines:
  "0":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: apollo.maas
    instance-id: /MAAS/api/1.0/nodes/node-4db79f3c-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,bootstrap,openstack-ha
    state-server-member-status: has-vote
  "1":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: ceco.maas
    instance-id: /MAAS/api/1.0/nodes/node-4f52fbca-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: adding-vote
  "2":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: altman.maas
    instance-id: /MAAS/api/1.0/nodes/node-4e866af6-428d-11e4-ae9d-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: adding-vote
services: {}

$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0, 2
adding machines: 3
demoting machines 1

$ juju status
environment: staging-bootstack
machines:
  "0":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: apollo.maas
    instance-id: /MAAS/api/1.0/nodes/node-4db79f3c-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,bootstrap,openstack-ha
    state-server-member-status: has-vote
  "1":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: ceco.maas
    instance-id: /MAAS/api/1.0/nodes/node-4f52fbca-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: removing-vote
  "2":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: altman.maas
    instance-id: /MAAS/api/1.0/nodes/node-4e866af6-428d-11e4-ae9d-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: has-vote
  "3":
    agent-state-info: 'cannot run instances: gomaasapi: got error back from server:
      409 CONFLICT (No matching node is available.)'
    instance-id: pending
    series: trusty
    state-server-member-status: adding-vote
services: {}

$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0, 2
promoting machines 1
demoting machines 3

$ juju status
environment: staging-bootstack
machines:
  "0":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: apollo.maas
    instance-id: /MAAS/api/1.0/nodes/node-4db79f3c-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,bootstrap,openstack-ha
    state-server-member-status: has-vote
  "1":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: ceco.maas
    instance-id: /MAAS/api/1.0/nodes/node-4f52fbca-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: has-vote
  "2":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: altman.maas
    instance-id: /MAAS/api/1.0/nodes/node-4e866af6-428d-11e4-ae9d-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: has-vote
  "3":
    agent-state-info: 'cannot run instances: gomaasapi: got error back from server:
      409 CONFLICT (No matching node is available.)'
    instance-id: pending
    series: trusty
    state-server-member-status: no-vote
services: {}

$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0, 1, 2
removing machines 3

$ juju ensure-availability --constraints "tags=openstack-ha"
$ echo $?

Comparing this to a working situation, where we wait for the nodes to settle:

$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0
adding machines: 1, 2

$ juju status
environment: staging-bootstack
machines:
  "0":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: apollo.maas
    instance-id: /MAAS/api/1.0/nodes/node-4db79f3c-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,bootstrap,openstack-ha
    state-server-member-status: has-vote
  "1":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: altman.maas
    instance-id: /MAAS/api/1.0/nodes/node-4e866af6-428d-11e4-ae9d-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: has-vote
  "2":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: ceco.maas
    instance-id: /MAAS/api/1.0/nodes/node-4f52fbca-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: has-vote
services: {}

$ juju ensure-availability --constraints "tags=openstack-ha"
$ juju ensure-availability --constraints "tags=openstack-ha"
$ echo $?
0

This occasionally occurs while doing OpenStack deployments where there are errors, so we end up having to tidy up the machines afterwards.

I'm not exactly sure how to fix this, since we could be in a state where a unit is genuinely stuck in adding-vote and we do actually want to remove it.

Abel Deuring (adeuring) on 2014-10-23
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
tags: added: ha
Curtis Hovey (sinzui) on 2014-10-23
tags: added: canonical-is maas-provider
Changed in juju-core:
milestone: none → next-stable
Curtis Hovey (sinzui) on 2014-12-03
Changed in juju-core:
milestone: 1.21 → 1.22
Changed in juju-core:
milestone: 1.22-alpha1 → 1.23
Ian Booth (wallyworld) on 2015-03-27
Changed in juju-core:
milestone: 1.23 → 1.24-alpha1
Ian Booth (wallyworld) wrote :

We won't be able to fix this for 1.23, so leaving it targeted at 1.24.

no longer affects: juju-core/1.23
Curtis Hovey (sinzui) on 2015-04-27
Changed in juju-core:
milestone: 1.24-alpha1 → 1.25.0
Nate Finch (natefinch) on 2015-04-29
summary: - Juju bootstrap HA mode with MaaS occasionally tries to create extra
- machines
+ Running Juju ensure-availability twice in a row adds extra machines
Nate Finch (natefinch) wrote :

This does not appear to have anything specific to do with MAAS; it is a known issue with ensure-availability, and I changed the bug title to reflect that. The problem is that the second time you run ensure-availability, we look to see if we have a quorum of state servers, see that we don't, and so we try to rectify the situation by adding more. Currently we don't have a way to say "hey, we're still trying to bring up some new servers, so just hold your horses!"
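The decision described above can be sketched roughly as follows. This is illustrative Python only, not juju's actual (Go) implementation; the function name and the dictionary shape are assumptions, with the member-status strings taken from the juju status output in the bug description.

```python
# Illustrative sketch of the ensure-availability decision described above.
# NOT juju's real implementation; names and shapes are hypothetical.

def plan_ensure_availability(machines, desired=3):
    """Decide which machines to keep, demote, and how many to add.

    `machines` maps machine id -> state-server-member-status, using the
    values seen in `juju status` (has-vote, adding-vote, removing-vote).
    """
    # The buggy behaviour: only machines that already have a vote count
    # toward quorum. Machines still in adding-vote look like failures,
    # so a second run requests replacements and demotes the in-progress
    # ones instead of waiting for them to finish coming up.
    voting = [m for m, s in machines.items() if s == "has-vote"]
    pending = [m for m, s in machines.items() if s == "adding-vote"]
    to_add = max(0, desired - len(voting))
    return {"maintain": voting, "demote": pending, "add": to_add}
```

Running this against the mid-transition status above (machine 0 has-vote, machines 1 and 2 adding-vote) shows the symptom: the plan asks for two more machines and demotes the two that are still being added, while a cluster that has settled to three has-vote members produces an empty plan.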

Curtis Hovey (sinzui) wrote :

Juju currently lacks the infrastructure to know whether a machine is still coming up or has failed. We need a few weeks to address this issue, which makes a fix for 1.24 risky.

Curtis Hovey (sinzui) on 2015-05-06
no longer affects: juju-core/1.24
Curtis Hovey (sinzui) on 2015-06-02
tags: added: improvement
Changed in juju-core:
milestone: 1.25.0 → none
importance: High → Medium
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.1.0
Changed in juju-core:
status: Triaged → Won't Fix
Anastasia (anastasia-macmood) wrote :

Since this bug was filed and last commented on, we have become better at knowing when machines are up and whether they have failed. We need to take advantage of this when enabling HA.

However, since a workaround for this exists (wait for the machines to come up before re-running the command), I am lowering the importance to Medium.
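The workaround amounts to polling until every state server reports has-vote before re-running ensure-availability. A minimal sketch, where `get_member_statuses` is a hypothetical callable (in practice it would parse `juju status` output) injected so the loop is self-contained:

```python
import time

def wait_for_votes(get_member_statuses, timeout=600, interval=10,
                   sleep=time.sleep):
    """Poll until every state server reports has-vote.

    `get_member_statuses` is a hypothetical callable returning the list
    of state-server-member-status values for all state servers. Returns
    True once all report has-vote, or False on timeout. Only after this
    returns True is it safe to run `juju ensure-availability` again.
    """
    waited = 0
    while waited <= timeout:
        statuses = get_member_statuses()
        if statuses and all(s == "has-vote" for s in statuses):
            return True
        sleep(interval)
        waited += interval
    return False
```

Injecting the status source and the sleep function keeps the loop testable without a live environment; a real script would shell out to juju status and parse the YAML.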

Changed in juju:
importance: High → Medium
milestone: 2.1.0 → none