Running Juju ensure-availability twice in a row adds extra machines

Bug #1384549 reported by Brad Marshall on 2014-10-23
Affects     Status   Importance   Assigned to   Milestone
juju                 Medium       Unassigned
juju-core            Medium       Unassigned

Bug Description

There appears to be a race condition of some kind with juju HA bootstrap and MaaS. After a bootstrap, if you run juju ensure-availability and then re-run it before the new units have settled (i.e. while the units are still in adding-vote), it tries to add another node to use for bootstrap.

It's important to note here that we only have 3 nodes in MaaS tagged with openstack-ha; they're the units we want to use for bootstrap HA.

$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0
adding machines: 1, 2

$ juju status
environment: staging-bootstack
machines:
  "0":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: apollo.maas
    instance-id: /MAAS/api/1.0/nodes/node-4db79f3c-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,bootstrap,openstack-ha
    state-server-member-status: has-vote
  "1":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: ceco.maas
    instance-id: /MAAS/api/1.0/nodes/node-4f52fbca-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: adding-vote
  "2":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: altman.maas
    instance-id: /MAAS/api/1.0/nodes/node-4e866af6-428d-11e4-ae9d-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: adding-vote
services: {}

$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0, 2
adding machines: 3
demoting machines 1

$ juju status
environment: staging-bootstack
machines:
  "0":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: apollo.maas
    instance-id: /MAAS/api/1.0/nodes/node-4db79f3c-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,bootstrap,openstack-ha
    state-server-member-status: has-vote
  "1":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: ceco.maas
    instance-id: /MAAS/api/1.0/nodes/node-4f52fbca-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: removing-vote
  "2":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: altman.maas
    instance-id: /MAAS/api/1.0/nodes/node-4e866af6-428d-11e4-ae9d-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: has-vote
  "3":
    agent-state-info: 'cannot run instances: gomaasapi: got error back from server:
      409 CONFLICT (No matching node is available.)'
    instance-id: pending
    series: trusty
    state-server-member-status: adding-vote
services: {}

$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0, 2
promoting machines 1
demoting machines 3

$ juju status
environment: staging-bootstack
machines:
  "0":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: apollo.maas
    instance-id: /MAAS/api/1.0/nodes/node-4db79f3c-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,bootstrap,openstack-ha
    state-server-member-status: has-vote
  "1":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: ceco.maas
    instance-id: /MAAS/api/1.0/nodes/node-4f52fbca-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: has-vote
  "2":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: altman.maas
    instance-id: /MAAS/api/1.0/nodes/node-4e866af6-428d-11e4-ae9d-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: has-vote
  "3":
    agent-state-info: 'cannot run instances: gomaasapi: got error back from server:
      409 CONFLICT (No matching node is available.)'
    instance-id: pending
    series: trusty
    state-server-member-status: no-vote
services: {}

$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0, 1, 2
removing machines 3

$ juju ensure-availability --constraints "tags=openstack-ha"
$ echo $?

Comparing this to a working situation, where we wait for the nodes to settle:

$ juju ensure-availability --constraints "tags=openstack-ha"
maintaining machines: 0
adding machines: 1, 2

$ juju status
environment: staging-bootstack
machines:
  "0":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: apollo.maas
    instance-id: /MAAS/api/1.0/nodes/node-4db79f3c-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,bootstrap,openstack-ha
    state-server-member-status: has-vote
  "1":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: altman.maas
    instance-id: /MAAS/api/1.0/nodes/node-4e866af6-428d-11e4-ae9d-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: has-vote
  "2":
    agent-state: started
    agent-version: 1.20.9.1
    dns-name: ceco.maas
    instance-id: /MAAS/api/1.0/nodes/node-4f52fbca-428d-11e4-a975-0cc47a01e385/
    series: trusty
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=use-fastpath-installer,openstack-ha
    state-server-member-status: has-vote
services: {}

$ juju ensure-availability --constraints "tags=openstack-ha"
$ juju ensure-availability --constraints "tags=openstack-ha"
$ echo $?
0

This occasionally occurs while doing OpenStack deployments where there are errors, so we end up having to tidy up the machines afterwards.

I'm not exactly sure how to fix this, since we could be in a state where a unit is genuinely stuck in adding-vote and we do actually want to remove it.

Abel Deuring (adeuring) on 2014-10-23
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
tags: added: ha
Curtis Hovey (sinzui) on 2014-10-23
tags: added: canonical-is maas-provider
Changed in juju-core:
milestone: none → next-stable
Curtis Hovey (sinzui) on 2014-12-03
Changed in juju-core:
milestone: 1.21 → 1.22
Changed in juju-core:
milestone: 1.22-alpha1 → 1.23
Ian Booth (wallyworld) on 2015-03-27
Changed in juju-core:
milestone: 1.23 → 1.24-alpha1
Ian Booth (wallyworld) wrote :

We won't be able to fix this for 1.23, so leaving it targeted at 1.24.

no longer affects: juju-core/1.23
Curtis Hovey (sinzui) on 2015-04-27
Changed in juju-core:
milestone: 1.24-alpha1 → 1.25.0
Nate Finch (natefinch) on 2015-04-29
summary: - Juju bootstrap HA mode with MaaS occasionally tries to create extra
- machines
+ Running Juju ensure-availability twice in a row adds extra machines
Nate Finch (natefinch) wrote :

This does not appear to have anything specific to do with MAAS; it is a known issue with ensure-availability, and I changed the bug title to reflect that. The problem is that the second time you run ensure-availability, we look to see if we have a quorum of state servers, see that we don't, and so we try to rectify the situation by adding more. Currently we don't have a way to say "hey, we're still trying to bring up some new servers, so just hold your horses!"
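The decision described above can be sketched roughly as follows. This is illustrative Python only, not juju's actual (Go) implementation; the function name and the dictionary shape are assumptions, with the member-status strings taken from the juju status output in the bug description.

```python
# Illustrative sketch of the ensure-availability decision described above.
# NOT juju's real implementation; names and shapes are hypothetical.

def plan_ensure_availability(machines, desired=3):
    """Decide which machines to keep, demote, and how many to add.

    `machines` maps machine id -> state-server-member-status, using the
    values seen in `juju status` (has-vote, adding-vote, removing-vote).
    """
    # The buggy behaviour: only machines that already have a vote count
    # toward quorum. Machines still in adding-vote look like failures,
    # so a second run requests replacements and demotes the in-progress
    # ones instead of waiting for them to finish coming up.
    voting = [m for m, s in machines.items() if s == "has-vote"]
    pending = [m for m, s in machines.items() if s == "adding-vote"]
    to_add = max(0, desired - len(voting))
    return {"maintain": voting, "demote": pending, "add": to_add}
```

Running this against the mid-transition status above (machine 0 has-vote, machines 1 and 2 adding-vote) shows the symptom: the plan asks for two more machines and demotes the two that are still being added, while a cluster that has settled to three has-vote members produces an empty plan.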

Curtis Hovey (sinzui) wrote :

Juju currently lacks the infrastructure to know whether a machine is still coming up or has failed. We need a few weeks to address this issue, which makes a fix for 1.24 risky.

Curtis Hovey (sinzui) on 2015-05-06
no longer affects: juju-core/1.24
Curtis Hovey (sinzui) on 2015-06-02
tags: added: improvement
Changed in juju-core:
milestone: 1.25.0 → none
importance: High → Medium
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.1.0
Changed in juju-core:
status: Triaged → Won't Fix
Anastasia (anastasia-macmood) wrote :

Since this bug was filed and last commented on, we have become better at knowing when machines are up and whether they have failed. We need to take advantage of this when enabling HA.

However, since a workaround for this exists (wait for the machines to come up before re-running the command), I am lowering the importance to Medium.
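The workaround amounts to polling until every state server reports has-vote before re-running ensure-availability. A minimal sketch, where `get_member_statuses` is a hypothetical callable (in practice it would parse `juju status` output) injected so the loop is self-contained:

```python
import time

def wait_for_votes(get_member_statuses, timeout=600, interval=10,
                   sleep=time.sleep):
    """Poll until every state server reports has-vote.

    `get_member_statuses` is a hypothetical callable returning the list
    of state-server-member-status values for all state servers. Returns
    True once all report has-vote, or False on timeout. Only after this
    returns True is it safe to run `juju ensure-availability` again.
    """
    waited = 0
    while waited <= timeout:
        statuses = get_member_statuses()
        if statuses and all(s == "has-vote" for s in statuses):
            return True
        sleep(interval)
        waited += interval
    return False
```

Injecting the status source and the sleep function keeps the loop testable without a live environment; a real script would shell out to juju status and parse the YAML.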

Changed in juju:
importance: High → Medium
milestone: 2.1.0 → none