Juju incorrectly placed a unit onto an existing machine

Bug #1835823 reported by Dmitrii Shcherbakov
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Triaged
Low
Unassigned
MAAS
Invalid
High
Newell Jensen

Bug Description

Re-targeted to Juju (2.6.5) based on https://bugs.launchpad.net/maas/+bug/1835823/comments/2 (see this comment for details)

Old title: "maas reported 409 CONFLICT from "allocate" while the node matching constraints was available"

https://solutions.qa.canonical.com/#/qa/testRun/808239c1-7b06-4281-9267-0f09421604a1
https://oil-jenkins.canonical.com/artifacts/808239c1-7b06-4281-9267-0f09421604a1/index.html (deployment artifacts)

Juju tried 10 times to allocate a machine with tag "prometheus" but MAAS responded with 409s to all requests, so Juju eventually marked the machine as failed. The machine is a pod VM (the host is in zone3). Based on the DB dump it looks like it exists, is ready, has the right tag, and has an interface with access to the right space.

8cf17ecd-809f-4c14-8220-09b6b66bbc74: machine-0 2019-07-05 19:55:24 WARNING juju.worker.provisioner provisioner_task.go:1157 failed to start machine 3 (failed to acquire node: No available machine matches constraints: [('agent_name', ['85f7eddb-f1bb-45fd-8eba-e9e50b532eda']), ('interfaces', ['internal:space=2;nrpe-external-master:space=3;target:space=3;compute-peer:space=3;prometheus-rules:space=3;website:space=3;amqp:space=3;secrets-storage:space=3;neutron-plugin:space=3;nova-ceilometer:space=3;cloud-compute:space=3;ephemeral-backend:space=3;snmp-exporter:space=3;grafana-source:space=3;image-service:space=3;lxd:space=3;scrape:space=3;blackbox-exporter:space=3;cloud-credentials:space=3;ceph-access:space=3;0:space=3;alertmanager-service:space=3;ceph:space=3']), ('tags', ['prometheus']), ('zone', ['zone3'])] (resolved to "interfaces=internal:space=2;nrpe-external-master:space=3;target:space=3;compute-peer:space=3;prometheus-rules:space=3;website:space=3;amqp:space=3;secrets-storage:space=3;neutron-plugin:space=3;nova-ceilometer:space=3;cloud-compute:space=3;ephemeral-backend:space=3;snmp-exporter:space=3;grafana-source:space=3;image-service:space=3;lxd:space=3;scrape:space=3;blackbox-exporter:space=3;cloud-credentials:space=3;ceph-access:space=3;0:space=3;alertmanager-service:space=3;ceph:space=3 tags=prometheus zone=zone3")), retrying in 10s (10 more attempts)

2019-07-05 19:55:23 regiond: [info] 10.246.64.6 POST /MAAS/api/2.0/machines/?op=allocate HTTP/1.1 --> 200 OK (referrer: -; agent: Go-http-client/1.1)
2019-07-05 19:55:24 regiond: [info] 10.246.64.6 POST /MAAS/api/2.0/machines/?op=allocate HTTP/1.1 --> 409 CONFLICT (referrer: -; agent: Go-http-client/1.1)

select x.hostname,x.status,x.zone_id,x.name from (select * from maasserver_node inner join maasserver_zone on maasserver_node.zone_id = maasserver_zone.id) as x where x.hostname LIKE '%prometheus%';
   hostname   | status | zone_id | name
--------------+--------+---------+-------
 prometheus-3 |      4 |       3 | zone3
(1 row)

src/maasserver/enum.py
class NODE_STATUS:
    # ...
    #: The node is ready for named deployment.
    READY = 4

maasdb=# select id from maasserver_node where hostname = 'prometheus-3';
 id
----
 13
(1 row)

maasdb=# select * from maasserver_space;
 id | created | updated | name | description
----+-------------------------------+-------------------------------+--------------------+-----------------------------------------------------
  1 | 2019-07-05 19:02:02.797481+00 | 2019-07-05 19:02:02.797481+00 | ps45_routers | foundation-engine created space: ps45_routers
  2 | 2019-07-05 19:02:03.603296+00 | 2019-07-05 19:02:03.603296+00 | internal-space | foundation-engine created space: internal-space
  3 | 2019-07-05 19:02:05.074262+00 | 2019-07-05 19:02:05.074262+00 | oam-space | foundation-engine created space: oam-space
  4 | 2019-07-05 19:02:06.108567+00 | 2019-07-05 19:02:06.108567+00 | external-space | foundation-engine created space: external-space
  5 | 2019-07-05 19:02:06.922964+00 | 2019-07-05 19:02:06.922964+00 | ceph-replica-space | foundation-engine created space: ceph-replica-space
  6 | 2019-07-05 19:02:07.80897+00 | 2019-07-05 19:02:07.80897+00 | ceph-access-space | foundation-engine created space: ceph-access-space
(6 rows)

maasdb=# select vlan_id from maasserver_interface where node_id = 13;
 vlan_id
---------
    5002
(1 row)

maasdb=# select id,space_id from maasserver_vlan where id = 5002;
  id | space_id
------+----------
 5002 | 3
(1 row)

maasdb=# select node_id,tag_id from maasserver_node inner join maasserver_node_tags on maasserver_node.id = maasserver_node_tags.node_id where maasserver_node.id = 13;
 node_id | tag_id
---------+--------
      13 | 2
      13 | 1
      13 | 9
(3 rows)

maasdb=# select * from maasserver_tag where name like '%prometheus%';
 id | created | updated | name | definition | comment | kernel_opts
----+-------------------------------+-------------------------------+------------+------------+---------+-------------
  9 | 2019-07-05 19:03:58.501961+00 | 2019-07-05 19:03:58.501961+00 | prometheus | | |
(1 row)

description: updated
Changed in maas:
assignee: nobody → Newell Jensen (newell-jensen)
Changed in maas:
status: New → Incomplete
status: Incomplete → Triaged
importance: Undecided → High
Changed in maas:
status: Triaged → In Progress
milestone: none → 2.7.0alpha1
Revision history for this message
Newell Jensen (newell-jensen) wrote :

Dmitrii,

Can you verify that the machine also matches the interface constraints that you are trying to allocate the machine with? From your constraints, interfaces need to match:

'internal:space=2;nrpe-external-master:space=3;target:space=3;compute-peer:space=3;prometheus-rules:space=3;website:space=3;amqp:space=3;secrets-storage:space=3;neutron-plugin:space=3;nova-ceilometer:space=3;cloud-compute:space=3;ephemeral-backend:space=3;snmp-exporter:space=3;grafana-source:space=3;image-service:space=3;lxd:space=3;scrape:space=3;blackbox-exporter:space=3;cloud-credentials:space=3;ceph-access:space=3;0:space=3;alertmanager-service:space=3;ceph:space=3'

Changed in maas:
status: In Progress → Incomplete
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Hmm, indeed. The "prometheus" application (the prometheus2 charm) in the bundle does not even have the "internal" endpoint in its metadata.yaml; many of the endpoints in that constraint actually come from nova-compute-kvm's metadata.yaml.

Based on what I see, this looks like a Juju bug. There were 6 nova-compute-kvm units but only 5 placement directives specified in the bundle. While Juju reported that it would add a new machine for the 6th unit ("add unit nova-compute-kvm/5 to new machine 13"), it actually used machine 3, which is allocated to prometheus.

nova-compute-kvm/5 waiting allocating 3 waiting for machine

prometheus/0 waiting allocating 3 waiting for machine

From the bundle:

variables:
  oam-space: &oam-space oam-space
  internal-space: &internal-space internal-space

machines: # see https://oil-jenkins.canonical.com/artifacts/808239c1-7b06-4281-9267-0f09421604a1/config/config/bundle.yaml
# ...

applications:
# ...

# 6 units but only 5 placement directives, which is a bundle problem, but either way Juju should have caught it
  nova-compute-kvm:
    charm: cs:nova-compute
    num_units: 6
    bindings:
      "": *oam-space
      internal: *internal-space
# ...
    to:
    - 1000
    - 1001
    - 1002
    - 1003
    - 1004

  prometheus:
    charm: cs:prometheus2
    series: bionic
    bindings:
      "": *oam-space
    num_units: 1
    to:
    - 9

Deployment-time messages:

19:54:03 DEBUG juju.cmd.juju.application bundle.go:838 created new machine 6 for holding ceph-mon, ceph-osd, ceph-radosgw, cinder, glance, heat, keystone, mysql, nova-compute-kvm and openstack-service-checks units

19:54:03 DEBUG juju.cmd.juju.application bundle.go:838 created new machine 7 for holding ceph-mon, ceph-osd, ceph-radosgw, cinder, designate-bind, glance, heat, keystone, mysql, nova-compute-kvm and prometheus-ceph-exporter units

19:54:04 DEBUG juju.cmd.juju.application bundle.go:838 created new machine 8 for holding ceph-mon, ceph-osd, ceph-radosgw, cinder, designate-bind, glance, heat, keystone, mysql, nova-compute-kvm and prometheus-openstack-exporter units

19:54:04 DEBUG juju.cmd.juju.application bundle.go:838 created new machine 9 for holding aodh, ceilometer, ceph-osd, designate, gnocchi, neutron-api, nova-cloud-controller, nova-compute-kvm, openstack-dashboard and rabbitmq-server units

19:54:04 DEBUG juju.cmd.juju.application bundle.go:838 created new machine 10 for holding ceph-osd and nova-compute-kvm units

19:54:34 - add unit nova-compute-kvm/0 to new machine 6
19:54:34 DEBUG juju.cmd.juju.application bundle.go:921 added nova-compute-kvm/0 unit to new machine
- add unit nova-compute-kvm/1 to new machine 7
19:54:35 DEBUG juju.cmd.juju.application bundle.go:921 added nova-compute-kvm/1 unit to new machine
- add unit nova-compute-kvm/2 to new machine 8
19:54:35 DEBUG juju.cmd.juju.application bundle.go:921 added nova-compute-kvm/2 unit to new machine
- add unit nova-compute-kvm/3 to new machine 9
- add unit nova-compute-kvm/4 to new machine 10
19:54:36 DEB...


description: updated
summary: - maas reported 409 CONFLICT from "allocate" while the node matching
- constraints was available
+ Juju incorrectly placed a unit onto an existing machine
description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

We will update the bundle template on our end to include one more placement directive for nova-compute-kvm, but Juju should either error out when there are fewer placement directives than num_units, or correctly place the units that have no placement directive onto new machines.
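
For illustration, a sketch of the corrected fragment with one placement entry per unit (the sixth machine label "1005" is hypothetical, standing in for an extra machine definition in the real bundle):

  nova-compute-kvm:
    charm: cs:nova-compute
    num_units: 6
    to:
    - 1000
    - 1001
    - 1002
    - 1003
    - 1004
    - 1005   # hypothetical sixth entry; without it the sixth unit ended up on a machine already allocated to prometheus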

description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Also, I think this bug can be marked as invalid for MAAS based on comment #2.

Changed in maas:
status: Incomplete → Invalid
Alberto Donato (ack)
Changed in maas:
milestone: 2.7.0alpha1 → none
Revision history for this message
Tim Penhey (thumper) wrote :

Is there still a juju issue here?

Changed in juju:
status: New → Incomplete
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Yes, per #3: Juju needs to either error out when there are fewer placement directives than num_units, or correctly place the units that have no placement directive onto new machines.

Changed in juju:
status: Incomplete → New
Revision history for this message
Tim Penhey (thumper) wrote :

Oh, I didn't realise that it was a bundle issue.

This becomes more interesting because of placement directives where you specify an application.

I do wonder if anyone actually relies on the behaviour of "use the last placement directive" for extra units.

There are two primary places where this is actually useful: "lxd:new" and "appname". These mean, respectively, put new units in LXD containers on new machines, and colocate them with another application's units.
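
For example (an illustrative fragment with hypothetical application names, not from this bundle):

  haproxy:
    charm: cs:haproxy
    num_units: 3
    to:
    - lxd:new    # extra units beyond the listed directives also go into LXD containers on new machines
  memcached:
    charm: cs:memcached
    num_units: 3
    to:
    - haproxy    # extra units are colocated with haproxy units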

Changing this would be a breaking change in the current behaviour, not something we want to do in 2.x. Perhaps a more interesting thing would be some validator to run over the bundle where you could specify the strictness.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1835823] Re: Juju incorrectly placed a unit onto an existing machine

It may be that we should not reuse the last placement directive if it was an explicit machine?
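
For instance (a schematic fragment, not from this bug's bundle), reusing the last entry when it is a literal machine is surprising, while reusing a directive is arguably what the bundle author intended:

  some-app:
    num_units: 3
    to:
    - 1003
    - 1004       # explicit machine: silently reusing it for a third unit is surprising
  other-app:
    num_units: 3
    to:
    - lxd:new    # directive: reusing it for extra units is arguably intended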

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot