Juju incorrectly placed a unit onto an existing machine

Bug #1835823 reported by Dmitrii Shcherbakov
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Triaged
Low
Unassigned
MAAS
Invalid
High
Newell Jensen

Bug Description

Re-targeted to Juju (2.6.5) based on https://bugs.launchpad.net/maas/+bug/1835823/comments/2 (see this comment for details)

Old title: "maas reported 409 CONFLICT from "allocate" while the node matching constraints was available"

https://solutions.qa.canonical.com/#/qa/testRun/808239c1-7b06-4281-9267-0f09421604a1
https://oil-jenkins.canonical.com/artifacts/808239c1-7b06-4281-9267-0f09421604a1/index.html (deployment artifacts)

Juju tried 10 times to allocate a machine with tag "prometheus" but MAAS responded with 409s to all requests, so Juju eventually marked the machine as failed. The machine is a pod VM (the host is in zone3). Based on the DB dump it looks like it exists, is ready, has the right tag, and has an interface with access to the right space.

8cf17ecd-809f-4c14-8220-09b6b66bbc74: machine-0 2019-07-05 19:55:24 WARNING juju.worker.provisioner provisioner_task.go:1157 failed to start machine 3 (failed to acquire node: No available machine matches constraints: [('agent_name', ['85f7eddb-f1bb-45fd-8eba-e9e50b532eda']), ('interfaces', ['internal:space=2;nrpe-external-master:space=3;target:space=3;compute-peer:space=3;prometheus-rules:space=3;website:space=3;amqp:space=3;secrets-storage:space=3;neutron-plugin:space=3;nova-ceilometer:space=3;cloud-compute:space=3;ephemeral-backend:space=3;snmp-exporter:space=3;grafana-source:space=3;image-service:space=3;lxd:space=3;scrape:space=3;blackbox-exporter:space=3;cloud-credentials:space=3;ceph-access:space=3;0:space=3;alertmanager-service:space=3;ceph:space=3']), ('tags', ['prometheus']), ('zone', ['zone3'])] (resolved to "interfaces=internal:space=2;nrpe-external-master:space=3;target:space=3;compute-peer:space=3;prometheus-rules:space=3;website:space=3;amqp:space=3;secrets-storage:space=3;neutron-plugin:space=3;nova-ceilometer:space=3;cloud-compute:space=3;ephemeral-backend:space=3;snmp-exporter:space=3;grafana-source:space=3;image-service:space=3;lxd:space=3;scrape:space=3;blackbox-exporter:space=3;cloud-credentials:space=3;ceph-access:space=3;0:space=3;alertmanager-service:space=3;ceph:space=3 tags=prometheus zone=zone3")), retrying in 10s (10 more attempts)

2019-07-05 19:55:23 regiond: [info] 10.246.64.6 POST /MAAS/api/2.0/machines/?op=allocate HTTP/1.1 --> 200 OK (referrer: -; agent: Go-http-client/1.1)
2019-07-05 19:55:24 regiond: [info] 10.246.64.6 POST /MAAS/api/2.0/machines/?op=allocate HTTP/1.1 --> 409 CONFLICT (referrer: -; agent: Go-http-client/1.1)

select x.hostname,x.status,x.zone_id,x.name from (select * from maasserver_node inner join maasserver_zone on maasserver_node.zone_id = maasserver_zone.id) as x where x.hostname LIKE '%prometheus%';
   hostname   | status | zone_id | name
--------------+--------+---------+-------
 prometheus-3 |      4 |       3 | zone3
(1 row)

src/maasserver/enum.py
class NODE_STATUS:
    # ...
    #: The node is ready for named deployment.
    READY = 4

maasdb=# select id from maasserver_node where hostname = 'prometheus-3';
 id
----
 13
(1 row)

maasdb=# select * from maasserver_space;
 id | created | updated | name | description
----+-------------------------------+-------------------------------+--------------------+-----------------------------------------------------
  1 | 2019-07-05 19:02:02.797481+00 | 2019-07-05 19:02:02.797481+00 | ps45_routers | foundation-engine created space: ps45_routers
  2 | 2019-07-05 19:02:03.603296+00 | 2019-07-05 19:02:03.603296+00 | internal-space | foundation-engine created space: internal-space
  3 | 2019-07-05 19:02:05.074262+00 | 2019-07-05 19:02:05.074262+00 | oam-space | foundation-engine created space: oam-space
  4 | 2019-07-05 19:02:06.108567+00 | 2019-07-05 19:02:06.108567+00 | external-space | foundation-engine created space: external-space
  5 | 2019-07-05 19:02:06.922964+00 | 2019-07-05 19:02:06.922964+00 | ceph-replica-space | foundation-engine created space: ceph-replica-space
  6 | 2019-07-05 19:02:07.80897+00 | 2019-07-05 19:02:07.80897+00 | ceph-access-space | foundation-engine created space: ceph-access-space
(6 rows)

maasdb=# select vlan_id from maasserver_interface where node_id = 13;
 vlan_id
---------
    5002
(1 row)

maasdb=# select id,space_id from maasserver_vlan where id = 5002;
  id | space_id
------+----------
 5002 | 3
(1 row)

maasdb=# select node_id,tag_id from maasserver_node inner join maasserver_node_tags on maasserver_node.id = maasserver_node_tags.node_id where maasserver_node.id = 13;
 node_id | tag_id
---------+--------
      13 | 2
      13 | 1
      13 | 9
(3 rows)

maasdb=# select * from maasserver_tag where name like '%prometheus%';
 id | created | updated | name | definition | comment | kernel_opts
----+-------------------------------+-------------------------------+------------+------------+---------+-------------
  9 | 2019-07-05 19:03:58.501961+00 | 2019-07-05 19:03:58.501961+00 | prometheus | | |
(1 row)

description: updated
Changed in maas:
assignee: nobody → Newell Jensen (newell-jensen)
Changed in maas:
status: New → Incomplete
status: Incomplete → Triaged
importance: Undecided → High
Changed in maas:
status: Triaged → In Progress
milestone: none → 2.7.0alpha1
Revision history for this message
Newell Jensen (newell-jensen) wrote :

Dmitrii,

Can you verify that the machine also matches the interface constraints that you are trying to allocate the machine with? From your constraints, interfaces need to match:

'internal:space=2;nrpe-external-master:space=3;target:space=3;compute-peer:space=3;prometheus-rules:space=3;website:space=3;amqp:space=3;secrets-storage:space=3;neutron-plugin:space=3;nova-ceilometer:space=3;cloud-compute:space=3;ephemeral-backend:space=3;snmp-exporter:space=3;grafana-source:space=3;image-service:space=3;lxd:space=3;scrape:space=3;blackbox-exporter:space=3;cloud-credentials:space=3;ceph-access:space=3;0:space=3;alertmanager-service:space=3;ceph:space=3'

Changed in maas:
status: In Progress → Incomplete
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Hmm, indeed. The "prometheus" application (the prometheus2 charm) in the bundle does not even have the "internal" endpoint in its metadata.yaml; many of the endpoints in that constraint actually come from nova-compute-kvm's metadata.yaml.

Based on what I see, this looks like a Juju bug. There were 6 nova-compute-kvm units but only 5 placement directives specified in the bundle. While Juju reported that it would add a new machine for the 6th unit ("add unit nova-compute-kvm/5 to new machine 13"), it actually used machine 3, which is allocated to prometheus.

nova-compute-kvm/5 waiting allocating 3 waiting for machine

prometheus/0 waiting allocating 3 waiting for machine

From the bundle:

variables:
  oam-space: &oam-space oam-space
  internal-space: &internal-space internal-space

machines: # see https://oil-jenkins.canonical.com/artifacts/808239c1-7b06-4281-9267-0f09421604a1/config/config/bundle.yaml
# ...

applications:
# ...

# 6 units but only 5 placement directives, which is a bundle problem, but either way Juju should have caught it
  nova-compute-kvm:
    charm: cs:nova-compute
    num_units: 6
    bindings:
      "": *oam-space
      internal: *internal-space
# ...
    to:
    - 1000
    - 1001
    - 1002
    - 1003
    - 1004

  prometheus:
    charm: cs:prometheus2
    series: bionic
    bindings:
      "": *oam-space
    num_units: 1
    to:
    - 9

Deployment-time messages:

19:54:03 DEBUG juju.cmd.juju.application bundle.go:838 created new machine 6 for holding ceph-mon, ceph-osd, ceph-radosgw, cinder, glance, heat, keystone, mysql, nova-compute-kvm and openstack-service-checks units

19:54:03 DEBUG juju.cmd.juju.application bundle.go:838 created new machine 7 for holding ceph-mon, ceph-osd, ceph-radosgw, cinder, designate-bind, glance, heat, keystone, mysql, nova-compute-kvm and prometheus-ceph-exporter units

19:54:04 DEBUG juju.cmd.juju.application bundle.go:838 created new machine 8 for holding ceph-mon, ceph-osd, ceph-radosgw, cinder, designate-bind, glance, heat, keystone, mysql, nova-compute-kvm and prometheus-openstack-exporter units

19:54:04 DEBUG juju.cmd.juju.application bundle.go:838 created new machine 9 for holding aodh, ceilometer, ceph-osd, designate, gnocchi, neutron-api, nova-cloud-controller, nova-compute-kvm, openstack-dashboard and rabbitmq-server units

19:54:04 DEBUG juju.cmd.juju.application bundle.go:838 created new machine 10 for holding ceph-osd and nova-compute-kvm units

19:54:34 - add unit nova-compute-kvm/0 to new machine 6
19:54:34 DEBUG juju.cmd.juju.application bundle.go:921 added nova-compute-kvm/0 unit to new machine
- add unit nova-compute-kvm/1 to new machine 7
19:54:35 DEBUG juju.cmd.juju.application bundle.go:921 added nova-compute-kvm/1 unit to new machine
- add unit nova-compute-kvm/2 to new machine 8
19:54:35 DEBUG juju.cmd.juju.application bundle.go:921 added nova-compute-kvm/2 unit to new machine
- add unit nova-compute-kvm/3 to new machine 9
- add unit nova-compute-kvm/4 to new machine 10
19:54:36 DEB...


description: updated
summary: - maas reported 409 CONFLICT from "allocate" while the node matching
- constraints was available
+ Juju incorrectly placed a unit onto an existing machine
description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

We will update the bundle template on our end to include one more placement directive for nova-compute-kvm, but Juju should either error out when there are fewer placement directives than num_units, or correctly place the units that have no placement directive onto new machines.
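
For illustration, a sketch of the corrected fragment with one placement entry per unit (the sixth machine label "1005" is hypothetical, standing in for an extra machine definition in the real bundle):

  nova-compute-kvm:
    charm: cs:nova-compute
    num_units: 6
    to:
    - 1000
    - 1001
    - 1002
    - 1003
    - 1004
    - 1005   # hypothetical sixth entry; without it the sixth unit ended up on a machine already allocated to prometheus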

description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Also, I think this bug can be marked as invalid for MAAS based on comment #2.

Changed in maas:
status: Incomplete → Invalid
Alberto Donato (ack)
Changed in maas:
milestone: 2.7.0alpha1 → none
Revision history for this message
Tim Penhey (thumper) wrote :

Is there still a juju issue here?

Changed in juju:
status: New → Incomplete
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Yes, per #3: Juju needs to either error out when there are fewer placement directives than num_units, or correctly place the units that have no placement directive onto new machines.

Changed in juju:
status: Incomplete → New
Revision history for this message
Tim Penhey (thumper) wrote :

Oh, I didn't realise that it was a bundle issue.

This becomes more interesting because of placement directives where you specify an application.

I do wonder if anyone actually relies on the behaviour of "use the last placement directive" for extra units.

There are two primary places where this is actually useful: "lxd:new" and "appname". These mean, respectively, put new units in LXD containers on new machines, and colocate them with another application's units.
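
For example (an illustrative fragment with hypothetical application names, not from this bundle):

  haproxy:
    charm: cs:haproxy
    num_units: 3
    to:
    - lxd:new    # extra units beyond the listed directives also go into LXD containers on new machines
  memcached:
    charm: cs:memcached
    num_units: 3
    to:
    - haproxy    # extra units are colocated with haproxy units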

Changing this would be a breaking change in the current behaviour, not something we want to do in 2.x. Perhaps a more interesting thing would be some validator to run over the bundle where you could specify the strictness.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1835823] Re: Juju incorrectly placed a unit onto an existing machine

It may be that we should not reuse the last placement directive if it was an explicit machine?
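
For instance (a schematic fragment, not from this bug's bundle), reusing the last entry when it is a literal machine is surprising, while reusing a directive is arguably what the bundle author intended:

  some-app:
    num_units: 3
    to:
    - 1003
    - 1004       # explicit machine: silently reusing it for a third unit is surprising
  other-app:
    num_units: 3
    to:
    - lxd:new    # directive: reusing it for extra units is arguably intended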

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot