OVB overcloud deploy fails on nova placement errors

Bug #1787910 reported by Sagi (Sergey) Shnaidman on 2018-08-20
Affects                    Importance   Assigned to
OpenStack Compute (nova)   High         Vladyslav Drok
nova (Rocky series)        High         Matt Riedemann
tripleo                    Critical     Marios Andreou

Bug Description

https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/1544941/logs/undercloud/var/log/extra/errors.txt.gz#_2018-08-20_01_49_09_830

https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/1544941/logs/undercloud/var/log/extra/docker/containers/nova_placement/log/nova/nova-compute.log.txt.gz?level=ERROR#_2018-08-20_01_49_09_830

ERROR nova.scheduler.client.report [req-a8752223-5d75-4fa2-9668-7c024d166f09 - - - - -] [req-561538c7-b837-448b-b25e-38a3505ab2e5] Failed to update inventory to [{u'CUSTOM_BAREMETAL': {'allocation_ratio': 1.0, 'total': 1, 'reserved': 1, 'step_size': 1, 'min_unit': 1, 'max_unit': 1}}] for resource provider with UUID 3ee26a05-944b-42ba-b74d-42aa2fda5d73. Got 400: {"errors": [{"status": 400, "request_id": "req-561538c7-b837-448b-b25e-38a3505ab2e5", "detail": "The server could not comply with the request since it is either malformed or otherwise incorrect.\n\n Unable to update inventory for resource provider 3ee26a05-944b-42ba-b74d-42aa2fda5d73: Invalid inventory for 'CUSTOM_BAREMETAL' on resource provider '3ee26a05-944b-42ba-b74d-42aa2fda5d73'. The reserved value is greater than or equal to total. ", "title": "Bad Request"}]}

ERROR nova.compute.manager [req-a8752223-5d75-4fa2-9668-7c024d166f09 - - - - -] Error updating resources for node 3ee26a05-944b-42ba-b74d-42aa2fda5d73.: ResourceProviderSyncFailed: Failed to synchronize the placement service with resource provider information supplied by the compute host.

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7722, in _update_available_resource_for_node
    rt.update_available_resource(context, nodename)
  File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 703, in update_available_resource
    self._update_available_resource(context, resources)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 274, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 726, in _update_available_resource
    self._init_compute_node(context, resources)
  File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 593, in _init_compute_node
    self._update(context, cn)
  File "/usr/lib/python2.7/site-packages/retrying.py", line 68, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/lib/python2.7/site-packages/retrying.py", line 223, in call
    return attempt.get(self._wrap_exception)
  File "/usr/lib/python2.7/site-packages/retrying.py", line 261, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python2.7/site-packages/retrying.py", line 217, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 938, in _update
    self._update_to_placement(context, compute_node)
  File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 907, in _update_to_placement
    reportclient.update_from_provider_tree(context, prov_tree)
  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
    return getattr(self.instance, __name)(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 1531, in update_from_provider_tree
    raise exception.ResourceProviderSyncFailed()
ResourceProviderSyncFailed: Failed to synchronize the placement service with resource provider information supplied by the compute host.
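
The 400 above comes from placement's inventory validation: before microversion 1.26 an inventory with reserved greater than or equal to total was rejected, and 1.26 relaxed this to allow reserved equal to total. A minimal sketch of the pre-1.26 check (the helper name is hypothetical, not placement's actual code):

```python
def validate_inventory(inventory):
    """Reject inventory records where reserved >= total, as pre-1.26 placement does."""
    bad = [rc for rc, inv in inventory.items() if inv["reserved"] >= inv["total"]]
    if bad:
        raise ValueError(
            "The reserved value is greater than or equal to total for: %s"
            % ", ".join(bad))
    return True

# The payload from the failing request: total=1 and reserved=1, so it is rejected.
payload = {"CUSTOM_BAREMETAL": {
    "allocation_ratio": 1.0, "total": 1, "reserved": 1,
    "step_size": 1, "min_unit": 1, "max_unit": 1}}
try:
    validate_inventory(payload)
except ValueError as e:
    print("400 Bad Request:", e)
```

An ironic node that is unavailable is reported with total=1 and reserved=1, which is exactly the case the relaxed 1.26 rule was added for.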

Marios Andreou (marios-b) wrote :

the only place I can find a reference to CUSTOM_BAREMETAL is in this [1] undercloud post-config bash script, which is called for the heat-deployed undercloud. However, it's the overcloud deploy that fails with "ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500" like [2] (and the same for all the node types, as you can see in [2]).

[1] https://github.com/openstack/tripleo-heat-templates/blob/46ef07433632ae5639df2e53e181ec0abd0d7b34/extraconfig/post_deploy/undercloud_post.sh#L78
[2] https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/1544941/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-08-20_02_03_49

Dmitry Tantsur (divius) wrote :

For the record: the same problem broke ironic-inspector CI upstream. Rocky does not seem affected, nor does ironic itself.

Vladyslav Drok (vdrok) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/593628

Changed in nova:
assignee: nobody → Vladyslav Drok (vdrok)
status: New → In Progress
Changed in nova:
assignee: Vladyslav Drok (vdrok) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2018-08-20
Changed in nova:
assignee: Matt Riedemann (mriedem) → Vladyslav Drok (vdrok)

Reviewed: https://review.openstack.org/593628
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=55fb7efe3110e26a993be291cd2cfac1df8c4679
Submitter: Zuul
Branch: master

commit 55fb7efe3110e26a993be291cd2cfac1df8c4679
Author: Vladyslav Drok <email address hidden>
Date: Mon Aug 20 16:57:56 2018 +0300

    Use placement microversion 1.26 in update_from_provider_tree

    Recent change I1fd85860c96e8690fbcf93c8a2f02178168bfd5a changed the
    microversion for updating the inventory only in the
    _update_inventory_attempt, missing _set_inventory_for_provider
    which is called from update_from_provider_tree.
    It causes failures with ironic virt driver.

    Closes-Bug: 1787910
    Change-Id: Ibdebd02ce6f52ca87559e9d2d5c068f37bf4b6db
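
The fix works because placement microversion 1.26 allows an inventory whose reserved value equals total, so the report client has to send that microversion on the inventory update. A hedged sketch of the request shape (plain header construction, not nova's actual report client code):

```python
# Placement selects per-request behavior via the OpenStack-API-Version header.
# Microversion 1.26 accepts reserved == total, which the base validation
# rejects with a 400.
def placement_headers(microversion="1.26"):
    return {
        "OpenStack-API-Version": "placement %s" % microversion,
        "Content-Type": "application/json",
    }

body = {
    "resource_provider_generation": 0,
    "inventories": {
        "CUSTOM_BAREMETAL": {"total": 1, "reserved": 1, "min_unit": 1,
                             "max_unit": 1, "step_size": 1,
                             "allocation_ratio": 1.0},
    },
}
# A PUT to /resource_providers/{uuid}/inventories with these headers would
# now succeed; without the microversion header the service 400s as above.
print(placement_headers()["OpenStack-API-Version"])
```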

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/593678
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aebb29ed710fb4b5b92a6efa99f1d3dbdcc48a0c
Submitter: Zuul
Branch: stable/rocky

commit aebb29ed710fb4b5b92a6efa99f1d3dbdcc48a0c
Author: Vladyslav Drok <email address hidden>
Date: Mon Aug 20 16:57:56 2018 +0300

    Use placement microversion 1.26 in update_from_provider_tree

    Recent change I1fd85860c96e8690fbcf93c8a2f02178168bfd5a changed the
    microversion for updating the inventory only in the
    _update_inventory_attempt, missing _set_inventory_for_provider
    which is called from update_from_provider_tree.
    It causes failures with ironic virt driver.

    Closes-Bug: 1787910
    Change-Id: Ibdebd02ce6f52ca87559e9d2d5c068f37bf4b6db
    (cherry picked from commit 55fb7efe3110e26a993be291cd2cfac1df8c4679)

This issue was fixed in the openstack/nova 18.0.0.0rc2 release candidate.

Marios Andreou (marios-b) wrote :

o/ yatin I suspect it is because we don't have the fix in the job yet... looking at one of the ones you posted, nova is like:

 https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/92ab44e/logs/undercloud/etc/yum.repos.d/delorean.repo.txt.gz

 so openstack-nova-18.0.0-0.20180820195340.722d5b4.el7.noarch.rpm

 @ http://mirror.regionone.rdo-cloud-tripleo.rdoproject.org:8080/rdo/centos7/f6/c7/f6c761d2138824849ca3036e49210e7649978ebb_cda88c99/

deducing from the filename, 20180820 means it won't include the fix from the 21st

I checked current-tripleo has openstack-nova-18.0.0-0.20180813044524.16f89fd.el7.noarch.rpm [1] and current [2] has nova-18.0.0-0.20180822065816.17b6957.el7.noarch.rpm which is probably the one we want (or newer)

[1] https://trunk.rdoproject.org/centos7-master/current-tripleo/
[2] https://trunk.rdoproject.org/centos7-master/current/

yatin (yatinkarel) wrote :

So I checked further and found https://review.openstack.org/#/c/565841 to be the possible cause. To confirm this I hacked nova/virt/ironic/driver a little and the overcloud deployment completed.
So I can say that either this patch missed something, or something needs to be fixed elsewhere (resource classes are not used correctly, as the commit message says).

From the logs (no valid host during overcloud deployment):
baremetal nodes are created using --resource-class baremetal, and the flavors have the resources:CUSTOM_BAREMETAL='1' property

1) Filter RamFilter returned 0 hosts
https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/9a25ebd/logs/undercloud/var/log/containers/nova/nova-scheduler.log.txt.gz#_2018-08-21_13_27_59_434

2) Hypervisor/Node resource view: name=8c577f07-b040-4b28-9dc0-40e22524a366 free_ram=0MB free_disk=0GB free_vcpus=unknown pci_devices=None
https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/9a25ebd/logs/undercloud/var/log/containers/nova/nova-compute.log.txt.gz#_2018-08-21_13_23_07_721

phys_ram=0MB used_ram=0MB phys_disk=0GB used_disk=0GB total_vcpus=0 used_vcpus=0 pci_stats=[]
https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/9a25ebd/logs/undercloud/var/log/containers/nova/nova-compute.log.txt.gz#_2018-08-21_13_23_07_929

Can someone from nova/ironic look at it?

Seems like the flavors in these jobs still require some amount of RAM - they should only require the resource class.

See the "openstack flavor set --property resources:MEMORY_MB=0" bits here: https://docs.openstack.org/ironic/latest/install/configure-nova-flavors.html
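
A rough sketch of how a resources:<CLASS>=<N> extra spec overrides the standard amounts when nova builds the placement request for a flavor (heavily simplified from nova/scheduler/utils.py; the function name and dict shapes here are illustrative, not nova's actual API):

```python
def resources_from_flavor(flavor, extra_specs):
    """Build the placement resource request for a flavor.

    A resources:<CLASS>=<N> extra spec overrides the standard amount for
    that class; an override of 0 drops the class from the request entirely.
    """
    resources = {
        "VCPU": flavor["vcpus"],
        "MEMORY_MB": flavor["ram"],
        "DISK_GB": flavor["disk"],
    }
    for key, value in extra_specs.items():
        if key.startswith("resources:"):
            resources[key[len("resources:"):]] = int(value)
    # Zero means "do not request this class", not "request zero of it".
    return {rc: amt for rc, amt in resources.items() if amt > 0}

# An ironic flavor zeroing everything except the custom resource class:
flavor = {"vcpus": 1, "ram": 4096, "disk": 40}
specs = {"resources:VCPU": "0", "resources:MEMORY_MB": "0",
         "resources:DISK_GB": "0", "resources:CUSTOM_BAREMETAL": "1"}
print(resources_from_flavor(flavor, specs))  # {'CUSTOM_BAREMETAL': 1}
```

The catch in this bug is that the Core/Ram/Disk filters read the flavor's vcpus/ram/disk fields directly, so these placement-side overrides do not stop the filters from rejecting a node.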

Matt Riedemann (mriedem) wrote :

The flavor override code in nova is here:

https://github.com/openstack/nova/blob/cc436c2b2a2dad974c4d28871851a456ebd80e48/nova/scheduler/utils.py#L220

Which was changed from this:

https://review.openstack.org/#/c/515223/12/nova/scheduler/utils.py@a224

It's possible that regressed some override functionality - it's not easy to parse what this is doing without looking at tests, and I'm not sure we have any functional tests in nova that rely on this behavior (we definitely should if we don't).

yatin (yatinkarel) wrote :

So we have the flavor created with these properties: capabilities:boot_option='local', capabilities:profile='control', resources:CUSTOM_BAREMETAL='1', resources:DISK_GB='0', resources:MEMORY_MB='0', resources:VCPU='0', which is correct; jroll confirmed it.

As matt said in https://bugs.launchpad.net/tripleo/+bug/1787910/comments/15, there can be a regression bug in nova with ironic driver.

Also suggested: for driver=filter_scheduler (which is used in the undercloud) there is no need to use CoreFilter, RamFilter and DiskFilter. These filters are required at least with the caching_scheduler.
I tried without these filters and the overcloud deployment completed.

The possible reason this is seen in tripleo jobs but not in ironic devstack jobs after https://review.openstack.org/#/c/565841 is that these filters are not enabled there. As a quick check, it can be reproduced in the ironic jobs by enabling these filters.

Changed in tripleo:
milestone: rocky-rc1 → rocky-rc2
Matt Riedemann (mriedem) wrote :

So tl;dr https://review.openstack.org/#/c/565841 meant that ironic compute nodes no longer report vcpu/ram/disk to the scheduler so the core/ram/disk filters thought the node (via the HostState object) didn't have any inventory and filtered out that host. We definitely need to update the documentation on the nova core/ram/disk filters to mention they are no longer needed when using the filter_scheduler driver.

https://docs.openstack.org/nova/latest/admin/configuration/schedulers.html#ramfilter

We should probably go so far as to deprecate them since only the CachingScheduler is using them in-tree and still relies on them (because CachingScheduler doesn't use placement) and the CachingScheduler itself is deprecated.
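
The failure mode Matt describes can be sketched in a few lines. This is a simplified stand-in for RamFilter's host_passes check, not nova's actual class; it shows why a node that reports zero RAM inventory fails for any flavor whose ram field is positive:

```python
def ram_filter_passes(free_ram_mb, total_ram_mb, flavor_ram_mb, ratio=1.0):
    """Simplified RamFilter check. Note the filter consumes the flavor's
    plain ram value; a resources:MEMORY_MB=0 override does not reach it."""
    oversubscribed_limit = total_ram_mb * ratio
    used_ram_mb = total_ram_mb - free_ram_mb
    return flavor_ram_mb <= oversubscribed_limit - used_ram_mb

# Once the ironic driver stops reporting RAM, nodes show total=0 and free=0,
# so any flavor with ram > 0 is rejected: "Filter RamFilter returned 0 hosts".
print(ram_filter_passes(free_ram_mb=0, total_ram_mb=0, flavor_ram_mb=4096))
```

With the filter_scheduler driver, placement already enforces VCPU, MEMORY_MB and DISK_GB, so dropping the filter loses nothing.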

wes hayutin (weshayutin) wrote :

https://review.openstack.org/596610 <--- is required to run the ovb ooo reproduce tooling.

Reviewed: https://review.openstack.org/596502
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=243ba8513097ca715af738422a9e1249c6a7a1f0
Submitter: Zuul
Branch: master

commit 243ba8513097ca715af738422a9e1249c6a7a1f0
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 24 19:10:08 2018 -0400

    Deprecate Core/Ram/DiskFilter

    The time has come.

    These filters haven't been necessary since Ocata [1]
    when the filter scheduler started using placement
    to filter on VCPU, DISK_GB and MEMORY_MB. The
    only reason to use them with any in-tree scheduler
    drivers is if using the CachingScheduler which doesn't
    use placement, but the CachingScheduler itself has
    been deprecated since Pike [2]. Furthermore, as of
    change [3] in Stein, the ironic driver no longer
    reports vcpu/ram/disk inventory for ironic nodes
    which will make these filters filter out ironic nodes
    thinking they don't have any inventory. Also, as
    noted in [4], the DiskFilter does not account for
    volume-backed instances and may incorrectly filter
    out a host based on disk inventory when it would
    otherwise be OK if the instance is not using local
    disk.

    The related aggregate filters are left intact for
    now, see blueprint placement-aggregate-allocation-ratios.

    [1] Ie12acb76ec5affba536c3c45fbb6de35d64aea1b
    [2] Ia7ff98ff28b7265058845e46b277317a2bfc96d2
    [3] If2b8c1a76d7dbabbac7bb359c9e572cfed510800
    [4] I9c2111f7377df65c1fc3c72323f85483b3295989

    Change-Id: Id62136d293da55e4bb639635ea5421a33b6c3ea2
    Related-Bug: #1787910

Reviewed: https://review.openstack.org/596093
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=04b235652b44701b8703f63aee10fac6fad13ced
Submitter: Zuul
Branch: master

commit 04b235652b44701b8703f63aee10fac6fad13ced
Author: yatin <email address hidden>
Date: Fri Aug 24 11:36:25 2018 +0530

    Do not enable Ram/Disk Filter with filter_scheduler

    Core/Ram/Disk Filters are not required when using filter_scheduler.

    After https://review.openstack.org/#/c/565841 when using these
    Filters nova is not scheduling to the ironic nodes and overcloud
    deployment fails.
    For now just testing the undercloud, good to see what scheduler/filters
    are being enabled in overcloud and reflect there as well.

    Related-Bug: #1787910
    Depends-On: Ia82f1c6be0d5504498e77a90268cad8abecdeae2
    Change-Id: I0e376d99adeaa318118833018be81491c6b14095

Jiří Stránský (jistr) wrote :

FWIW i hit this issue locally, haven't been able to work around it yet. Yatin's patch is present in my environment, yet nova-placement still prints this:

2018-09-05 13:53:13.940 17 DEBUG nova.api.openstack.placement.objects.resource_provider [req-6e4e2426-dc02-4b02-abe5-f8f3255d0942 a316605a9f8c4153b537147920200beb b4964854cdd04011884ad74cb840f279 - default default] found 0 providers with available 2 VCPU _get_provider_ids_matching /usr/lib/python2.7/site-packages/nova/api/openstack/placement/objects/resource_provider.py:2928
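
That debug line shows the scheduler asking placement for providers with 2 VCPU, which no ironic node can satisfy once VCPU inventory is no longer reported. The query shape is the real placement GET /allocation_candidates API; the helper below is just an illustrative formatter:

```python
def allocation_candidates_url(resources):
    """Format GET /allocation_candidates?resources=... the way placement
    expects: comma-separated CLASS:amount pairs."""
    qs = ",".join("%s:%d" % (rc, amt) for rc, amt in sorted(resources.items()))
    return "/allocation_candidates?resources=" + qs

# A request still carrying VCPU matches no ironic providers ("found 0
# providers with available 2 VCPU"); a resource-class-only request is what
# a correctly configured baremetal flavor should produce.
print(allocation_candidates_url({"VCPU": 2, "MEMORY_MB": 4096}))
print(allocation_candidates_url({"CUSTOM_BAREMETAL": 1}))
```

So even with Yatin's filter patch in place, a flavor that still translates to a VCPU request will reproduce this symptom.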

Reviewed: https://review.openstack.org/598167
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=49916c09216479a8dd54e55b4c6e86dae8246fa3
Submitter: Zuul
Branch: stable/rocky

commit 49916c09216479a8dd54e55b4c6e86dae8246fa3
Author: yatin <email address hidden>
Date: Fri Aug 24 11:36:25 2018 +0530

    Do not enable Ram/Disk Filter with filter_scheduler

    Core/Ram/Disk Filters are not required when using filter_scheduler.

    After https://review.openstack.org/#/c/565841 when using these
    Filters nova is not scheduling to the ironic nodes and overcloud
    deployment fails.
    For now just testing the undercloud, good to see what scheduler/filters
    are being enabled in overcloud and reflect there as well.

    Related-Bug: #1787910
    Depends-On: Ia82f1c6be0d5504498e77a90268cad8abecdeae2
    Change-Id: I0e376d99adeaa318118833018be81491c6b14095
    (cherry picked from commit 04b235652b44701b8703f63aee10fac6fad13ced)

tags: added: in-stable-rocky
wes hayutin (weshayutin) on 2018-09-11
Changed in tripleo:
status: Triaged → Fix Released

Reviewed: https://review.openstack.org/608894
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=0ec9a3db94a6a0e09387f7d86860d93708c524db
Submitter: Zuul
Branch: master

commit 0ec9a3db94a6a0e09387f7d86860d93708c524db
Author: Marios Andreou <email address hidden>
Date: Tue Oct 9 12:16:22 2018 +0300

    Remove deprecated Ram/Disk filters in NovaSchedulerDefaultFilters

    As reported in the related bug below and merged for the undercloud
    with https://review.openstack.org/#/c/598167 the Ram/Disk filters
    are deprecated since [1] so we should stop using them.

    [1] https://review.openstack.org/#/c/596502/
    Related-Bug: 1787910
    Change-Id: Ib3585b4c04c974c34d61b868d0454df03c1a2aed

Oliver Walsh (owalsh) wrote :

Backporting the tripleo change as far back as Pike, as that's where the nova defaults dropped these filters.

melanie witt (melwitt) wrote :

Adding a note here because I was looking to understand why we didn't catch the ResourceProviderSyncFailed bug [1] where we weren't sending the needed microversion 1.26 in update_from_provider_tree.

If I'm understanding correctly, there are two different issues captured in this launchpad bug:

1. ResourceProviderSyncFailed because of not sending microversion 1.26 in all places it was needed
2. TripleO CI was running with core/ram/disk filters enabled and that made it fail

The CI failure was caused by use of the core/ram/disk filters and the only way to fix it was to discontinue use of the filters. And the ResourceProviderSyncFailed issue was caught by chance at the same time because the error was noticed in the logs. Otherwise, ResourceProviderSyncFailed alone did not cause any CI to fail.

[1] http://lists.openstack.org/pipermail/openstack-dev/2018-October/135494.html

Reviewed: https://review.openstack.org/611286
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=1d736b7e675176fa7170f82e0545ff156fea6f76
Submitter: Zuul
Branch: stable/rocky

commit 1d736b7e675176fa7170f82e0545ff156fea6f76
Author: Marios Andreou <email address hidden>
Date: Tue Oct 9 12:16:22 2018 +0300

    Remove deprecated Ram/Disk filters in NovaSchedulerDefaultFilters

    As reported in the related bug below and merged for the undercloud
    with https://review.openstack.org/#/c/598167 the Ram/Disk filters
    are deprecated since [1] so we should stop using them.

    [1] https://review.openstack.org/#/c/596502/
    Related-Bug: 1787910
    Change-Id: Ib3585b4c04c974c34d61b868d0454df03c1a2aed
    (cherry picked from commit 0ec9a3db94a6a0e09387f7d86860d93708c524db)

Change abandoned by wes hayutin (<email address hidden>) on branch: master
Review: https://review.openstack.org/596610

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.

Reviewed: https://review.opendev.org/660023
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=306412539a92c03ac06c0bad38d6d14f2ea2f4f3
Submitter: Zuul
Branch: stable/queens

commit 306412539a92c03ac06c0bad38d6d14f2ea2f4f3
Author: Marios Andreou <email address hidden>
Date: Tue Oct 9 12:16:22 2018 +0300

    Remove deprecated Ram/Disk filters in NovaSchedulerDefaultFilters

    As reported in the related bug below and merged for the undercloud
    with https://review.openstack.org/#/c/598167 the Ram/Disk filters
    are deprecated since [1] so we should stop using them.

    Conflicts:
           environments/neutron-ml2-ovn-hw-offload.yaml
           environments/ovs-hw-offload.yaml
           environments/services-docker/neutron-sriov.yaml

    [1] https://review.openstack.org/#/c/596502/
    Related-Bug: 1787910
    Change-Id: Ib3585b4c04c974c34d61b868d0454df03c1a2aed
    (cherry picked from commit 0ec9a3db94a6a0e09387f7d86860d93708c524db)
    (cherry picked from commit 1d736b7e675176fa7170f82e0545ff156fea6f76)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/664524
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=89b09d3ed18f07f571b58d6a8049f08c9216e12e
Submitter: Zuul
Branch: stable/queens

commit 89b09d3ed18f07f571b58d6a8049f08c9216e12e
Author: Rajesh Tailor <email address hidden>
Date: Tue Jun 11 13:53:21 2019 +0530

    Remove deprecated Ram/Disk filters in scheduler_default_filters

    After https://review.openstack.org/#/c/565841 when using these
    Filters nova is not scheduling to the ironic nodes and overcloud
    deployment fails.

    Ram/Disk Filters are not required when using filter_scheduler.

    Since stable/queens doesn't have containerized undercloud, need
    to remove from instack-undercloud.

    Ram/Disk filters are deprecated in [1].
    [1] https://review.opendev.org/#/c/596502/

    Change-Id: I2515438c5ca80758be173523b0bf4c031b54706f
    Related-Bug: 1787910
