OpenStack Compute (nova)

Continual warnings in n-cpu logs about being unable to delete inventory for an ironic node with an instance on it

Bug #1710141 reported by Matt Riedemann on 2017-08-11

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
OpenStack Compute (nova)	Fix Released	High	Dmitry Tantsur
Ocata	New	Undecided	Belmiro Moreira

Bug Description

Seen here:

http://logs.openstack.org/54/487954/12/check/gate-tempest-dsvm-ironic-ipa-wholedisk-bios-agent_ipmitool-tinyipa-ubuntu-xenial-nv/041c03a/logs/screen-n-cpu.txt.gz#_Aug_09_19_31_21_450705

Aug 09 19:31:21.450705 ubuntu-xenial-internap-mtl01-10351013 nova-compute[19132]: WARNING nova.scheduler.client.report [None req-9db22a6d-e88a-42b0-879e-8fe523dcc664 None None] [req-2eead243-5e63-4dd0-a208-4ceed95478ff] We cannot delete inventory 'VCPU, MEMORY_MB, DISK_GB' for resource provider 38b274b2-2e37-4c23-ad6f-d86c1f0a0e3f because the inventory is in use.

As soon as an ironic node has an instance built on it, the node state is ACTIVE which means that this method returns True:

https://github.com/openstack/nova/blob/c2d33c3271370358d48553233b41bf9119d834fb/nova/virt/ironic/driver.py#L176

That reports the node as unavailable, presumably because it's wholly consumed.

That's used here:

https://github.com/openstack/nova/blob/c2d33c3271370358d48553233b41bf9119d834fb/nova/virt/ironic/driver.py#L324

And that's checked here when reporting inventory to the resource tracker:

https://github.com/openstack/nova/blob/c2d33c3271370358d48553233b41bf9119d834fb/nova/virt/ironic/driver.py#L741

Which then tries to delete the inventory for the node resource provider in placement, which fails because it's already got an instance running on it that is consuming inventory:

http://logs.openstack.org/54/487954/12/check/gate-tempest-dsvm-ironic-ipa-wholedisk-bios-agent_ipmitool-tinyipa-ubuntu-xenial-nv/041c03a/logs/screen-n-cpu.txt.gz#_Aug_09_19_31_21_450705

Aug 09 19:31:21.391146 ubuntu-xenial-internap-mtl01-10351013 nova-compute[19132]: INFO nova.scheduler.client.report [None req-9db22a6d-e88a-42b0-879e-8fe523dcc664 None None] Compute node 38b274b2-2e37-4c23-ad6f-d86c1f0a0e3f reported no inventory but previous inventory was detected. Deleting existing inventory records.
Aug 09 19:31:21.450705 ubuntu-xenial-internap-mtl01-10351013 nova-compute[19132]: WARNING nova.scheduler.client.report [None req-9db22a6d-e88a-42b0-879e-8fe523dcc664 None None] [req-2eead243-5e63-4dd0-a208-4ceed95478ff] We cannot delete inventory 'VCPU, MEMORY_MB, DISK_GB' for resource provider 38b274b2-2e37-4c23-ad6f-d86c1f0a0e3f because the inventory is in use.
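The sequence described above can be sketched roughly as follows. This is a simplified, hypothetical model and not nova's code: the node and placement structures are plain dicts, and the names node_resources_unavailable, get_inventory, sync_inventory and InventoryInUse are made up for illustration.

```python
# Hypothetical, simplified model of the flow in this bug; not nova's code.
ACTIVE = "active"


class InventoryInUse(Exception):
    """Stands in for the conflict raised when in-use inventory is deleted."""


def node_resources_unavailable(node):
    # Assumption: a node with an instance on it (state ACTIVE) counts as
    # unavailable, as the bug description says.
    return (node["provision_state"] == ACTIVE
            or node.get("instance_uuid") is not None)


def get_inventory(node):
    # An unavailable node reports an empty inventory.
    if node_resources_unavailable(node):
        return {}
    props = node["properties"]
    return {
        "VCPU": {"total": props["cpus"]},
        "MEMORY_MB": {"total": props["memory_mb"]},
        "DISK_GB": {"total": props["local_gb"]},
    }


def sync_inventory(placement, rp_uuid, inventory):
    # Empty report plus existing records means "delete the inventory",
    # which fails while allocations still consume it.
    if not inventory and placement["inventories"].get(rp_uuid):
        if placement["allocations"].get(rp_uuid):
            raise InventoryInUse(rp_uuid)
        del placement["inventories"][rp_uuid]
        return
    if inventory:
        placement["inventories"][rp_uuid] = inventory
```

In this model every periodic update for a deployed node takes the delete path and hits InventoryInUse, which corresponds to the repeated warning in the n-cpu log.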

This is also bad because if the node was updated with a resource_class, that resource class won't be automatically created in Placement here:

https://github.com/openstack/nova/blob/c2d33c3271370358d48553233b41bf9119d834fb/nova/scheduler/client/report.py#L789

Because the driver didn't report it in the get_inventory method.
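A minimal sketch of why the resource class never gets created, assuming the set of custom classes to create is derived only from what the driver reports. The function names and the normalization rule (uppercase, non-alphanumerics replaced by underscores, CUSTOM_ prefix) are assumptions for illustration, not the actual report client code.

```python
import re


def normalize_name(resource_class):
    # Assumed normalization: "baremetal-gold" -> "CUSTOM_BAREMETAL_GOLD".
    return "CUSTOM_" + re.sub(r"[^0-9A-Z]+", "_", resource_class.upper())


def report_inventory(node, unavailable):
    # Hypothetical driver-side report: empty for unavailable nodes.
    if unavailable:
        return {}
    inventory = {"VCPU": {"total": node["cpus"]}}
    if node.get("resource_class"):
        inventory[normalize_name(node["resource_class"])] = {"total": 1}
    return inventory


def classes_to_ensure(inventory):
    # Only classes present in the reported inventory can be created.
    return {rc for rc in inventory if rc.startswith("CUSTOM_")}
```

With an empty report for a deployed node, classes_to_ensure returns an empty set, so a resource_class added to that node is never seen by the code that would create it.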

And that has an impact on this code to migrate instance.flavor.extra_specs to have custom resource class overrides from ironic nodes that now have a resource_class set:

https://review.openstack.org/#/c/487954/

So we've got a bit of a chicken and egg problem here.

Manually testing the ironic flavor migration code hits this problem, as seen here:

http://paste.openstack.org/show/618160/

Matt Riedemann (mriedem) on 2017-08-11

Changed in nova:
status:	New → Triaged
importance:	Undecided → High
tags:	added: pike-rc-potential

Matt Riedemann (mriedem) wrote on 2017-08-11:

One question is, why don't we report inventory for an ACTIVE node? If the inventory is 1 but an instance is also allocating that 1 of whatever resource class, then isn't that sufficient? In other words, if an instance is consuming all of the node inventory, that should take the node out of scheduling decisions for building new instances, which is also how things work for regular compute nodes for building VMs.
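The reasoning in that question can be illustrated with a capacity check of the form capacity = (total - reserved) * allocation_ratio. This is a sketch with a hypothetical helper name, not the placement service's implementation.

```python
def has_capacity(total, used, requested, reserved=0, allocation_ratio=1.0):
    # A request fits only if existing usage plus the request stays
    # within the provider's capacity for that resource class.
    capacity = (total - reserved) * allocation_ratio
    return used + requested <= capacity
```

With an inventory of 1 and an existing allocation of 1, has_capacity(total=1, used=1, requested=1) is False, so a fully consumed node drops out of scheduling without its inventory having to be removed.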

Dmitry Tantsur (divius) on 2017-08-11

Changed in nova:
assignee:	nobody → Dmitry Tantsur (divius)
status:	Triaged → In Progress

Matt Riedemann (mriedem) wrote on 2017-08-11:

If we're just fixing the warnings, then this isn't pike-rc-potential for rc2; we could fix the warnings issue and backport it to stable/pike and stable/ocata, since it's not a regression in pike.

Vladyslav Drok (vdrok) wrote on 2017-08-11:

Patch is at https://review.openstack.org/492964

OpenStack Infra (hudson-openstack) wrote on 2017-08-16: Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/494216

OpenStack Infra (hudson-openstack) wrote on 2017-08-16: Fix merged to nova (master)

Reviewed: https://review.openstack.org/492964
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9ed692bf8c84e0a702536101cd6cb084d33e1c26
Submitter: Jenkins
Branch: master

commit 9ed692bf8c84e0a702536101cd6cb084d33e1c26
Author: Dmitry Tantsur <email address hidden>
Date: Fri Aug 11 13:52:02 2017 +0200

    Fix reporting inventory for provisioned nodes in the Ironic driver

    Currently we report the full inventory for available nodes, and an empty
    inventory for nodes that are deployed to or otherwise unavailable.

    Reporting an empty inventory for deployed nodes has 2 bad consequences:
    1. Nova tries deleting the inventory for Placement, which fails, because
       the resources are still in use. This results in nasty warnings.
    2. When adding a resource class to a deployed node, it does not get into
       inventory, and thus does not get to Placement. It results in an error
       later on, when the custom resource class is not found.

    This patch fixes the latter problem by
    1. Always reporting the custom resource class for deployed nodes, if present.
    2. Reporting VCPUS/memory/disk in exactly the same amount, as it is configured
       in the ironic node's properties.

    As a side effect, the warnings are no longer shown for deployed nodes.
    They still appear, however, for nodes during cleaning.

    Partial-Bug: #1710141
    Change-Id: I2fd1e4a95f000da19864e75299afa51527697101
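The behaviour the commit message describes could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged patch: the node is a plain dict, the state name "available" and the helper names are hypothetical, and a total of 1 for the custom resource class is assumed.

```python
import re

AVAILABLE = "available"  # hypothetical state name for this sketch


def normalize_name(resource_class):
    # Assumed normalization of an ironic resource_class into a custom class.
    return "CUSTOM_" + re.sub(r"[^0-9A-Z]+", "_", resource_class.upper())


def get_inventory_after_fix(node):
    deployed = node.get("instance_uuid") is not None
    # Nodes that are neither deployed nor available (for example, being
    # cleaned) still report nothing, so the warning can still appear there.
    if not deployed and node["provision_state"] != AVAILABLE:
        return {}
    props = node["properties"]
    # Deployed nodes report the amounts configured in the node properties.
    inventory = {
        "VCPU": {"total": props["cpus"]},
        "MEMORY_MB": {"total": props["memory_mb"]},
        "DISK_GB": {"total": props["local_gb"]},
    }
    # The custom resource class is reported whenever the node has one.
    if node.get("resource_class"):
        inventory[normalize_name(node["resource_class"])] = {"total": 1}
    return inventory
```

Because a deployed node now reports a non-empty inventory, the delete path that produced the warning is not taken for it, and a newly set resource_class reaches Placement.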

OpenStack Infra (hudson-openstack) wrote on 2017-08-16: Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/494216
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c92337bdf80fea4c0a8ebb433bacec4cc07f7a94
Submitter: Jenkins
Branch: stable/pike

commit c92337bdf80fea4c0a8ebb433bacec4cc07f7a94
Author: Dmitry Tantsur <email address hidden>
Date: Fri Aug 11 13:52:02 2017 +0200

    Fix reporting inventory for provisioned nodes in the Ironic driver

    Currently we report the full inventory for available nodes, and an empty
    inventory for nodes that are deployed to or otherwise unavailable.

    As a side effect, the warnings are no longer shown for deployed nodes.
    They still appear, however, for nodes during cleaning.

    Partial-Bug: #1710141
    Change-Id: I2fd1e4a95f000da19864e75299afa51527697101
    (cherry picked from commit 9ed692bf8c84e0a702536101cd6cb084d33e1c26)

tags:	added: in-stable-pike

Matt Riedemann (mriedem) on 2017-08-23

tags:	removed: pike-rc-potential

Matt Riedemann (mriedem) on 2017-10-18

Changed in nova:
status:	In Progress → Fix Released

Mark Goddard (mgoddard) wrote on 2018-01-24:

I think we're seeing a similar issue to this on Pike with this fix applied. We have resource providers for active ironic nodes that have no inventory in placement. When the resource tracker tries to update the RP allocations, we get a conflict because there is no inventory to allocate from for the RP:

Unable to submit allocation for instance 5073a390-6c10-4b1b-a097-f9f67485463c (409 <html>
<head>
  <title>409 Conflict</title>
</head>
<body>
  <h1>409 Conflict</h1>
  There was a conflict when trying to complete your request.<br /><br />
Unable to allocate inventory: Inventory for 'VCPU, MEMORY_MB, DISK_GB, CUSTOM_CUSTOM_B' on resource provider '2ce4ab0c-0d8b-4ae2-b84d-4bbc1888df52' invalid.
</body>
</html>)

It's likely that at some point the original RPs for the ironic nodes were deleted - due to https://bugs.launchpad.net/nova/+bug/1714248.

It seems that the inventory is not reported to the resource tracker since the fix for https://bugs.launchpad.net/nova/+bug/1723423 was merged, as it reports no inventory when the node is unavailable (note the TODO [1]).

Environment: CentOS 7.4 kolla containers, RDO python-nova-16.0.3-2.el7.noarch.

[1] https://github.com/openstack/nova/blob/d25feca/nova/virt/ironic/driver.py#L758