Deleted ironic node has an inventory in nova_api database

Bug #1661258 reported by Vladyslav Drok
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Chris Dent

Bug Description

Running latest devstack, ironic and nova, I get the following error when I request an instance:

| fault | {"message": "Node 6cc8803d-4e77-4948-b653-663d8d5e52b7 could not be found. (HTTP 404)", "code": 500, "details": " File \"/opt/stack/nova/nova/compute/manager.py\", line 1780, in _do_build_and_run_instance |
| | filter_properties) |
| | File \"/opt/stack/nova/nova/compute/manager.py\", line 2016, in _build_and_run_instance |
| | instance_uuid=instance.uuid, reason=six.text_type(e)) |
| | ", "created": "2017-02-02T13:42:01Z"} |

On ironic side, this node was indeed deleted, it is also deleted from nova.compute_nodes table:

| created_at | updated_at | deleted_at | id | service_id | vcpus | memory_mb | local_gb | vcpus_used | memory_mb_used | local_gb_used | hypervisor_type | hypervisor_version | cpu_info | disk_available_least | free_ram_mb | free_disk_gb | current_workload | running_vms | hypervisor_hostname | deleted | host_ip | supported_instances | pci_stats | metrics | extra_resources | stats | numa_topology | host | ram_allocation_ratio | cpu_allocation_ratio | uuid | disk_allocation_ratio |
...................................................
| 2017-02-02 12:20:27 | 2017-02-02 13:20:15 | 2017-02-02 13:21:15 | 2 | NULL | 1 | 1536 | 10 | 0 | 0 | 0 | ironic | 1 | | 10 | 1536 | 10 | 0 | 0 | 6cc8803d-4e77-4948-b653-663d8d5e52b7 | 2 | 192.168.122.22 | [["x86_64", "baremetal", "hvm"]] | {"nova_object.version": "1.1", "nova_object.changes": ["objects"], "nova_object.name": "PciDevicePoolList", "nova_object.data": {"objects": []}, "nova_object.namespace": "nova"} | [] | NULL | {"cpu_arch": "x86_64"} | NULL | ubuntu | 1 | 0 | 035be695-0797-44b3-930b-42349e40579e | 0 |

But in nova_api.inventories it's still there:

| created_at | updated_at | id | resource_provider_id | resource_class_id | total | reserved | min_unit | max_unit | step_size | allocation_ratio |
..........................
| 2017-02-02 13:20:14 | NULL | 13 | 2 | 0 | 1 | 0 | 1 | 1 | 1 | 16 |
| 2017-02-02 13:20:14 | NULL | 14 | 2 | 1 | 1536 | 0 | 1 | 1536 | 1 | 1 |
| 2017-02-02 13:20:14 | NULL | 15 | 2 | 2 | 10 | 0 | 1 | 10 | 1 | 1 |

nova_api.resource_providers bit:
| created_at | updated_at | id | uuid | name | generation | can_host |
.........................
| 2017-02-02 12:20:27 | 2017-02-02 13:20:14 | 2 | 035be695-0797-44b3-930b-42349e40579e | 6cc8803d-4e77-4948-b653-663d8d5e52b7 | 7 | 0 |

Waiting for resource tracker run did not help, node's been deleted for ~30 minutes already and the inventory is still there.

Code versions:
Devstack commit debc695ddfc8b7b2aeb53c01c624e15f69ed9fa2 Updated from generate-devstack-plugins-list.
Nova commit 5dad7eaef7f8562425cce6b233aed610ca2d3148 Merge "doc: update the man page entry for nova-manage db sync"
Ironic commit 5071b99835143ebcae876432e2982fd27faece10 Merge "Remove deprecated heartbeat policy check"

If it is anyhow relevant, I also run two nova-computes on the same host, I've set host=test for the second one, other than that all configs are the same. I was trying to reproduce another cell-related issue, and was creating/deleting ironic nodes, so that they map to the second nova-compute by the hash_ring.

Matt Riedemann (mriedem)
tags: added: ironic placement resource-tracker
Revision history for this message
Jay Pipes (jaypipes) wrote :

@vdrok: Do me a favor and comment out the call to node_is_available() here:

https://github.com/openstack/nova/blob/master/nova/compute/resource_tracker.py#L414

and re-run.

Revision history for this message
Vladyslav Drok (vdrok) wrote :
Download full text (4.7 KiB)

The bit from the n-cpu log:

2017-02-02 16:30:34.816 DEBUG nova.compute.manager [req-d0b3d3d8-d1d2-43de-8eec-efaa0d75b7e3 admin admin] [instance: 6e971b2d-31ff-4aa2-b737-c5
ece9d37d23] Start spawning the instance on the hypervisor. from (pid=30650) _build_and_run_instance /opt/stack/nova/nova/compute/manager.py:192
5
2017-02-02 16:30:34.816 DEBUG nova.virt.ironic.driver [req-d0b3d3d8-d1d2-43de-8eec-efaa0d75b7e3 admin admin] [instance: 6e971b2d-31ff-4aa2-b737
-c5ece9d37d23] Spawn called for instance from (pid=30650) spawn /opt/stack/nova/nova/virt/ironic/driver.py:802
2017-02-02 16:30:34.854 ERROR nova.compute.manager [req-d0b3d3d8-d1d2-43de-8eec-efaa0d75b7e3 admin admin] [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] Instance failed to spawn
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] Traceback (most recent call last):
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] File "/opt/stack/nova/nova/compute/manager.py", line 2125, in _build_resources
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] yield resources
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] File "/opt/stack/nova/nova/compute/manager.py", line 1930, in _build_and_run_instance
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] block_device_info=block_device_info)
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] File "/opt/stack/nova/nova/virt/ironic/driver.py", line 812, in spawn
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] node = self._get_node(node_uuid)
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] File "/opt/stack/nova/nova/virt/ironic/driver.py", line 160, in _get_node
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] fields=_NODE_FIELDS)
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] File "/opt/stack/nova/nova/virt/ironic/client_wrapper.py", line 148, in call
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] return self._multi_getattr(client, method)(*args, **kwargs)
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] File "/usr/local/lib/python2.7/dist-packages/ironicclient/v1/node.py", line 195, in get
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] return self._get(resource_id=node_id, fields=fields)
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] File "/usr/local/lib/python2.7/dist-packages/ironicclient/common/base.py", line 88, in _get
2017-02-02 16:30:34.854 TRACE nova.compute.manager [instance: 6e971b2d-31ff-4aa2-b737-c5ece9d37d23] return self._list(self._path(resource_id))[0]
2017-02-02 16:30:34.854 TRACE nova.co...

Read more...

Revision history for this message
Matt Riedemann (mriedem) wrote :

Seems we should have the compute manager call into the scheduler report client and delete the resource provider when the orphaned compute node is deleted here:

https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L6585

tags: added: compute
tags: added: ocata-rc-potential
Changed in nova:
importance: Undecided → High
status: New → Triaged
Jay Pipes (jaypipes)
Changed in nova:
assignee: nobody → Ed Leafe (ed-leafe)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/428375

Changed in nova:
status: Triaged → In Progress
Changed in nova:
assignee: Ed Leafe (ed-leafe) → Chris Dent (cdent)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/428375
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=965fffc09d6fffba7918117e170d5799c69fe99b
Submitter: Jenkins
Branch: master

commit 965fffc09d6fffba7918117e170d5799c69fe99b
Author: EdLeafe <email address hidden>
Date: Thu Feb 2 18:48:35 2017 +0000

    Delete a compute node's resource provider when node is deleted

    Currently when a compute node is deleted, its record in the cell DB is
    deleted, but its representation as a resource provider in the placement
    service remains, along with any inventory and allocations. This could
    cause the placement engine to return that provider record, even though
    the compute node no longer exists. And since the periodic "healing" by
    the resource tracker only updates compute node resources for records in
    the compute_nodes table, these old records are never removed.

    This patch adds a call to delete the resource provider when the compute
    node is deleted. It also adds a method to the scheduler report client
    to make these calls to the placement API.

    Partial-Bug: #1661258
    Closes-Bug: #1661014

    Change-Id: I6098d186d05ff8b9a568e23f860295a7bc2e6447

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/429274

Matt Riedemann (mriedem)
Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/429274
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=916ed4bfe42f8a3b2315e8cf4e07dec35d212657
Submitter: Jenkins
Branch: master

commit 916ed4bfe42f8a3b2315e8cf4e07dec35d212657
Author: Matt Riedemann <email address hidden>
Date: Sat Feb 4 09:07:35 2017 -0500

    Cleanup the caches when deleting a resource provider

    When we delete a resource provider in the scheduler
    report we should also cleanup any entries we have in
    the resource provider and aggregate map caches.

    Change-Id: I7f15720d3aaff5d5a504d04729b8003f51e2c37d
    Related-Bug: #1661258

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/604021

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/604021
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ed36e85f6f5ba5bcf15267036d666bd42142cd4a
Submitter: Zuul
Branch: stable/ocata

commit ed36e85f6f5ba5bcf15267036d666bd42142cd4a
Author: Matt Riedemann <email address hidden>
Date: Sat Feb 4 09:07:35 2017 -0500

    Cleanup the caches when deleting a resource provider

    When we delete a resource provider in the scheduler
    report we should also cleanup any entries we have in
    the resource provider and aggregate map caches.

    Change-Id: I7f15720d3aaff5d5a504d04729b8003f51e2c37d
    Related-Bug: #1661258
    (cherry picked from commit 916ed4bfe42f8a3b2315e8cf4e07dec35d212657)

tags: added: in-stable-ocata
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.