Cleaning up stale resource allocations can fail with InstanceNotFound

Bug #1693899 reported by Matt Riedemann
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Chris Dent

Bug Description

Noticed what appears to be a race on instance delete: while removing the resource provider allocation records for a deleted instance, the attempt to load that instance from the database fails because it has already been deleted:

http://logs.openstack.org/32/468232/1/gate/gate-tempest-dsvm-cells-ubuntu-xenial/ad969ba/logs/screen-n-cpu.txt.gz?level=TRACE#_May_26_17_29_52_292260

May 26 17:29:52.292260 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager [req-c0f69b9e-c6eb-43e1-9bc9-372a42901987 None None] Error updating resources for node ubuntu-xenial-ovh-bhs1-9010496.
May 26 17:29:52.292537 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager Traceback (most recent call last):
May 26 17:29:52.292730 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/manager.py", line 6588, in update_available_resource_for_node
May 26 17:29:52.292908 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager rt.update_available_resource(context, nodename)
May 26 17:29:52.293082 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 626, in update_available_resource
May 26 17:29:52.293266 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager self._update_available_resource(context, resources)
May 26 17:29:52.293430 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in inner
May 26 17:29:52.293592 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager return f(*args, **kwargs)
May 26 17:29:52.293750 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 667, in _update_available_resource
May 26 17:29:52.293909 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager self._update_usage_from_instances(context, instances, nodename)
May 26 17:29:52.294070 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 1047, in _update_usage_from_instances
May 26 17:29:52.294235 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager self._remove_deleted_instances_allocations(context, cn)
May 26 17:29:52.294453 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 1061, in _remove_deleted_instances_allocations
May 26 17:29:52.294628 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager expected_attrs=[])
May 26 17:29:52.294797 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 177, in wrapper
May 26 17:29:52.294983 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager args, kwargs)
May 26 17:29:52.295107 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager File "/opt/stack/new/nova/nova/conductor/rpcapi.py", line 240, in object_class_action_versions
May 26 17:29:52.295205 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager args=args, kwargs=kwargs)
May 26 17:29:52.295290 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 169, in call
May 26 17:29:52.295374 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager retry=self.retry)
May 26 17:29:52.295457 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 123, in _send
May 26 17:29:52.295547 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager timeout=timeout, retry=retry)
May 26 17:29:52.295630 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 505, in send
May 26 17:29:52.295712 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager retry=retry)
May 26 17:29:52.295795 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 496, in _send
May 26 17:29:52.295877 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager raise result
May 26 17:29:52.295959 ubuntu-xenial-ovh-bhs1-9010496 nova-compute[11974]: ERROR nova.compute.manager InstanceNotFound_Remote: Instance 9ecdbc52-8f98-45aa-9814-ade07faa1026 could not be found.

We should handle the InstanceNotFound, since in this path we are cleaning up anyway: an instance that is already gone is exactly the case this cleanup targets.

Looks like this is triggered by something semi-recent:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Error%20updating%20resources%20for%20node%5C%22%20AND%20message%3A%5C%22_remove_deleted_instances_allocations%5C%22%20AND%20message%3A%5C%22InstanceNotFound_Remote%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

Matt Riedemann (mriedem) wrote:

This merged just recently, and it looks like it's not accounting for the instance already being gone:

https://github.com/openstack/nova/commit/439f67e7bc795f171af8248400c3f3e17a0bbe7e
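
For illustration, here is a minimal, self-contained Python sketch of the failure mode. This is not the nova code: fake_db, get_by_uuid and cleanup_allocations are hypothetical stand-ins for the instances table, objects.Instance.get_by_uuid and the resource tracker's _remove_deleted_instances_allocations loop.

    class InstanceNotFound(Exception):
        """Stands in for nova.exception.InstanceNotFound."""

    # Hypothetical stand-ins: the instances table and the allocations
    # recorded against this compute node in placement.
    fake_db = {'9ecdbc52-8f98-45aa-9814-ade07faa1026': {'deleted': False}}
    allocations = {'9ecdbc52-8f98-45aa-9814-ade07faa1026': {'VCPU': 1}}

    def get_by_uuid(uuid):
        # Mimics objects.Instance.get_by_uuid: raises if the row is gone.
        try:
            return fake_db[uuid]
        except KeyError:
            raise InstanceNotFound(uuid)

    def cleanup_allocations():
        for uuid in list(allocations):
            # Unguarded lookup: if the delete path removed the instance
            # after the allocation list was built, InstanceNotFound
            # escapes here and aborts the whole periodic update, leaving
            # the stale allocation in place.
            instance = get_by_uuid(uuid)
            if instance['deleted']:
                del allocations[uuid]

    # Simulate the race: the delete path wins, then the periodic task runs.
    del fake_db['9ecdbc52-8f98-45aa-9814-ade07faa1026']
    cleanup_allocations()  # raises InstanceNotFound; allocation left behind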

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Chris Dent (cdent)
Changed in nova:
assignee: nobody → Chris Dent (cdent)
OpenStack Infra (hudson-openstack) wrote: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/468517

Changed in nova:
status: Confirmed → In Progress
Changed in nova:
assignee: Chris Dent (cdent) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Chris Dent (cdent)
OpenStack Infra (hudson-openstack) wrote: Fix merged to nova (master)

Reviewed: https://review.openstack.org/468517
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=98b8e39ac5f7b3f2bb06ca415bbb806705461d74
Submitter: Jenkins
Branch: master

commit 98b8e39ac5f7b3f2bb06ca415bbb806705461d74
Author: Chris Dent <email address hidden>
Date: Fri May 26 19:12:46 2017 +0000

    Catch InstanceNotFound when deleting allocations

    If the list of allocations that are going to be deleted includes an
    instance that can no longer be loaded by get_by_uuid (because
    something has already removed it) we need to catch InstanceNotFound
    and remove the allocations as planned. Without this change, the
    exception is not caught, causing an error and leaving behind the
    allocations we don't want to be there.

    Change-Id: I204b077c287141a7f5643b2cc0065da2ba395c03
    Closes-Bug: #1693899
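
A minimal sketch of the shape of this fix, reusing the toy scaffolding from the earlier sketch (again hypothetical names, not the merged nova code): an instance that can no longer be loaded is treated as already deleted, so its stale allocation is removed instead of aborting the run.

    def cleanup_allocations_fixed():
        for uuid in list(allocations):
            try:
                instance = get_by_uuid(uuid)
            except InstanceNotFound:
                # The instance is already gone; that is exactly the case
                # the cleanup targets, so drop the stale allocation.
                del allocations[uuid]
                continue
            if instance['deleted']:
                del allocations[uuid]

With this guard in place, the simulated race completes cleanly and the stale allocation is removed rather than left behind.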

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote: Fix included in openstack/nova 16.0.0.0b2

This issue was fixed in the openstack/nova 16.0.0.0b2 development milestone.
