ResourceTracker.compute_nodes won't try to create a ComputeNode a second time if the first create() fails

Bug #1839674 reported by Matt Riedemann on 2019-08-09
This bug affects 1 person
Affects                   Importance  Assigned to
OpenStack Compute (nova)  Medium      Matt Riedemann
Ocata                     Medium      Unassigned
Pike                      Medium      Matt Riedemann
Queens                    Medium      Matt Riedemann
Rocky                     Medium      Matt Riedemann
Stein                     Medium      Matt Riedemann

Bug Description

I found this while writing a functional recreate test for bug 1839560.

As of this change in Ocata:

https://github.com/openstack/nova/commit/1c967593fbb0ab8b9dc8b0b509e388591d32f537

The ResourceTracker.compute_nodes dict will store the ComputeNode object *before* trying to create it:

https://github.com/openstack/nova/blob/6b7d0caad86fe32ffc49a8672de1eb7258f3b919/nova/compute/resource_tracker.py#L570-L571

The problem is that if ComputeNode.create() fails for any reason, the next run of update_available_resource won't try to create the ComputeNode again, because _init_compute_node returns early once the node is already in the compute_nodes dict:

https://github.com/openstack/nova/blob/6b7d0caad86fe32ffc49a8672de1eb7258f3b919/nova/compute/resource_tracker.py#L546

Eventually, saving that never-created ComputeNode fails because the object has no id, and you get errors like this:

    b'2019-08-09 17:02:59,356 ERROR [nova.compute.manager] Error updating resources for node node2.'
    b'Traceback (most recent call last):'
    b' File "/home/osboxes/git/nova/nova/compute/manager.py", line 8250, in _update_available_resource_for_node'
    b' startup=startup)'
    b' File "/home/osboxes/git/nova/nova/compute/resource_tracker.py", line 715, in update_available_resource'
    b' self._update_available_resource(context, resources, startup=startup)'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 328, in inner'
    b' return f(*args, **kwargs)'
    b' File "/home/osboxes/git/nova/nova/compute/resource_tracker.py", line 796, in _update_available_resource'
    b' self._update(context, cn, startup=startup)'
    b' File "/home/osboxes/git/nova/nova/compute/resource_tracker.py", line 1052, in _update'
    b' self.old_resources[nodename] = old_compute'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__'
    b' self.force_reraise()'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise'
    b' six.reraise(self.type_, self.value, self.tb)'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/six.py", line 693, in reraise'
    b' raise value'
    b' File "/home/osboxes/git/nova/nova/compute/resource_tracker.py", line 1046, in _update'
    b' compute_node.save()'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper'
    b' return fn(self, *args, **kwargs)'
    b' File "/home/osboxes/git/nova/nova/objects/compute_node.py", line 352, in save'
    b' db_compute = db.compute_node_update(self._context, self.id, updates)'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 67, in getter'
    b' self.obj_load_attr(name)'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 603, in obj_load_attr'
    b' _("Cannot load \'%s\' in the base class") % attrname)'
    b"NotImplementedError: Cannot load 'id' in the base class"

We should only map the ComputeNode when we've successfully created it.
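
To make the ordering problem concrete, here is a minimal sketch of the failure and of the proposed reordering. Nothing in it is nova code: FakeComputeNode, BuggyTracker, FixedTracker and run_two_passes are hypothetical stand-ins, and the fail_create flag exists only to simulate a failed create(). It assumes the only relevant behaviour is "map before create" versus "map after create".

```python
class FakeComputeNode:
    """Hypothetical stand-in for the ComputeNode object (not nova code)."""

    def __init__(self, nodename, fail_create=False):
        self.nodename = nodename
        self.id = None  # only set once the record exists in the "DB"
        self._fail_create = fail_create

    def create(self):
        if self._fail_create:
            raise RuntimeError("simulated DB failure")
        self.id = 1

    def save(self):
        if self.id is None:
            # Mirrors the symptom in the traceback above.
            raise NotImplementedError("Cannot load 'id' in the base class")


class BuggyTracker:
    """Maps the node into compute_nodes *before* create() is attempted."""

    def __init__(self):
        self.compute_nodes = {}

    def init_compute_node(self, nodename, fail_create=False):
        if nodename in self.compute_nodes:
            return False  # early return: create() is never retried
        cn = FakeComputeNode(nodename, fail_create)
        self.compute_nodes[nodename] = cn
        cn.create()
        return True


class FixedTracker(BuggyTracker):
    """Maps the node only after create() has succeeded."""

    def init_compute_node(self, nodename, fail_create=False):
        if nodename in self.compute_nodes:
            return False
        cn = FakeComputeNode(nodename, fail_create)
        cn.create()  # if this raises, nothing is mapped
        self.compute_nodes[nodename] = cn
        return True


def run_two_passes(tracker):
    """First pass fails in create(); second pass should retry, then save."""
    try:
        tracker.init_compute_node("node2", fail_create=True)
    except RuntimeError:
        pass
    tracker.init_compute_node("node2", fail_create=False)
    try:
        tracker.compute_nodes["node2"].save()
        return "saved"
    except NotImplementedError as exc:
        return f"error: {exc}"


print(run_two_passes(BuggyTracker()))  # error: Cannot load 'id' in the base class
print(run_two_passes(FixedTracker()))  # saved
```

With the buggy ordering the second pass returns early and the stale, id-less object is what gets saved; with the reordered version the second pass creates a fresh node and the save succeeds.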

Fix proposed to branch: master
Review: https://review.opendev.org/675704

Changed in nova:
status: Triaged → In Progress

Reviewed: https://review.opendev.org/675704
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f578146f372386e1889561cba33e95495e66ce97
Submitter: Zuul
Branch: master

commit f578146f372386e1889561cba33e95495e66ce97
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 9 17:17:45 2019 -0400

    rt: only map compute node if we created it

    If ComputeNode.create() fails, the update_available_resource
    periodic will not try to create it again because it will be
    mapped in the compute_nodes dict and _init_compute_node will
    return early but trying to save changes to that ComputeNode
    object later will fail because there is no id on the object,
    since we failed to create it in the DB.

    This simply reverses the logic such that we only map the
    compute node if we successfully created it.

    Some existing _init_compute_node testing had to be changed
    since it relied on the order of when the ComputeNode object
    is created and put into the compute_nodes dict in order
    to pass the object along to some much lower-level PCI
    tracker code, which was arguably mocking too deep for a unit
    test. That is changed to avoid the low-level mocking and
    assertions and just assert that _setup_pci_tracker is called
    as expected.

    Change-Id: I9fa1d509a3de405d6246fb8670612c65c10cc93b
    Closes-Bug: #1839674
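
The testing approach described in the commit message (assert that _setup_pci_tracker is called, rather than mocking deep into PCI code) could look roughly like the sketch below. MiniTracker, its node_factory argument and check_create_failure_then_retry are hypothetical and heavily simplified; the real _init_compute_node and _setup_pci_tracker signatures differ, and calling _setup_pci_tracker right after the mapping step is an assumption of this sketch.

```python
from unittest import mock


class MiniTracker:
    """Hypothetical, simplified tracker; not nova's ResourceTracker."""

    def __init__(self, node_factory):
        self.compute_nodes = {}
        self._node_factory = node_factory

    def _setup_pci_tracker(self, cn):
        pass  # real PCI setup is out of scope for this sketch

    def init_compute_node(self, nodename):
        if nodename in self.compute_nodes:
            return
        cn = self._node_factory(nodename)
        cn.create()
        # Only map the node once create() has succeeded.
        self.compute_nodes[nodename] = cn
        self._setup_pci_tracker(cn)


def check_create_failure_then_retry():
    node = mock.Mock()
    # First create() raises, second one succeeds.
    node.create.side_effect = [RuntimeError("boom"), None]
    tracker = MiniTracker(lambda name: node)

    with mock.patch.object(tracker, "_setup_pci_tracker") as setup_pci:
        try:
            tracker.init_compute_node("node2")
        except RuntimeError:
            pass
        # Failed create: nothing mapped, PCI tracker untouched.
        assert "node2" not in tracker.compute_nodes
        setup_pci.assert_not_called()

        # Retry: create() is attempted again and the node is mapped.
        tracker.init_compute_node("node2")
        setup_pci.assert_called_once_with(node)

    return tracker.compute_nodes["node2"] is node


check_create_failure_then_retry()
```

Asserting on the call to _setup_pci_tracker keeps the test independent of when exactly the object lands in the dict relative to lower-level code.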

Changed in nova:
status: In Progress → Fix Released