ResourceTracker.compute_nodes won't try to create a ComputeNode a second time if the first create() fails

Bug #1839674 reported by Matt Riedemann
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Matt Riedemann
Ocata
Triaged
Medium
Unassigned
Pike
Fix Released
Medium
Matt Riedemann
Queens
Fix Committed
Medium
Matt Riedemann
Rocky
Fix Committed
Medium
Matt Riedemann
Stein
Fix Committed
Medium
Matt Riedemann

Bug Description

I found this while writing a functional recreate test for bug 1839560.

As of this change in Ocata:

https://github.com/openstack/nova/commit/1c967593fbb0ab8b9dc8b0b509e388591d32f537

The ResourceTracker.compute_nodes dict will store the ComputeNode object *before* trying to create it:

https://github.com/openstack/nova/blob/6b7d0caad86fe32ffc49a8672de1eb7258f3b919/nova/compute/resource_tracker.py#L570-L571

The problem is if ComputeNode.create() fails for whatever reason, the next run through update_available_resource won't try to create the ComputeNode again because of this:

https://github.com/openstack/nova/blob/6b7d0caad86fe32ffc49a8672de1eb7258f3b919/nova/compute/resource_tracker.py#L546

And eventually you get errors like this:

    b'2019-08-09 17:02:59,356 ERROR [nova.compute.manager] Error updating resources for node node2.'
    b'Traceback (most recent call last):'
    b' File "/home/osboxes/git/nova/nova/compute/manager.py", line 8250, in _update_available_resource_for_node'
    b' startup=startup)'
    b' File "/home/osboxes/git/nova/nova/compute/resource_tracker.py", line 715, in update_available_resource'
    b' self._update_available_resource(context, resources, startup=startup)'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 328, in inner'
    b' return f(*args, **kwargs)'
    b' File "/home/osboxes/git/nova/nova/compute/resource_tracker.py", line 796, in _update_available_resource'
    b' self._update(context, cn, startup=startup)'
    b' File "/home/osboxes/git/nova/nova/compute/resource_tracker.py", line 1052, in _update'
    b' self.old_resources[nodename] = old_compute'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__'
    b' self.force_reraise()'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise'
    b' six.reraise(self.type_, self.value, self.tb)'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/six.py", line 693, in reraise'
    b' raise value'
    b' File "/home/osboxes/git/nova/nova/compute/resource_tracker.py", line 1046, in _update'
    b' compute_node.save()'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper'
    b' return fn(self, *args, **kwargs)'
    b' File "/home/osboxes/git/nova/nova/objects/compute_node.py", line 352, in save'
    b' db_compute = db.compute_node_update(self._context, self.id, updates)'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 67, in getter'
    b' self.obj_load_attr(name)'
    b' File "/home/osboxes/git/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 603, in obj_load_attr'
    b' _("Cannot load \'%s\' in the base class") % attrname)'
    b"NotImplementedError: Cannot load 'id' in the base class"

We should only map the ComputeNode when we've successfully created it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675704

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/676278

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/676280

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/675704
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f578146f372386e1889561cba33e95495e66ce97
Submitter: Zuul
Branch: master

commit f578146f372386e1889561cba33e95495e66ce97
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 9 17:17:45 2019 -0400

    rt: only map compute node if we created it

    If ComputeNode.create() fails, the update_available_resource
    periodic will not try to create it again because it will be
    mapped in the compute_nodes dict and _init_compute_node will
    return early but trying to save changes to that ComputeNode
    object later will fail because there is no id on the object,
    since we failed to create it in the DB.

    This simply reverses the logic such that we only map the
    compute node if we successfully created it.

    Some existing _init_compute_node testing had to be changed
    since it relied on the order of when the ComputeNode object
    is created and put into the compute_nodes dict in order
    to pass the object along to some much lower-level PCI
    tracker code, which was arguably mocking too deep for a unit
    test. That is changed to avoid the low-level mocking and
    assertions and just assert that _setup_pci_tracker is called
    as expected.

    Change-Id: I9fa1d509a3de405d6246fb8670612c65c10cc93b
    Closes-Bug: #1839674

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/676285

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/676463

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/676278
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=648770bd6897aa2ec95df3ec55344d5803543f07
Submitter: Zuul
Branch: stable/stein

commit 648770bd6897aa2ec95df3ec55344d5803543f07
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 9 17:17:45 2019 -0400

    rt: only map compute node if we created it

    If ComputeNode.create() fails, the update_available_resource
    periodic will not try to create it again because it will be
    mapped in the compute_nodes dict and _init_compute_node will
    return early but trying to save changes to that ComputeNode
    object later will fail because there is no id on the object,
    since we failed to create it in the DB.

    This simply reverses the logic such that we only map the
    compute node if we successfully created it.

    Some existing _init_compute_node testing had to be changed
    since it relied on the order of when the ComputeNode object
    is created and put into the compute_nodes dict in order
    to pass the object along to some much lower-level PCI
    tracker code, which was arguably mocking too deep for a unit
    test. That is changed to avoid the low-level mocking and
    assertions and just assert that _setup_pci_tracker is called
    as expected.

    Change-Id: I9fa1d509a3de405d6246fb8670612c65c10cc93b
    Closes-Bug: #1839674
    (cherry picked from commit f578146f372386e1889561cba33e95495e66ce97)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/676280
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=35273a844ab2dc2494f0166d9b8228ee302acd4f
Submitter: Zuul
Branch: stable/rocky

commit 35273a844ab2dc2494f0166d9b8228ee302acd4f
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 9 17:17:45 2019 -0400

    rt: only map compute node if we created it

    If ComputeNode.create() fails, the update_available_resource
    periodic will not try to create it again because it will be
    mapped in the compute_nodes dict and _init_compute_node will
    return early but trying to save changes to that ComputeNode
    object later will fail because there is no id on the object,
    since we failed to create it in the DB.

    This simply reverses the logic such that we only map the
    compute node if we successfully created it.

    Some existing _init_compute_node testing had to be changed
    since it relied on the order of when the ComputeNode object
    is created and put into the compute_nodes dict in order
    to pass the object along to some much lower-level PCI
    tracker code, which was arguably mocking too deep for a unit
    test. That is changed to avoid the low-level mocking and
    assertions and just assert that _setup_pci_tracker is called
    as expected.

    Conflicts:
          nova/compute/resource_tracker.py
          nova/tests/unit/compute/test_resource_tracker.py

    NOTE(mriedem): The resource_tracker.py conflict is due to
    not having I14a310b20bd9892e7b34464e6baad49bf5928ece in Rocky.
    The test conflicts are due to not having change
    I37e8ad5b14262d801702411c2c87e73550adda70 in Rocky.

    Change-Id: I9fa1d509a3de405d6246fb8670612c65c10cc93b
    Closes-Bug: #1839674
    (cherry picked from commit f578146f372386e1889561cba33e95495e66ce97)
    (cherry picked from commit 648770bd6897aa2ec95df3ec55344d5803543f07)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.3

This issue was fixed in the openstack/nova 19.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.2.3

This issue was fixed in the openstack/nova 18.2.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/676285
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5a3430983ab37aaed6bfc7ead0eb14121ffe69d3
Submitter: Zuul
Branch: stable/queens

commit 5a3430983ab37aaed6bfc7ead0eb14121ffe69d3
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 9 17:17:45 2019 -0400

    rt: only map compute node if we created it

    If ComputeNode.create() fails, the update_available_resource
    periodic will not try to create it again because it will be
    mapped in the compute_nodes dict and _init_compute_node will
    return early but trying to save changes to that ComputeNode
    object later will fail because there is no id on the object,
    since we failed to create it in the DB.

    This simply reverses the logic such that we only map the
    compute node if we successfully created it.

    Some existing _init_compute_node testing had to be changed
    since it relied on the order of when the ComputeNode object
    is created and put into the compute_nodes dict in order
    to pass the object along to some much lower-level PCI
    tracker code, which was arguably mocking too deep for a unit
    test. That is changed to avoid the low-level mocking and
    assertions and just assert that _setup_pci_tracker is called
    as expected.

    NOTE(mriedem): The tests need to be modified slightly since
    Ia69fabce8e7fd7de101e291fe133c6f5f5f7056a is not in Queens
    so _copy_resources is wrapped to set ComputeNode.uuid.

    Change-Id: I9fa1d509a3de405d6246fb8670612c65c10cc93b
    Closes-Bug: #1839674
    (cherry picked from commit f578146f372386e1889561cba33e95495e66ce97)
    (cherry picked from commit 648770bd6897aa2ec95df3ec55344d5803543f07)
    (cherry picked from commit 35273a844ab2dc2494f0166d9b8228ee302acd4f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.13

This issue was fixed in the openstack/nova 17.0.13 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.opendev.org/676463
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b576461cd96c6bf707e77a2bd43fda3e486cc9e5
Submitter: Zuul
Branch: stable/pike

commit b576461cd96c6bf707e77a2bd43fda3e486cc9e5
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 9 17:17:45 2019 -0400

    rt: only map compute node if we created it

    If ComputeNode.create() fails, the update_available_resource
    periodic will not try to create it again because it will be
    mapped in the compute_nodes dict and _init_compute_node will
    return early but trying to save changes to that ComputeNode
    object later will fail because there is no id on the object,
    since we failed to create it in the DB.

    This simply reverses the logic such that we only map the
    compute node if we successfully created it.

    Some existing _init_compute_node testing had to be changed
    since it relied on the order of when the ComputeNode object
    is created and put into the compute_nodes dict in order
    to pass the object along to some much lower-level PCI
    tracker code, which was arguably mocking too deep for a unit
    test. That is changed to avoid the low-level mocking and
    assertions and just assert that _setup_pci_tracker is called
    as expected.

    Conflicts:
          nova/tests/unit/compute/test_resource_tracker.py

    NOTE(mriedem): The conflict and slight change to test
    test_compute_node_create_fail_retry_works is because
    I120a98cc4c11772f24099081ef3ac44a50daf71d is not in Pike.

    Change-Id: I9fa1d509a3de405d6246fb8670612c65c10cc93b
    Closes-Bug: #1839674
    (cherry picked from commit f578146f372386e1889561cba33e95495e66ce97)
    (cherry picked from commit 648770bd6897aa2ec95df3ec55344d5803543f07)
    (cherry picked from commit 35273a844ab2dc2494f0166d9b8228ee302acd4f)
    (cherry picked from commit 5a3430983ab37aaed6bfc7ead0eb14121ffe69d3)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova pike-eol

This issue was fixed in the openstack/nova pike-eol release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.