`socket` PCI NUMA policy doesn't work if another instance with a NUMA topology is booted first on the same host

Bug #1995153 reported by Artom Lifshitz
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Undecided
Unassigned

Bug Description

Disclaimer: I haven't reproduced this in a functional test, but based on the traceback gathered from a real environment, and on the fact that the proposed fix actually fixes the issue, I believe my theory is correct.

Description
===========
`socket` PCI NUMA policy doesn't work if another instance with a NUMA topology is booted first on the same host

Steps to reproduce
==================
1. Boot any instance with a NUMA topology.
2. Boot an instance with the `socket` PCI NUMA policy on the same host.

Expected result
===============
`socket` instance boots.

Actual result
=============
Instance creation fails with:

Details: Fault: {'code': 500, 'created': '2022-10-28T20:17:31Z', 'message': 'NotImplementedError'}. Server boot request ID: req-e3fd15d7-fb79-440f-b2f3-e6b2a5505e56.
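The opaque `NotImplementedError` is the generic lazy-load failure from nova's versioned objects: when a field like `socket` was never set on a NUMACell, accessing it falls through to the base class's `obj_load_attr()`, which just raises. A minimal sketch (not nova code, class and field names are illustrative) of that mechanism:

```python
# Sketch of how an unset field on a versioned-object-style class
# surfaces as NotImplementedError: the base lazy-load hook is the
# fallback when a field was never set, and it simply raises.

class FakeVersionedObject:
    """Mimics the lazy-load fallback of oslo.versionedobjects."""

    fields = ('id', 'socket')

    def __init__(self, **kwargs):
        # Only the fields passed in are actually set.
        for name, value in kwargs.items():
            setattr(self, '_' + name, value)

    def obj_load_attr(self, name):
        # The real base class raises here because it has no way to
        # load the attribute from the database on its own.
        raise NotImplementedError(
            "Cannot load '%s' in the base class" % name)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name in self.fields:
            try:
                return object.__getattribute__(self, '_' + name)
            except AttributeError:
                self.obj_load_attr(name)
        raise AttributeError(name)


cell = FakeVersionedObject(id=0)  # 'socket' was never set
try:
    cell.socket
except NotImplementedError as exc:
    print(exc)  # Cannot load 'socket' in the base class
```

This matches the error named in the fix's commit message ("Cannot load 'socket' in the base class"); only the 500 fault code reaches the API user.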

Environment
===========
Originally reported as part of QE verification of [1], so stable/wallaby.

Additional info
===============
Playing around with the whitebox test for the socket policy [2] on a wallaby deployment, I noticed that the `socket` field in the compute.numa_topology column was being switched to `null` then back to its correct value (0 or 1).

I added logging of the stack trace to the resource tracker's _update() method right before it calls compute_node.save(), and found that `null` was getting saved when an instance was being booted or deleted. Example of such a traceback:

File "/usr/lib/python3.9/site-packages/nova/utils.py", line 686, in context_wrapper
    func(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2126, in _locked_do_build_and_run_instance
    result = self._do_build_and_run_instance(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/nova/exception_wrapper.py", line 63, in wrapped
    return f(self, context, *args, **kw)
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 154, in decorated_function
    return function(self, context, *args, **kwargs)
File "/usr/lib/python3.9/site-packages/nova/compute/utils.py", line 1434, in decorated_function
    return function(self, context, *args, **kwargs)
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 200, in decorated_function
    return function(self, context, *args, **kwargs)
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2232, in _do_build_and_run_instance
    self._build_and_run_instance(context, instance, image,
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2383, in _build_and_run_instance
    with self.rt.instance_claim(context, instance, node, allocs,
File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 197, in instance_claim
    self._update(elevated, cn)
File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 1247, in _update
    LOG.debug('artom: %s', traceback.format_stack())

Similarly for delete:

2022-10-28 21:57:27.091 2 DEBUG nova.compute.resource_tracker [req-c9fa718c-983e-416c-bc87-9564b8747294 d6d16a793ab74fe6a0b5594d037d3165 599a6777a45d46a09a7e233a926b7675 - default default] artom:
File "/usr/lib/python3.9/site-packages/eventlet/greenpool.py", line 88, in _spawn_n_impl
    func(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/futurist/_green.py", line 71, in __call__
    self.work.run()
File "/usr/lib/python3.9/site-packages/futurist/_utils.py", line 49, in run
    result = self.fn(*self.args, **self.kwargs)
File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
    res = self.dispatcher.dispatch(message)
File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
    result = func(ctxt, **new_args)
File "/usr/lib/python3.9/site-packages/nova/exception_wrapper.py", line 63, in wrapped
    return f(self, context, *args, **kw)
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 154, in decorated_function
    return function(self, context, *args, **kwargs)
File "/usr/lib/python3.9/site-packages/nova/compute/utils.py", line 1434, in decorated_function
    return function(self, context, *args, **kwargs)
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 200, in decorated_function
    return function(self, context, *args, **kwargs)
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 3072, in terminate_instance
    do_terminate_instance(instance, bdms)
File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 3060, in do_terminate_instance
    self._delete_instance(context, instance, bdms)
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 3024, in _delete_instance
    self._complete_deletion(context, instance)
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 828, in _complete_deletion
    self._update_resource_tracker(context, instance)
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 596, in _update_resource_tracker
    self.rt.update_usage(context, instance, instance.node)
File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 658, in update_usage
    self._update(context.elevated(), self.compute_nodes[nodename])
File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 1247, in _update
    LOG.debug('artom: %s', traceback.format_stack())

On the other hand, the resource tracker's periodic resource update task was saving the socket correctly:

2022-10-28 21:57:59.794 2 DEBUG nova.compute.resource_tracker [req-31329b8b-0de4-4b30-b2a1-dcd4d62369b4 - - - - -] artom:
File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/oslo_service/loopingcall.py", line 150, in _run_loop
    result = func(*self.args, **self.kw)
File "/usr/lib/python3.9/site-packages/nova/service.py", line 307, in periodic_tasks
    return self.manager.periodic_tasks(ctxt, raise_on_error=raise_on_error)
File "/usr/lib/python3.9/site-packages/nova/manager.py", line 104, in periodic_tasks
    return self.run_periodic_tasks(context, raise_on_error=raise_on_error)
File "/usr/lib/python3.9/site-packages/oslo_service/periodic_task.py", line 216, in run_periodic_tasks
    task(self, context)
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 10026, in update_available_resource
    self._update_available_resource_for_node(context, nodename,
File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 9935, in _update_available_resource_for_node
    self.rt.update_available_resource(context, nodename,
File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 896, in update_available_resource
    self._update_available_resource(context, resources, startup=startup)
File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 1003, in _update_available_resource
    self._update(context, cn, startup=startup)
File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 1247, in _update
    LOG.debug('artom: %s', traceback.format_stack())

Not included above, for brevity, is the log line showing what was actually being saved; you'll just have to trust me on this one ;)

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1883554
[2] https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/851447
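The stack-trace logging used in the investigation above (the `artom:` debug lines) can be sketched with just the stdlib. `save_compute_node` here is a hypothetical stand-in; the real patch was a temporary `LOG.debug` in nova's resource tracker `_update()`, right before `compute_node.save()`:

```python
# Sketch of the debugging technique: log the full call stack at a
# chosen point so you can tell which code path reached it.
import logging
import traceback

logging.basicConfig(level=logging.DEBUG)
LOG = logging.getLogger(__name__)


def save_compute_node(numa_topology):
    # traceback.format_stack() returns the frames that led here; in
    # the bug investigation this is how the instance claim/delete
    # paths were distinguished from the periodic update task.
    LOG.debug('saving numa_topology=%s, called from:\n%s',
              numa_topology, ''.join(traceback.format_stack()))


def instance_claim():
    # Stand-in for the buggy path that saved socket=null.
    save_compute_node({'socket': None})


instance_claim()
```

Because every caller funnels through the same save, logging the stack at that one choke point is enough to attribute each bad write to its originating RPC or periodic task.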

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/862964

Changed in nova:
status: New → In Progress
summary: - `socket` PCI NUMA policy doesn't work if another instance is booted
- first on the same host
+ `socket` PCI NUMA policy doesn't work if another instance with a NUMA
+ topology is booted first on the same host
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/862967

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/862967
Committed: https://opendev.org/openstack/nova/commit/63d6ecd99b7dec06cf0cf8358b43b0d8fa607504
Submitter: "Zuul (22348)"
Branch: master

commit 63d6ecd99b7dec06cf0cf8358b43b0d8fa607504
Author: Artom Lifshitz <email address hidden>
Date: Fri Oct 28 19:42:21 2022 -0400

    Reproduce bug 1995153

    If we first boot an instance with NUMA topology on a host, any
    subsequent attempts to boot instances with the `socket` PCI NUMA
    policy will fail with `Cannot load 'socket' in the base class`.
    Demonstrate this in a functional test.

    Change-Id: I63f4e3dfa38f65b73d0051b8e52b1abd0f027e9b
    Related-bug: 1995153

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/nova/+/862964
Committed: https://opendev.org/openstack/nova/commit/04ebae9dc01ebd24552b5aacd1a0f8b129013a9e
Submitter: "Zuul (22348)"
Branch: master

commit 04ebae9dc01ebd24552b5aacd1a0f8b129013a9e
Author: Artom Lifshitz <email address hidden>
Date: Fri Oct 28 18:09:35 2022 -0400

    Save cell socket correctly when updating host NUMA topology

    Previously, in numa_usage_from_instance_numa(), any new NUMACell
    objects we created did not have the `socket` attribute. In some cases
    this was persisted all the way down to the database. Fix this by
    copying `socket` from the old_cell.

    Change-Id: I9ed3c31ccd3220b02d951fc6dbc5ea049a240a68
    Closes-Bug: 1995153
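The fix pattern described in the commit message can be sketched as follows. `SimpleCell` and `updated_cell_usage` are illustrative stand-ins; the real code operates on nova's NUMACell versioned objects inside numa_usage_from_instance_numa():

```python
# Sketch of the fix: when deriving a new per-NUMA-cell usage object
# from an existing cell, carry the 'socket' value over instead of
# leaving it unset.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleCell:
    id: int
    cpu_usage: int
    socket: Optional[int] = None


def updated_cell_usage(old_cell: SimpleCell, cpus_used: int) -> SimpleCell:
    return SimpleCell(
        id=old_cell.id,
        cpu_usage=old_cell.cpu_usage + cpus_used,
        # The bug: omitting this line left 'socket' unset on the new
        # cell, and that null was persisted to compute.numa_topology,
        # breaking later 'socket' PCI NUMA policy scheduling.
        socket=old_cell.socket,
    )


old = SimpleCell(id=0, cpu_usage=2, socket=1)
new = updated_cell_usage(old, cpus_used=2)
assert new.socket == 1
```

The same copy-the-field-forward pattern applies to any derived object that must round-trip through persistence: fields not touched by the update still have to be carried over explicitly.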

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/882313

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/882314

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/nova/+/882315

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/nova/+/882316

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/882317

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/882318

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/882319

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/882320

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/882321

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/882322

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882313
Committed: https://opendev.org/openstack/nova/commit/29e3f2f2ab69157d938cfe6895df352ef9a08d8c
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 29e3f2f2ab69157d938cfe6895df352ef9a08d8c
Author: Artom Lifshitz <email address hidden>
Date: Fri Oct 28 19:42:21 2022 -0400

    Reproduce bug 1995153

    If we first boot an instance with NUMA topology on a host, any
    subsequent attempts to boot instances with the `socket` PCI NUMA
    policy will fail with `Cannot load 'socket' in the base class`.
    Demonstrate this in a functional test.

    Change-Id: I63f4e3dfa38f65b73d0051b8e52b1abd0f027e9b
    Related-bug: 1995153
    (cherry picked from commit 63d6ecd99b7dec06cf0cf8358b43b0d8fa607504)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/nova/+/882314
Committed: https://opendev.org/openstack/nova/commit/acb511652c1afb8253c66c29ca10f790f035229e
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit acb511652c1afb8253c66c29ca10f790f035229e
Author: Artom Lifshitz <email address hidden>
Date: Fri Oct 28 18:09:35 2022 -0400

    Save cell socket correctly when updating host NUMA topology

    Previously, in numa_usage_from_instance_numa(), any new NUMACell
    objects we created did not have the `socket` attribute. In some cases
    this was persisted all the way down to the database. Fix this by
    copying `socket` from the old_cell.

    Change-Id: I9ed3c31ccd3220b02d951fc6dbc5ea049a240a68
    Closes-Bug: 1995153
    (cherry picked from commit 04ebae9dc01ebd24552b5aacd1a0f8b129013a9e)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 27.1.0

This issue was fixed in the openstack/nova 27.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 28.0.0.0rc1

This issue was fixed in the openstack/nova 28.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 26.2.1

This issue was fixed in the openstack/nova 26.2.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/yoga)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/882317
Reason: stable/yoga branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/yoga if you want to further work on this patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/882318
Reason: stable/yoga branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/yoga if you want to further work on this patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/wallaby)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/882322
Reason: stable/wallaby branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/wallaby if you want to further work on this patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/882321
Reason: stable/wallaby branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/wallaby if you want to further work on this patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/xena)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/882320
Reason: stable/xena branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/xena if you want to further work on this patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/882319
Reason: stable/xena branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/xena if you want to further work on this patch.
