the pinned_cpus inventory of a compute node gets out of sync if placement returns HTTP 500 for a GET /resource_providers/<uuid> query

Bug #1917055 reported by Balazs Gibizer
This bug report is a duplicate of: Bug #1729621: Inconsistent value for vcpu_used.
Affects                    Status   Importance  Assigned to      Milestone
OpenStack Compute (nova)   Invalid  Undecided   Unassigned
Pike                       Triaged  Medium      Balazs Gibizer

Bug Description

This issue was detected in a Pike deployment with triplicated control plane services. There were 3 placement services running behind a load balancer in the cluster. One of the placement services was misconfigured, so every request forwarded to it resulted in HTTP 500, while the other placement services worked fine.
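As an illustration (not part of the original report), a minimal sketch of how such an intermittent failure can be observed, assuming direct access to an individual placement backend; the URL and token below are placeholders:

import requests

# Assumptions: PLACEMENT_URL points at one specific placement backend (bypassing
# the load balancer) and TOKEN is a valid keystone token; both are placeholders.
PLACEMENT_URL = "http://placement-backend-1.example.com"
TOKEN = "<keystone-token>"
# Compute node resource provider UUID taken from the traceback below.
RP_UUID = "5a87bae9-81b4-4a00-bb80-bc0b378b4471"

resp = requests.get(
    "%s/resource_providers/%s" % (PLACEMENT_URL, RP_UUID),
    headers={"X-Auth-Token": TOKEN},
)
# The misconfigured backend answered HTTP 500, while the healthy ones answered 200.
print(resp.status_code)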

The cluster was actively used to boot VMs with flavors having the hw:cpu_policy='dedicated' extra spec (i.e. with CPU pinning).

In the following unlucky situation the pinned_cpus information in compute_node.numa_topology became empty even though there were VMs running on the compute host with pinned CPUs. This allowed the next VM boot to select a PCPU that was already used by another VM on the same compute.

Sequence of events:

1) A VM booted on the compute selected PCPU 18, and the numa_topology pinned_cpus was updated with PCPU id 18.
2) A periodic update_available_resource job started but failed to finish because placement responded with HTTP 500 when the resource tracker tried to read the compute RP from placement (see the traceback at the end of the bug report).
3) At this point the pinned_cpus field of compute_node.numa_topology had already been reset to [] and saved to the DB. As the periodic job was interrupted, the PCPU usage was never updated in the compute_node during this periodic run.
4) A new VM boot request was received, saw that every PCPU on the compute was free, and selected PCPU 18 for the new VM.
5) The next update_available_resource periodic job detected that there are overlapping pinned CPUs, but it had no way to heal it, so it just logged an error (see below and the sketch after this list).
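
A minimal, self-contained sketch (simplified, not the actual nova.objects.numa code) of the subset check that step 5) runs into: once the first VM's usage has pinned PCPU 18, pinning it again for the second VM cannot pass the "must be a subset of free CPUs" validation, so the periodic job can only report the conflict.

class FakeNumaCell(object):
    def __init__(self, cpuset):
        self.cpuset = set(cpuset)
        self.pinned_cpus = set()

    @property
    def free_cpus(self):
        return self.cpuset - self.pinned_cpus

    def pin_cpus(self, cpus):
        cpus = set(cpus)
        if not cpus.issubset(self.free_cpus):
            # Mirrors the CPUPinningInvalid error in the trace below.
            raise ValueError(
                "CPU set to pin %s must be a subset of free CPU set %s"
                % (sorted(cpus), sorted(self.free_cpus)))
        self.pinned_cpus |= cpus


cell = FakeNumaCell(cpuset=range(0, 24))
cell.pin_cpus({18})        # usage of the first VM
try:
    cell.pin_cpus({18})    # usage of the second VM that was placed on the same PCPU
except ValueError as exc:
    print(exc)             # the periodic job can only log this conflict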

Triage on stable/pike:

The placement failure happened at [1] during the periodic job. But before that, pinned_cpus had already been reset [2] and saved to the DB [3] by the periodic job triggering an early rt._update() at [6][7]. In the normal case, after [1] there is a whole bunch of usage calculation based on the running instances [4] that re-populates pinned_cpus, and then there is another compute node DB save [5] that persists the final inventory. So if the process is interrupted at [1], only the available resources are saved to the DB but the usage is not.

[1] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L906
[2] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L654
[3] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L901
[4] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L768
[5] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L800
[6] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L751
[7] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L586
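
To make the ordering easier to follow, here is a rough sketch of the flow described above (names and structure are illustrative only, not the real resource tracker code): the compute node record is saved with an empty pinned_cpus before placement is contacted, and the usage from the running instances is only added back in a later step that never runs if placement raises.

class PlacementDown(Exception):
    """Stands in for ResourceProviderRetrievalFailed in this sketch."""


def save_to_db(compute_node):
    # Stands in for the ComputeNode save calls at [3] and [5].
    print("saved pinned_cpus=%s" % sorted(compute_node["pinned_cpus"]))


def periodic_update(compute_node, placement_ok, running_instances):
    # [2] the inventory is reset: pinned_cpus becomes empty
    compute_node["pinned_cpus"] = set()

    # [6][7] -> [3] the empty inventory is saved to the DB early ...
    save_to_db(compute_node)
    # ... [1] and placement is queried; an HTTP 500 from
    # GET /resource_providers/<uuid> surfaces here as an exception
    if not placement_ok:
        raise PlacementDown("Failed to get resource provider")

    # [4] usage is recomputed from the running instances ...
    for inst in running_instances:
        compute_node["pinned_cpus"] |= inst["pinned_cpus"]

    # [5] ... and only now is the final, correct inventory saved
    save_to_db(compute_node)


cn = {"pinned_cpus": {18}}
instances = [{"pinned_cpus": {18}}]
try:
    periodic_update(cn, placement_ok=False, running_instances=instances)
except PlacementDown:
    pass
print(cn["pinned_cpus"])  # set(): the usage of the running VM was lost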

Stack trace at 2)

2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager Traceback (most recent call last):
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 7090, in update_available_resource_for_node
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager periodic=True)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 754, in update_available_resource
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager periodic=periodic)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in inner
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager return f(*args, **kwargs)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 793, in _update_available_resource
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager self._init_compute_node(context, resources)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 611, in _init_compute_node
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager self._update(context, cn)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 960, in _update
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager inv_data,
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 68, in set_inventory_for_provider
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager inv_data,
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager return getattr(self.instance, __name)(*args, **kwargs)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/report.py", line 797, in set_inventory_for_provider
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager self._ensure_resource_provider(rp_uuid, rp_name)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/report.py", line 522, in _ensure_resource_provider
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager rp = self._get_resource_provider(uuid)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/report.py", line 56, in wrapper
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager return f(self, *a, **k)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/report.py", line 423, in _get_resource_provider
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager raise exception.ResourceProviderRetrievalFailed(uuid=uuid)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager ResourceProviderRetrievalFailed: Failed to get resource provider with UUID 5a87bae9-81b4-4a00-bb80-bc0b378b4471
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager

Stack trace at 5)

2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager Traceback (most recent call last):
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 7090, in update_available_resource_for_node
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager periodic=True)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 754, in update_available_resource
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager periodic=periodic)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in inner
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager return f(*args, **kwargs)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 816, in _update_available_resource
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager startup=startup)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1284, in _update_usage_from_instances
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager require_allocation_refresh=require_allocation_refresh)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1170, in _update_usage_from_instance
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager sign=sign)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 997, in _update_usage
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager cn, usage, free)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/hardware.py", line 1808, in get_host_numa_usage_from_instance
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager host_numa_topology, instance_numa_topology, free=free))
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/hardware.py", line 1657, in numa_usage_from_instances
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager newcell.pin_cpus(pinned_cpus)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/objects/numa.py", line 91, in pin_cpus
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager self.pinned_cpus))
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager CPUPinningInvalid: CPU set to pin [18] must be a subset of free CPU set [2, 3, 4, 5, 6, 7, 10, 11, 15, 16, 17, 19, 42, 43, 44, 45, 46, 47, 50, 51, 55, 56, 57, 58, 59]

description: updated
tags: added: resource-tracker
Balazs Gibizer (balazs-gibizer) wrote :

Hm, it seems to be very similar to https://bugs.launchpad.net/nova/+bug/1729621, so marking this as a duplicate.

Changed in nova:
status: New → Invalid
Balazs Gibizer (balazs-gibizer) wrote :

I've verified the bugfix backport https://review.opendev.org/c/openstack/nova/+/612295 on the env that was previously used to reproduce the problem, and the fix works.
