The update_available_resource() periodic task in the compute service fails with an exception.CPUPinningInvalid exception (and stops processing the rest of the instances) if there is an incoming migration (or resize or evacuation) that is in post-migrating state (finish_resize has not been executed yet) and the instance has CPU pinning.
Reproduce:
* build a multinode env with dedicated CPUs and CPU pinning configured
* configure update_available_resource to run frequently, just to ease reproduction of the race (e.g. set [DEFAULT]update_resources_interval = 10; see the configuration sketch after this list)
* create inst1 on the first node and inst2 on the second node, both requesting one pinned CPU
* check that inst1 is pinned to the same pCPU id on node1 as inst2 is on node2
* slow down the processing of finish_resize messages in the system to ease reproduction of the race (e.g. inject a sleep, load rabbit, etc.)
* migrate inst1 to node2
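A minimal configuration sketch for the setup steps above; the cpu_dedicated_set range and the flavor name are illustrative assumptions, only [DEFAULT]update_resources_interval = 10 comes from the report:

# nova.conf on both compute nodes
[DEFAULT]
# run the update_available_resource periodic every 10 seconds
update_resources_interval = 10

[compute]
# host CPUs reserved for pinned instances (the exact range is an assumption)
cpu_dedicated_set = 0-3

# flavor for inst1 and inst2, each requesting one dedicated (pinned) CPU
$ openstack flavor create pinned.tiny --ram 512 --disk 1 --vcpus 1
$ openstack flavor set pinned.tiny --property hw:cpu_policy=dedicated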
If you manage to hit the case where the periodic task runs on node2 just after the resize_claim of inst1 finished, but the finish_resize RPC call of inst1 has not been processed yet (the migration context is not applied to the instance and the migration is not in finished state but in post-migrating), then you will see a CPU pinning conflict. This is because the resource tracker already tracks the incoming instance [1] (the host and node are set in resize_instance already [2]) but the instance does not yet have the migration context applied (as that is only done in finish_resize [3]), so instance.numa_topology still points to the source topology.
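The conflict itself can be modeled outside Nova. The snippet below is a self-contained toy stand-in for nova.objects.numa.NUMACell.pin_cpus (the class name and the two-pCPU cell are assumptions, not Nova code); it shows why accounting inst2 first and then inst1's stale source topology, which still claims the same pCPU, raises the error seen in the traceback below:

# Toy model of the pinning conflict; not the real Nova object.
class FakeCell:
    def __init__(self, cpuset):
        self.free_cpus = set(cpuset)

    def pin_cpus(self, cpus):
        # Like the real pin_cpus: pinning a CPU that is not free fails.
        if not cpus <= self.free_cpus:
            raise ValueError(
                'CPU set to pin %s must be a subset of free CPU set %s'
                % (sorted(cpus), sorted(self.free_cpus)))
        self.free_cpus -= cpus

cell = FakeCell({0, 1})  # node2 with pCPUs 0 and 1
cell.pin_cpus({0})       # inst2, running on node2, is pinned to pCPU 0
cell.pin_cpus({0})       # inst1's unapplied source topology also claims
                         # pCPU 0 -> CPU pinning conflict, as in the trace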
Reproduced both in stable/victoria downstream and on latest master in an upstream devstack.
2021-12-06 15:07:18,013 ERROR [nova.compute.manager] Error updating resources for node compute2.
Traceback (most recent call last):
File "/root/rtox/nova/functional-py38/nova/compute/manager.py", line 10011, in _update_available_resource_for_node
self.rt.update_available_resource(context, nodename,
File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 895, in update_available_resource
self._update_available_resource(context, resources, startup=startup)
File "/root/rtox/nova/functional-py38/.tox/functional-py38/lib/python3.8/site-packages/oslo_concurrency/lockutils.py", line 391, in inner
return f(*args, **kwargs)
File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 936, in _update_available_resource
instance_by_uuid = self._update_usage_from_instances(
File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 1500, in _update_usage_from_instances
self._update_usage_from_instance(context, instance, nodename)
File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 1463, in _update_usage_from_instance
self._update_usage(self._get_usage_dict(instance, instance),
File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 1268, in _update_usage
cn.numa_topology = hardware.numa_usage_from_instance_numa(
File "/root/rtox/nova/functional-py38/nova/virt/hardware.py", line 2382, in numa_usage_from_instance_numa
new_cell.pin_cpus(pinned_cpus)
File "/root/rtox/nova/functional-py38/nova/objects/numa.py", line 95, in pin_cpus
raise exception.CPUPinningInvalid(requested=list(cpus),
nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
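While the window is open, the migration record of inst1 stays in post-migrating instead of finished. Assuming a recent enough python-openstackclient (the --server filter is not available in very old releases), this can be observed with:

$ openstack server migration list --server inst1
(the Status column shows post-migrating until finish_resize is processed)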
[1] https://github.com/openstack/nova/blob/7670303aabe16d1d7c25e411d7bd413aee7fdcf3/nova/compute/resource_tracker.py#L928-L929
[2] https://github.com/openstack/nova/blob/7670303aabe16d1d7c25e411d7bd413aee7fdcf3/nova/compute/manager.py#L5639-L5653
[3] https://github.com/openstack/nova/blob/7670303aabe16d1d7c25e411d7bd413aee7fdcf3/nova/compute/manager.py#L5780