OpenStack Compute (nova)

Activity log for bug #1953359

Date	Who	What changed	Old value	New value	Message
2021-12-06 15:28:43	Balazs Gibizer	bug			added bug
2021-12-06 15:29:01	Balazs Gibizer	tags		numa
2021-12-06 15:29:11	Balazs Gibizer	tags	numa	compute numa resource-tracker
2021-12-06 15:29:16	Balazs Gibizer	tags	compute numa resource-tracker	compute numa resize resource-tracker
2021-12-06 15:36:48	Balazs Gibizer	nova: assignee		Balazs Gibizer (balazs-gibizer)
2021-12-06 16:15:54	OpenStack Infra	nova: status	New	In Progress
2021-12-06 16:28:27	Balazs Gibizer	nominated for series		nova/xena
2021-12-06 16:28:27	Balazs Gibizer	bug task added		nova/xena
2021-12-06 16:28:27	Balazs Gibizer	nominated for series		nova/wallaby
2021-12-06 16:28:27	Balazs Gibizer	bug task added		nova/wallaby
2021-12-06 16:28:27	Balazs Gibizer	nominated for series		nova/victoria
2021-12-06 16:28:27	Balazs Gibizer	bug task added		nova/victoria
2021-12-06 16:28:36	Balazs Gibizer	nova/victoria: status	New	In Progress
2021-12-06 16:28:39	Balazs Gibizer	nova/wallaby: status	New	In Progress
2021-12-06 16:28:45	Balazs Gibizer	nova/wallaby: assignee		Balazs Gibizer (balazs-gibizer)
2021-12-06 16:28:54	Balazs Gibizer	nova/wallaby: importance	Undecided	Medium
2021-12-06 16:29:04	Balazs Gibizer	nova/xena: importance	Undecided	Medium
2021-12-06 16:29:06	Balazs Gibizer	nova: importance	Undecided	Medium
2021-12-06 16:29:12	Balazs Gibizer	nova/victoria: importance	Undecided	Medium
2021-12-06 16:29:34	Balazs Gibizer	nova/victoria: assignee		Balazs Gibizer (balazs-gibizer)
2021-12-06 16:29:36	Balazs Gibizer	nova/xena: assignee		Balazs Gibizer (balazs-gibizer)
2021-12-07 08:30:41	Balazs Gibizer	marked as duplicate		1952915
2021-12-15 13:33:04	OpenStack Infra	nova: status	In Progress	Fix Released
2021-12-16 08:25:49	OpenStack Infra	nova/xena: status	New	In Progress
2022-01-07 17:29:10	OpenStack Infra	tags	compute numa resize resource-tracker	compute in-stable-xena numa resize resource-tracker
2022-01-10 22:14:03	OpenStack Infra	nova/xena: status	In Progress	Fix Committed
2022-01-12 17:46:51	OpenStack Infra	tags	compute in-stable-xena numa resize resource-tracker	compute in-stable-wallaby in-stable-xena numa resize resource-tracker
2022-02-04 15:43:17	OpenStack Infra	nova/wallaby: status	In Progress	Fix Committed
2022-02-11 02:06:02	OpenStack Infra	tags	compute in-stable-wallaby in-stable-xena numa resize resource-tracker	compute in-stable-victoria in-stable-wallaby in-stable-xena numa resize resource-tracker
2022-03-02 19:59:05	OpenStack Infra	nova/victoria: status	In Progress	Fix Committed
2022-03-10 12:39:04	OpenStack Infra	nova/victoria: status	Fix Committed	Fix Released
2022-03-10 12:51:27	OpenStack Infra	nova/wallaby: status	Fix Committed	Fix Released
2022-03-10 12:51:36	OpenStack Infra	nova/xena: status	Fix Committed	Fix Released
2022-06-15 03:29:06	OpenStack Infra	tags	compute in-stable-victoria in-stable-wallaby in-stable-xena numa resize resource-tracker	compute in-stable-ussuri in-stable-victoria in-stable-wallaby in-stable-xena numa resize resource-tracker
2024-06-21 15:40:43	Rodrigo Barbieri	summary	update_available_resource periodic fails with exception.CPUPinningInvalid if there is incoming post-migrating migration with cpu pinning	[SRU] update_available_resource periodic fails with exception.CPUPinningInvalid if there is incoming post-migrating migration with cpu pinning
2024-06-21 15:41:38	Rodrigo Barbieri	description	The update_available_resource() periodic task in the compute fails with an exception.CPUPinningInvalid exception (and stops processing the rest of the instances) if there is an incoming migration (or resize or evacuation) that is in post-migrating state (finish_resize not yet executed) and the instance has CPU pinning.
Reproduce:
* build a multinode env with dedicated CPUs and CPU pinning configured
* configure update_available_resource to run frequently (just to ease the reproduction of the race), e.g. set [DEFAULT]update_resources_interval = 10
* create inst1 on the first node and inst2 on the second node, both requesting one pinned CPU
* check that inst1 is pinned to the same pcpu id on node1 as inst2 on node2
* slow down the processing of finish_resize messages in the system to ease the reproduction of the race (e.g. inject a sleep or load rabbit)
* migrate inst1 to node2
If you manage to hit the case where the periodic runs on node2 just after the resize_claim of inst1 has finished but before the finish_resize RPC call of inst1 is processed (the migration context is not applied to the instance and the migration is not in finished state but in post-migrating), then you will see a CPU pinning conflict. This is because the resource tracker already tracks the incoming instance [1] (the host and node are already set in resize_instance [2]) but the instance does not yet have the migration context applied (that is only done in finish_resize [3]), so instance.numa_topology still points to the source topology.
Reproduced both in stable/victoria downstream and in latest master in an upstream devstack.
2021-12-06 15:07:18,013 ERROR [nova.compute.manager] Error updating resources for node compute2.
Traceback (most recent call last):
  File "/root/rtox/nova/functional-py38/nova/compute/manager.py", line 10011, in _update_available_resource_for_node
    self.rt.update_available_resource(context, nodename,
  File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 895, in update_available_resource
    self._update_available_resource(context, resources, startup=startup)
  File "/root/rtox/nova/functional-py38/.tox/functional-py38/lib/python3.8/site-packages/oslo_concurrency/lockutils.py", line 391, in inner
    return f(*args, **kwargs)
  File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 936, in _update_available_resource
    instance_by_uuid = self._update_usage_from_instances(
  File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 1500, in _update_usage_from_instances
    self._update_usage_from_instance(context, instance, nodename)
  File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 1463, in _update_usage_from_instance
    self._update_usage(self._get_usage_dict(instance, instance),
  File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 1268, in _update_usage
    cn.numa_topology = hardware.numa_usage_from_instance_numa(
  File "/root/rtox/nova/functional-py38/nova/virt/hardware.py", line 2382, in numa_usage_from_instance_numa
    new_cell.pin_cpus(pinned_cpus)
  File "/root/rtox/nova/functional-py38/nova/objects/numa.py", line 95, in pin_cpus
    raise exception.CPUPinningInvalid(requested=list(cpus),
nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
[1] https://github.com/openstack/nova/blob/7670303aabe16d1d7c25e411d7bd413aee7fdcf3/nova/compute/resource_tracker.py#L928-L929
[2] https://github.com/openstack/nova/blob/7670303aabe16d1d7c25e411d7bd413aee7fdcf3/nova/compute/manager.py#L5639-L5653
[3] https://github.com/openstack/nova/blob/7670303aabe16d1d7c25e411d7bd413aee7fdcf3/nova/compute/manager.py#L5780	* SRU TEMPLATE IS THE SAME AS 
https://bugs.launchpad.net/nova/+bug/1944759 AS BOTH FIXES COMPLEMENT EACH OTHER AND ARE BEING SRU'ed TOGETHER *
The update_available_resource() periodic task in the compute fails with an exception.CPUPinningInvalid exception (and stops processing the rest of the instances) if there is an incoming migration (or resize or evacuation) that is in post-migrating state (finish_resize not yet executed) and the instance has CPU pinning.
Reproduce:
* build a multinode env with dedicated CPUs and CPU pinning configured
* configure update_available_resource to run frequently (just to ease the reproduction of the race), e.g. set [DEFAULT]update_resources_interval = 10
* create inst1 on the first node and inst2 on the second node, both requesting one pinned CPU
* check that inst1 is pinned to the same pcpu id on node1 as inst2 on node2
* slow down the processing of finish_resize messages in the system to ease the reproduction of the race (e.g. inject a sleep or load rabbit)
* migrate inst1 to node2
If you manage to hit the case where the periodic runs on node2 just after the resize_claim of inst1 has finished but before the finish_resize RPC call of inst1 is processed (the migration context is not applied to the instance and the migration is not in finished state but in post-migrating), then you will see a CPU pinning conflict. This is because the resource tracker already tracks the incoming instance [1] (the host and node are already set in resize_instance [2]) but the instance does not yet have the migration context applied (that is only done in finish_resize [3]), so instance.numa_topology still points to the source topology.
Reproduced both in stable/victoria downstream and in latest master in an upstream devstack.
2021-12-06 15:07:18,013 ERROR [nova.compute.manager] Error updating resources for node compute2.
Traceback (most recent call last):
  File "/root/rtox/nova/functional-py38/nova/compute/manager.py", line 10011, in _update_available_resource_for_node
    self.rt.update_available_resource(context, nodename,
  File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 895, in update_available_resource
    self._update_available_resource(context, resources, startup=startup)
  File "/root/rtox/nova/functional-py38/.tox/functional-py38/lib/python3.8/site-packages/oslo_concurrency/lockutils.py", line 391, in inner
    return f(*args, **kwargs)
  File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 936, in _update_available_resource
    instance_by_uuid = self._update_usage_from_instances(
  File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 1500, in _update_usage_from_instances
    self._update_usage_from_instance(context, instance, nodename)
  File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 1463, in _update_usage_from_instance
    self._update_usage(self._get_usage_dict(instance, instance),
  File "/root/rtox/nova/functional-py38/nova/compute/resource_tracker.py", line 1268, in _update_usage
    cn.numa_topology = hardware.numa_usage_from_instance_numa(
  File "/root/rtox/nova/functional-py38/nova/virt/hardware.py", line 2382, in numa_usage_from_instance_numa
    new_cell.pin_cpus(pinned_cpus)
  File "/root/rtox/nova/functional-py38/nova/objects/numa.py", line 95, in pin_cpus
    raise exception.CPUPinningInvalid(requested=list(cpus),
nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
[1] https://github.com/openstack/nova/blob/7670303aabe16d1d7c25e411d7bd413aee7fdcf3/nova/compute/resource_tracker.py#L928-L929
[2] https://github.com/openstack/nova/blob/7670303aabe16d1d7c25e411d7bd413aee7fdcf3/nova/compute/manager.py#L5639-L5653
[3] https://github.com/openstack/nova/blob/7670303aabe16d1d7c25e411d7bd413aee7fdcf3/nova/compute/manager.py#L5780
2024-07-03 11:15:21	Rodrigo Barbieri	tags	compute in-stable-ussuri in-stable-victoria in-stable-wallaby in-stable-xena numa resize resource-tracker	compute in-stable-ussuri in-stable-victoria in-stable-wallaby in-stable-xena numa resize resource-tracker sts-sru-needed
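
The pinning conflict in the bug description can be illustrated with a toy model. This is only a sketch under stated assumptions, not nova code: ToyCell, track_usage, and the local CPUPinningInvalid class are hypothetical stand-ins that mimic the subset check the traceback points at in pin_cpus. It assumes node2 has two dedicated pCPUs (0 and 1), inst2 is pinned to pCPU 0 there, and inst1 was pinned to pCPU 0 on node1 and claimed pCPU 1 on node2 in its migration context.

```python
class CPUPinningInvalid(Exception):
    """Toy stand-in for nova.exception.CPUPinningInvalid (hypothetical)."""


class ToyCell:
    """Hypothetical, simplified model of one host NUMA cell."""

    def __init__(self, pcpus):
        self.pcpus = set(pcpus)   # dedicated pCPUs on the host
        self.pinned = set()       # pCPUs already pinned by tracked instances

    @property
    def free(self):
        return self.pcpus - self.pinned

    def pin_cpus(self, cpus):
        cpus = set(cpus)
        # Same kind of check the traceback ends in: the requested CPUs
        # must all still be free on this cell.
        if not cpus <= self.free:
            raise CPUPinningInvalid(
                "CPU set to pin %s must be a subset of free CPU set %s"
                % (sorted(cpus), sorted(self.free)))
        self.pinned |= cpus


def track_usage(host_pcpus, instance_pins):
    """Rebuild host usage from the pinning of every tracked instance."""
    cell = ToyCell(host_pcpus)
    for pins in instance_pins:
        cell.pin_cpus(pins)
    return cell


# Periodic run on node2 before finish_resize: inst1 is already tracked on
# node2, but its topology is still the source one (pCPU 0), which collides
# with inst2 (pCPU 0).
try:
    track_usage({0, 1}, [{0}, {0}])
except CPUPinningInvalid as exc:
    print(exc)

# Once the migration context is applied, inst1 uses the pCPU claimed on
# node2 (pCPU 1) and the usage rebuild succeeds.
cell = track_usage({0, 1}, [{0}, {1}])
print(sorted(cell.pinned))
```

In this toy scenario the first call fails with the same message as the log in the description, because the stale source topology asks for a pCPU that inst2 already holds on the destination.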