Activity log for bug #1944759

Date Who What changed Old value New value Message
2021-09-23 18:11:52 Balazs Gibizer bug added bug
2021-09-23 18:13:34 Balazs Gibizer nova: assignee Balazs Gibizer (balazs-gibizer)
2021-09-23 18:13:37 Balazs Gibizer nova: importance Undecided → Medium
2021-09-23 18:14:04 Balazs Gibizer description
  Old value:
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I will push a reproduction test soon.
  New value:
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I will push a reproduction test soon.
    It is reproducible at least on master, xena, wallaby, and victoria.
2021-09-23 18:14:56 Balazs Gibizer tags compute numa race-condition resize
2021-09-23 18:15:20 Balazs Gibizer description
  Old value:
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I will push a reproduction test soon.
    It is reproducible at least on master, xena, wallaby, and victoria.
  New value:
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I've pushed a reproduction test: https://review.opendev.org/c/openstack/nova/+/810763
    It is reproducible at least on master, xena, wallaby, and victoria.
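The race described in the two description entries above can be illustrated with a small, self-contained sketch. This is plain Python, not Nova code: HostCpuTracker, periodic_update() and unpin() are hypothetical stand-ins for the resource tracker, the update_available_resources periodic job and the CPU-unpinning path, used only to show how dropping the pins once leads to CPUUnpinningInvalid when confirm tries to free them again.

class CPUUnpinningInvalid(Exception):
    pass


class HostCpuTracker:
    """Toy stand-in for the per-host pinned-CPU bookkeeping."""

    def __init__(self, pinned):
        self.pinned = set(pinned)

    def periodic_update(self, tracked_instances):
        # Stand-in for update_available_resources: usage is rebuilt from the
        # instances/migrations the job can see on this host. If neither the
        # instance nor its migration is tracked here, its pins are dropped.
        rebuilt = set()
        for cpus in tracked_instances.values():
            rebuilt |= set(cpus)
        self.pinned = rebuilt

    def unpin(self, cpus):
        missing = set(cpus) - self.pinned
        if missing:
            raise CPUUnpinningInvalid(
                "CPU set to unpin %s must be a subset of pinned set %s"
                % (sorted(missing), sorted(self.pinned)))
        self.pinned -= set(cpus)


source = HostCpuTracker(pinned={0, 1})   # instance pinned to host CPUs 0 and 1

# Race window: finish_resize is running on the destination, but the source-side
# periodic job sees neither the instance nor the migration, so it silently
# drops the pins from the tracker (placement is untouched).
source.periodic_update(tracked_instances={})

# Later, confirm_resize tries to free the same CPUs on the source host:
try:
    source.unpin({0, 1})
except CPUUnpinningInvalid as exc:
    print(exc)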
2021-09-24 07:27:32 Sylvain Bauza nova: status New → Confirmed
2021-09-24 13:39:55 OpenStack Infra nova: status Confirmed → In Progress
2021-09-27 06:02:19 Mariusz Malek bug added subscriber Mariusz Malek
2021-09-29 07:32:16 Aleksander Wojtal bug added subscriber Aleksander Wojtal
2021-09-30 20:07:28 OpenStack Infra nova: status In Progress → Fix Released
2021-10-29 16:00:21 OpenStack Infra tags compute numa race-condition resize → compute in-stable-xena numa race-condition resize
2021-11-04 14:20:46 OpenStack Infra tags compute in-stable-xena numa race-condition resize → compute in-stable-wallaby in-stable-xena numa race-condition resize
2021-11-08 16:14:36 OpenStack Infra tags compute in-stable-wallaby in-stable-xena numa race-condition resize → compute in-stable-victoria in-stable-wallaby in-stable-xena numa race-condition resize
2024-05-07 19:36:09 Rodrigo Barbieri summary confirm resize fails with CPUUnpinningInvalid → [SRU] confirm resize fails with CPUUnpinningInvalid
2024-05-07 20:03:05 Rodrigo Barbieri description
  Old value:
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I've pushed a reproduction test: https://review.opendev.org/c/openstack/nova/+/810763
    It is reproducible at least on master, xena, wallaby, and victoria.
  New value:
    * SRU DESCRIPTION BELOW *
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I've pushed a reproduction test: https://review.opendev.org/c/openstack/nova/+/810763
    It is reproducible at least on master, xena, wallaby, and victoria.
    =============== SRU DESCRIPTION ===============
    [Impact]
    Due to a race condition, the tracking of pinned CPU resources can go out of sync, causing "No valid host" errors and making it impossible to create new instances with CPU pinning, as the previously pinned CPUs were not marked as freed.
    Part of the problem is addressed by the fix for LP#1953359, where the migration context does not point to the proper node during the race-condition window, resulting in a CPUPinningInvalid error. This fix complements LP#1953359 by addressing the improper tracking of resources that happens only when the resource tracker periodic job runs on the source node while the registered flavor corresponds to that of the destination. That is solved by setting instance.old_flavor so that the CPU pinning resources are tracked properly.
    [Test case]
    The test cases for this were already implemented as non-live functional tests upstream, in nova/tests/functional/libvirt/test_numa_servers.py:
    - test_resize_dedicated_policy_race_on_dest_bug_1953359
    - test_resize_confirm_bug_1944759
    - test_resize_revert_bug_1944759
    As this is a race condition, it is very difficult to validate, even upstream, so the functional tests mock certain parts of the code to simulate the entire workflow. These are non-live functional tests, so they are more akin to broader unit tests.
    [Regression Potential]
    The code is considered stable today in newer releases, and the scope of the affected code is fairly limited. Given that it is a race condition that is difficult to validate, despite the non-live functional tests, the regression potential is moderate.
    [Other Info]
    None.
2024-05-07 20:41:40 Rodrigo Barbieri attachment added lp1953359_lp1944759_focal.debdiff https://bugs.launchpad.net/nova/+bug/1944759/+attachment/5776088/+files/lp1953359_lp1944759_focal.debdiff
2024-05-07 20:44:00 Rodrigo Barbieri nominated for series nova/ussuri
2024-05-07 20:44:00 Rodrigo Barbieri bug task added nova/ussuri
2024-05-07 20:44:15 Rodrigo Barbieri bug task added nova (Ubuntu)
2024-05-07 20:44:29 Rodrigo Barbieri nominated for series Ubuntu Focal
2024-05-07 20:44:29 Rodrigo Barbieri bug task added nova (Ubuntu Focal)
2024-05-07 20:44:45 Rodrigo Barbieri bug task added cloud-archive
2024-05-07 20:44:56 Rodrigo Barbieri nominated for series cloud-archive/ussuri
2024-05-07 20:44:56 Rodrigo Barbieri bug task added cloud-archive/ussuri
2024-05-07 20:45:48 Rodrigo Barbieri tags compute in-stable-victoria in-stable-wallaby in-stable-xena numa race-condition resize → compute in-stable-victoria in-stable-wallaby in-stable-xena numa race-condition resize sts-sru-needed
2024-05-08 12:08:47 Yiorgos Stamoulis bug added subscriber Yiorgos Stamoulis
2024-06-17 14:18:18 James Page nova (Ubuntu): status New → Invalid
2024-06-17 14:18:25 James Page cloud-archive: status New → Invalid
2024-06-17 14:20:29 James Page cloud-archive/ussuri: status New → Triaged
2024-06-17 14:20:32 James Page nova (Ubuntu Focal): status New → Triaged
2024-06-17 14:20:34 James Page cloud-archive/ussuri: importance Undecided → High
2024-06-17 14:20:36 James Page nova (Ubuntu Focal): importance Undecided → High
2024-06-21 15:40:15 Rodrigo Barbieri description
  Old value:
    * SRU DESCRIPTION BELOW *
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I've pushed a reproduction test: https://review.opendev.org/c/openstack/nova/+/810763
    It is reproducible at least on master, xena, wallaby, and victoria.
    =============== SRU DESCRIPTION ===============
    [Impact]
    Due to a race condition, the tracking of pinned CPU resources can go out of sync, causing "No valid host" errors and making it impossible to create new instances with CPU pinning, as the previously pinned CPUs were not marked as freed.
    Part of the problem is addressed by the fix for LP#1953359, where the migration context does not point to the proper node during the race-condition window, resulting in a CPUPinningInvalid error. This fix complements LP#1953359 by addressing the improper tracking of resources that happens only when the resource tracker periodic job runs on the source node while the registered flavor corresponds to that of the destination. That is solved by setting instance.old_flavor so that the CPU pinning resources are tracked properly.
    [Test case]
    The test cases for this were already implemented as non-live functional tests upstream, in nova/tests/functional/libvirt/test_numa_servers.py:
    - test_resize_dedicated_policy_race_on_dest_bug_1953359
    - test_resize_confirm_bug_1944759
    - test_resize_revert_bug_1944759
    As this is a race condition, it is very difficult to validate, even upstream, so the functional tests mock certain parts of the code to simulate the entire workflow. These are non-live functional tests, so they are more akin to broader unit tests.
    [Regression Potential]
    The code is considered stable today in newer releases, and the scope of the affected code is fairly limited. Given that it is a race condition that is difficult to validate, despite the non-live functional tests, the regression potential is moderate.
    [Other Info]
    None.
  New value:
    * SRU DESCRIPTION BELOW *
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I've pushed a reproduction test: https://review.opendev.org/c/openstack/nova/+/810763
    It is reproducible at least on master, xena, wallaby, and victoria.
    =============== SRU DESCRIPTION ===============
    [Impact]
    Due to a race condition, the tracking of pinned CPU resources can go out of sync, causing "No valid host" errors and making it impossible to create new instances with CPU pinning, as the previously pinned CPUs were not marked as freed.
    Part of the problem is addressed by the fix for LP#1953359, where the migration context does not point to the proper node during the race-condition window, resulting in a CPUPinningInvalid error. This fix complements LP#1953359 by addressing the improper tracking of resources that happens only when the resource tracker periodic job runs on the source node while the registered flavor corresponds to that of the destination. That is solved by setting instance.old_flavor so that the CPU pinning resources are tracked properly.
    [Test case]
    The test cases for this were already implemented as non-live functional tests upstream, in nova/tests/functional/libvirt/test_numa_servers.py:
    - test_resize_dedicated_policy_race_on_dest_bug_1953359
    - test_resize_confirm_bug_1944759
    - test_resize_revert_bug_1944759
    As this is a race condition, it is very difficult to validate, even upstream, so the functional tests mock certain parts of the code to simulate the entire workflow. These are non-live functional tests, so they are more akin to broader unit tests.
    The test case that will be run for this SRU is running the charmed-openstack-tester [1] against an environment containing the upgraded package (essentially as it would be in a point-release SRU), with the expectation that the tests pass. Test run evidence will be attached to the LP bug.
    [Regression Potential]
    The code is considered stable today in newer releases, and the scope of the affected code is fairly limited. Given that it is a race condition that is difficult to validate, despite the non-live functional tests, the regression potential is moderate.
    [Other Info]
    None.
    [1] https://github.com/openstack-charmers/charmed-openstack-tester
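The [Impact] text above says the fix works by keeping instance.old_flavor set so the source host keeps tracking the old pinning while the resize is in flight. A minimal sketch of that idea, in plain Python rather than Nova code (Flavor, Instance, pinned_usage and the source_host/pinned_cpus fields are hypothetical simplifications of the real objects):

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flavor:
    pinned_cpus: frozenset


@dataclass
class Instance:
    host: str                      # current host (the destination once finish_resize ran)
    source_host: str               # host the resize started from
    flavor: Flavor                 # new flavor after finish_resize
    old_flavor: Optional[Flavor]   # kept set while the resize is in flight (the fix)


def pinned_usage(host, instances):
    """Pinned CPUs the periodic job should keep reserved on `host`."""
    pinned = set()
    for inst in instances:
        if inst.host == host:
            pinned |= inst.flavor.pinned_cpus
        elif inst.source_host == host and inst.old_flavor is not None:
            # Mid-resize: the new flavor lives on the destination, but the
            # old flavor's pins are still held on the source host.
            pinned |= inst.old_flavor.pinned_cpus
    return pinned


old = Flavor(pinned_cpus=frozenset({0, 1}))
new = Flavor(pinned_cpus=frozenset({2, 3}))

# Without old_flavor the source-side periodic job sees no usage, drops the
# pins, and the later confirm fails with CPUUnpinningInvalid.
racing = Instance(host="dest", source_host="source", flavor=new, old_flavor=None)
print(pinned_usage("source", [racing]))   # set()

# With old_flavor set, CPUs 0 and 1 stay reserved on the source until confirm.
fixed = Instance(host="dest", source_host="source", flavor=new, old_flavor=old)
print(pinned_usage("source", [fixed]))    # {0, 1}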