Activity log for bug #1944759

Date Who What changed Old value New value Message
2021-09-23 18:11:52 Balazs Gibizer bug added bug
2021-09-23 18:13:34 Balazs Gibizer nova: assignee Balazs Gibizer (balazs-gibizer)
2021-09-23 18:13:37 Balazs Gibizer nova: importance Undecided → Medium
2021-09-23 18:14:04 Balazs Gibizer description
  Old value:
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I will push a reproduction test soon.
  New value:
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I will push a reproduction test soon.
    It is reproducible at least on master, xena, wallaby, and victoria.
2021-09-23 18:14:56 Balazs Gibizer tags compute numa race-condition resize
2021-09-23 18:15:20 Balazs Gibizer description
  Old value:
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I will push a reproduction test soon.
    It is reproducible at least on master, xena, wallaby, and victoria.
  New value:
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I've pushed a reproduction test: https://review.opendev.org/c/openstack/nova/+/810763
    It is reproducible at least on master, xena, wallaby, and victoria.
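The race described in the two description entries above can be illustrated with a small, self-contained sketch. This is plain Python, not Nova code: HostCpuTracker, periodic_update() and unpin() are hypothetical stand-ins for the resource tracker, the update_available_resources periodic job and the CPU-unpinning path, used only to show how dropping the pins once leads to CPUUnpinningInvalid when confirm tries to free them again.

class CPUUnpinningInvalid(Exception):
    pass


class HostCpuTracker:
    """Toy stand-in for the per-host pinned-CPU bookkeeping."""

    def __init__(self, pinned):
        self.pinned = set(pinned)

    def periodic_update(self, tracked_instances):
        # Stand-in for update_available_resources: usage is rebuilt from the
        # instances/migrations the job can see on this host. If neither the
        # instance nor its migration is tracked here, its pins are dropped.
        rebuilt = set()
        for cpus in tracked_instances.values():
            rebuilt |= set(cpus)
        self.pinned = rebuilt

    def unpin(self, cpus):
        missing = set(cpus) - self.pinned
        if missing:
            raise CPUUnpinningInvalid(
                "CPU set to unpin %s must be a subset of pinned set %s"
                % (sorted(missing), sorted(self.pinned)))
        self.pinned -= set(cpus)


source = HostCpuTracker(pinned={0, 1})   # instance pinned to host CPUs 0 and 1

# Race window: finish_resize is running on the destination, but the source-side
# periodic job sees neither the instance nor the migration, so it silently
# drops the pins from the tracker (placement is untouched).
source.periodic_update(tracked_instances={})

# Later, confirm_resize tries to free the same CPUs on the source host:
try:
    source.unpin({0, 1})
except CPUUnpinningInvalid as exc:
    print(exc)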
2021-09-24 07:27:32 Sylvain Bauza nova: status New → Confirmed
2021-09-24 13:39:55 OpenStack Infra nova: status Confirmed → In Progress
2021-09-27 06:02:19 Mariusz Malek bug added subscriber Mariusz Malek
2021-09-29 07:32:16 Aleksander Wojtal bug added subscriber Aleksander Wojtal
2021-09-30 20:07:28 OpenStack Infra nova: status In Progress → Fix Released
2021-10-29 16:00:21 OpenStack Infra tags compute numa race-condition resize → compute in-stable-xena numa race-condition resize
2021-11-04 14:20:46 OpenStack Infra tags compute in-stable-xena numa race-condition resize → compute in-stable-wallaby in-stable-xena numa race-condition resize
2021-11-08 16:14:36 OpenStack Infra tags compute in-stable-wallaby in-stable-xena numa race-condition resize → compute in-stable-victoria in-stable-wallaby in-stable-xena numa race-condition resize
2024-05-07 19:36:09 Rodrigo Barbieri summary confirm resize fails with CPUUnpinningInvalid → [SRU] confirm resize fails with CPUUnpinningInvalid
2024-05-07 20:03:05 Rodrigo Barbieri description
  Old value:
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I've pushed a reproduction test: https://review.opendev.org/c/openstack/nova/+/810763
    It is reproducible at least on master, xena, wallaby, and victoria.
  New value:
    * SRU DESCRIPTION BELOW *
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I've pushed a reproduction test: https://review.opendev.org/c/openstack/nova/+/810763
    It is reproducible at least on master, xena, wallaby, and victoria.
    =============== SRU DESCRIPTION ===============
    [Impact]
    Due to a race condition, the tracking of pinned CPU resources can go out of sync, causing "No valid host" errors and making it impossible to create new instances with CPU pinning, as the previously pinned CPUs were not marked as freed.
    Part of the problem is addressed by the fix for LP#1953359, where the migration context does not point to the proper node during the race-condition window, resulting in a CPUPinningInvalid error. This fix complements LP#1953359 by addressing the improper tracking of resources that happens only when the resource tracker periodic job runs on the source node while the registered flavor corresponds to that of the destination. That is solved by setting instance.old_flavor so that the CPU pinning resources are tracked properly.
    [Test case]
    The test cases for this were already implemented as non-live functional tests upstream, in nova/tests/functional/libvirt/test_numa_servers.py:
    - test_resize_dedicated_policy_race_on_dest_bug_1953359
    - test_resize_confirm_bug_1944759
    - test_resize_revert_bug_1944759
    As this is a race condition, it is very difficult to validate, even upstream, so the functional tests mock certain parts of the code to simulate the entire workflow. These are non-live functional tests, so they are more akin to broader unit tests.
    [Regression Potential]
    The code is considered stable today in newer releases, and the scope of the affected code is fairly limited. Given that it is a race condition that is difficult to validate, despite the non-live functional tests, the regression potential is moderate.
    [Other Info]
    None.
2024-05-07 20:41:40 Rodrigo Barbieri attachment added lp1953359_lp1944759_focal.debdiff https://bugs.launchpad.net/nova/+bug/1944759/+attachment/5776088/+files/lp1953359_lp1944759_focal.debdiff
2024-05-07 20:44:00 Rodrigo Barbieri nominated for series nova/ussuri
2024-05-07 20:44:00 Rodrigo Barbieri bug task added nova/ussuri
2024-05-07 20:44:15 Rodrigo Barbieri bug task added nova (Ubuntu)
2024-05-07 20:44:29 Rodrigo Barbieri nominated for series Ubuntu Focal
2024-05-07 20:44:29 Rodrigo Barbieri bug task added nova (Ubuntu Focal)
2024-05-07 20:44:45 Rodrigo Barbieri bug task added cloud-archive
2024-05-07 20:44:56 Rodrigo Barbieri nominated for series cloud-archive/ussuri
2024-05-07 20:44:56 Rodrigo Barbieri bug task added cloud-archive/ussuri
2024-05-07 20:45:48 Rodrigo Barbieri tags compute in-stable-victoria in-stable-wallaby in-stable-xena numa race-condition resize → compute in-stable-victoria in-stable-wallaby in-stable-xena numa race-condition resize sts-sru-needed
2024-05-08 12:08:47 Yiorgos Stamoulis bug added subscriber Yiorgos Stamoulis
2024-06-17 14:18:18 James Page nova (Ubuntu): status New → Invalid
2024-06-17 14:18:25 James Page cloud-archive: status New → Invalid
2024-06-17 14:20:29 James Page cloud-archive/ussuri: status New → Triaged
2024-06-17 14:20:32 James Page nova (Ubuntu Focal): status New → Triaged
2024-06-17 14:20:34 James Page cloud-archive/ussuri: importance Undecided → High
2024-06-17 14:20:36 James Page nova (Ubuntu Focal): importance Undecided → High
2024-06-21 15:40:15 Rodrigo Barbieri description
  Old value:
    * SRU DESCRIPTION BELOW *
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I've pushed a reproduction test: https://review.opendev.org/c/openstack/nova/+/810763
    It is reproducible at least on master, xena, wallaby, and victoria.
    =============== SRU DESCRIPTION ===============
    [Impact]
    Due to a race condition, the tracking of pinned CPU resources can go out of sync, causing "No valid host" errors and making it impossible to create new instances with CPU pinning, as the previously pinned CPUs were not marked as freed.
    Part of the problem is addressed by the fix for LP#1953359, where the migration context does not point to the proper node during the race-condition window, resulting in a CPUPinningInvalid error. This fix complements LP#1953359 by addressing the improper tracking of resources that happens only when the resource tracker periodic job runs on the source node while the registered flavor corresponds to that of the destination. That is solved by setting instance.old_flavor so that the CPU pinning resources are tracked properly.
    [Test case]
    The test cases for this were already implemented as non-live functional tests upstream, in nova/tests/functional/libvirt/test_numa_servers.py:
    - test_resize_dedicated_policy_race_on_dest_bug_1953359
    - test_resize_confirm_bug_1944759
    - test_resize_revert_bug_1944759
    As this is a race condition, it is very difficult to validate, even upstream, so the functional tests mock certain parts of the code to simulate the entire workflow. These are non-live functional tests, so they are more akin to broader unit tests.
    [Regression Potential]
    The code is considered stable today in newer releases, and the scope of the affected code is fairly limited. Given that it is a race condition that is difficult to validate, despite the non-live functional tests, the regression potential is moderate.
    [Other Info]
    None.
  New value:
    * SRU DESCRIPTION BELOW *
    Nova has a race condition between the resize_instance() compute manager call and the update_available_resources periodic job. If they overlap at just the right point, when resize_instance calls finish_resize, the periodic job will track neither the migration nor the instance on the source host. As a result, the PCPU allocation on the source host is dropped in the resource tracker (not in placement). Then, when the resize is confirmed, nova tries to free the pinned CPUs on the source host again and fails with CPUUnpinningInvalid, as they have already been freed.
    I've pushed a reproduction test: https://review.opendev.org/c/openstack/nova/+/810763
    It is reproducible at least on master, xena, wallaby, and victoria.
    =============== SRU DESCRIPTION ===============
    [Impact]
    Due to a race condition, the tracking of pinned CPU resources can go out of sync, causing "No valid host" errors and making it impossible to create new instances with CPU pinning, as the previously pinned CPUs were not marked as freed.
    Part of the problem is addressed by the fix for LP#1953359, where the migration context does not point to the proper node during the race-condition window, resulting in a CPUPinningInvalid error. This fix complements LP#1953359 by addressing the improper tracking of resources that happens only when the resource tracker periodic job runs on the source node while the registered flavor corresponds to that of the destination. That is solved by setting instance.old_flavor so that the CPU pinning resources are tracked properly.
    [Test case]
    The test cases for this were already implemented as non-live functional tests upstream, in nova/tests/functional/libvirt/test_numa_servers.py:
    - test_resize_dedicated_policy_race_on_dest_bug_1953359
    - test_resize_confirm_bug_1944759
    - test_resize_revert_bug_1944759
    As this is a race condition, it is very difficult to validate, even upstream, so the functional tests mock certain parts of the code to simulate the entire workflow. These are non-live functional tests, so they are more akin to broader unit tests.
    The test case that will be run for this SRU is running the charmed-openstack-tester [1] against an environment containing the upgraded package (essentially as it would be in a point-release SRU), with the expectation that the tests pass. Test run evidence will be attached to the LP bug.
    [Regression Potential]
    The code is considered stable today in newer releases, and the scope of the affected code is fairly limited. Given that it is a race condition that is difficult to validate, despite the non-live functional tests, the regression potential is moderate.
    [Other Info]
    None.
    [1] https://github.com/openstack-charmers/charmed-openstack-tester
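The [Impact] text above says the fix works by keeping instance.old_flavor set so the source host keeps tracking the old pinning while the resize is in flight. A minimal sketch of that idea, in plain Python rather than Nova code (Flavor, Instance, pinned_usage and the source_host/pinned_cpus fields are hypothetical simplifications of the real objects):

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flavor:
    pinned_cpus: frozenset


@dataclass
class Instance:
    host: str                      # current host (the destination once finish_resize ran)
    source_host: str               # host the resize started from
    flavor: Flavor                 # new flavor after finish_resize
    old_flavor: Optional[Flavor]   # kept set while the resize is in flight (the fix)


def pinned_usage(host, instances):
    """Pinned CPUs the periodic job should keep reserved on `host`."""
    pinned = set()
    for inst in instances:
        if inst.host == host:
            pinned |= inst.flavor.pinned_cpus
        elif inst.source_host == host and inst.old_flavor is not None:
            # Mid-resize: the new flavor lives on the destination, but the
            # old flavor's pins are still held on the source host.
            pinned |= inst.old_flavor.pinned_cpus
    return pinned


old = Flavor(pinned_cpus=frozenset({0, 1}))
new = Flavor(pinned_cpus=frozenset({2, 3}))

# Without old_flavor the source-side periodic job sees no usage, drops the
# pins, and the later confirm fails with CPUUnpinningInvalid.
racing = Instance(host="dest", source_host="source", flavor=new, old_flavor=None)
print(pinned_usage("source", [racing]))   # set()

# With old_flavor set, CPUs 0 and 1 stay reserved on the source until confirm.
fixed = Instance(host="dest", source_host="source", flavor=new, old_flavor=old)
print(pinned_usage("source", [fixed]))    # {0, 1}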