Resize/migrate intermittently fails to revert for instances with dedicated CPUs

Bug #1952915 reported by Gabriel Silva Trevisan
This bug affects 3 people
Affects                   Status        Importance  Assigned to             Milestone
OpenStack Compute (nova)  Fix Released  Undecided   Gabriel Silva Trevisan
Ussuri                    Fix Released  Undecided   Unassigned
Victoria                  Fix Released  Undecided   Unassigned
Wallaby                   Fix Released  Undecided   Unassigned
Xena                      Fix Released  Undecided   Unassigned

Bug Description

Description
-----------
Reverting a cold-migration/resize on an instance with dedicated CPU policy fails intermittently with status 400 and a "fault" message similar to the following:

"CPU set to unpin [8] must be a subset of pinned CPU set []"

The message above resembles what was reported in https://bugs.launchpad.net/nova/+bug/1879878 and https://bugs.launchpad.net/nova/+bug/1944759. A similar race condition between the resize operation and the ComputeManager.update_available_resource() periodic task is suspected, but at a different point in the operation.

Steps to Reproduce
------------------
1. Create a flavor with property hw:cpu_policy=dedicated
2. Create an instance with the new flavor
3. Issue a cold-migration (or a resize with migration) to the new instance and wait for it to finish
4. Issue a resize-revert to the migrated instance and check its state
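Assuming a standard OpenStack client setup, the steps above can be sketched as follows (the flavor, image, network, and server names are placeholders for illustration):

```shell
# 1-2. Flavor with dedicated CPU policy, and an instance using it.
openstack flavor create --vcpus 2 --ram 1024 --disk 10 \
    --property hw:cpu_policy=dedicated pinned.small
openstack server create --flavor pinned.small --image cirros \
    --network private --wait pinned-vm

# 3. Cold-migrate and wait until the instance reaches VERIFY_RESIZE.
openstack server migrate pinned-vm
openstack server show pinned-vm -f value -c status

# 4. Revert and check whether the instance went to ERROR with a
#    "CPU set to unpin ..." fault.
openstack server resize revert pinned-vm
openstack server show pinned-vm -f value -c status -c fault
```

Note that `openstack server resize revert` requires a reasonably recent python-openstackclient; older releases expose the same operation as `openstack server resize --revert`.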

Expected Behavior
------------------
Instance migration/resize should be reverted successfully.

Actual Behavior
----------------
The error described occurs intermittently, causing the revert to fail, and leaving the instance on the destination node.

The failure occurs with low frequency, but when it does occur, the revert fails without the possibility of a retry. This makes it more difficult to get the instance back to its previous node/configuration (see the "Workaround" section). It was verified for both cold-migration and resize.

From the tests performed on a two-node system, with update_resources_interval=60, it occurred at approximately 1-3% of the trials.

Environment
-----------
At least two compute nodes are required for migration. The issue was reproduced on a two-node system and on a multi-node system with dedicated storage.

Tested with branch stable/ussuri. Latest commit: 6667fcb92bfaf03a8a274dc26806c137aace6b49.

Also added some custom changes for testing/debugging purposes, listed below:

- To check if the issue still occurred with the fix from https://bugs.launchpad.net/nova/+bug/1944759:
    AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L5655, added:
        instance.old_flavor = instance.flavor

- Additional log messages for debugging. To give better context to the "Timestamp/Logs" section, the most relevant ones are listed below:
    (Task update_available_resource started)
    AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L9841, added:
        LOG.warning("====== Started periodic task ======")

    (Instance saved with host changed to destination host, and old flavor set to original one (due to #1944759 fix))
    AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L5657, added:
        LOG.warning("====== Set instance host=dest and old_flavor=flavor ======")

    (Reached final resize step on destination)
    AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L5869, added:
        LOG.warning("====== Request reached finish_resize on dest compute ======")

    (Instance configuration (such as NUMA topology) updated with values for the destination host)
    AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L5795, added:
        LOG.warning("====== Set instance old_flavor=flavor, flavor=new_flavor and added _new ======")

    (Migration context created with given original and selected NUMA topologies)
    AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/resource_tracker.py#L310, added:
        LOG.info("====== Migration old topology: %(old_topology)s, ======", {'old_topology': mig_context.old_numa_topology}, instance=instance)
        LOG.info("====== Migration new topology: %(new_topology)s, ======", {'new_topology': mig_context.new_numa_topology}, instance=instance)

    (Unpinning host CPUs from resource_tracker's inventory)
    AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/virt/hardware.py#L2247, added:
        LOG.warning(f"===== Unpinning CPUs {pinned_cpus} from {new_cell.pinned_cpus} ======")

    (Pinning host CPUs to resource_tracker's inventory)
    AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/virt/hardware.py#L2253, added:
        LOG.warning(f"===== Pinning CPUs {pinned_cpus} to {new_cell.pinned_cpus} ======")

    (About to drop move claim for destination host during migrate-revert)
    AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/resource_tracker.py#L565, added:
        LOG.warning("====== Migration status reverted, dropping move claim ======")

Last Pass
---------
Due to the intermittency, most revert attempts are successful. It is unclear whether this scenario ever passed consistently. Similar issues are https://bugs.launchpad.net/nova/+bug/1879878 and https://bugs.launchpad.net/nova/+bug/1944759. The fix for the former was already present in the branch when the issue was observed; the fix for the latter was added manually for testing, and the issue persisted with it.

Timestamp/Logs
--------------
For reference, the logs below were captured for the following timeline of events (on a multi-node system with dedicated storage):
1. (16:47:10) Instance is created using dedicated CPU policy (with $INSTANCE_ID representing its ID)
2. (16:48:41) Command "nova migrate --poll $INSTANCE_ID" is issued to migrate the instance from node "compute-0" to "compute-1"
3. (16:48:59) Migration finishes, and instance moves to "compute-1"
4. (16:49:01) Command "openstack server resize revert $INSTANCE_ID" is issued to move the instance back to "compute-0"
5. (16:49:06) Instance moves to "ERROR" state with a "fault" message similar to the one in the description
6. (16:53:56) Instance moves back to "ACTIVE" state on "compute-1"

In this timespan, two exceptions occur: one between events 2 and 3, at 16:48:53; and another between events 4 and 5, at 16:49:06. Some of the relevant logs are shown below (less relevant information, such as request IDs and repeated dates, was omitted for clarity):

==========
First exception (on nova_compute for compute-1):

2021-11-25T16:48:44.810551851Z stdout F 1306408 INFO nova.compute.resource_tracker [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] ====== Migration old topology: InstanceNUMATopology: instance_uuid: None emulator_threads_policy: None InstanceNUMACell (id: 1) cpus: 0 cpu_pinning: {0: 15} reserved: None memory: 1024 pagesize: 1048576 cpu_topology: VirtCPUTopology(cores=1,sockets=1,threads=1) cpu_policy: dedicated cpu_thread_policy: None
[...]
2021-11-25T16:48:44.812188907Z stdout F 1306408 INFO nova.compute.resource_tracker [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] ====== Migration new topology: InstanceNUMATopology: instance_uuid: None emulator_threads_policy: None InstanceNUMACell (id: 0) cpus: 0 cpu_pinning: {0: 8} reserved: None memory: 1024 pagesize: 1048576 cpu_topology: VirtCPUTopology(cores=1,sockets=1,threads=1) cpu_policy: dedicated cpu_thread_policy: None
[...]
2021-11-25T16:48:49.791057254Z stdout F 1182154 WARNING nova.compute.manager ====== Set instance host=dest and old_flavor=flavor ======
[...]
2021-11-25T16:48:49.833988502Z stdout F 1306408 WARNING nova.compute.manager ====== Request reached finish_resize on dest compute ======
[...]
2021-11-25T16:48:51.798085129Z stdout F 1306408 WARNING nova.compute.manager ====== Started periodic task ======
[...]
2021-11-25T16:48:53.267609465Z stdout F 1306408 WARNING nova.virt.hardware ===== Pinning CPUs {15} to CoercedSet() ======
2021-11-25T16:48:53.268609833Z stdout F 1306408 ERROR nova.compute.manager Error updating resources for node compute-1.: nova.exception.CPUPinningUnknown: CPU set to pin [15] must be a subset of known CPU set []
2021-11-25T16:48:53.26862307Z stdout F 1306408 ERROR nova.compute.manager Traceback (most recent call last):
2021-11-25T16:48:53.268629118Z stdout F 1306408 ERROR nova.compute.manager File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/manager.py", line 9957, in _update_available_resource_for_node
2021-11-25T16:48:53.268633867Z stdout F 1306408 ERROR nova.compute.manager startup=startup)
2021-11-25T16:48:53.26863832Z stdout F 1306408 ERROR nova.compute.manager File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 943, in update_available_resource
[...]
2021-11-25T16:48:53.268690168Z stdout F 1306408 ERROR nova.compute.manager nodename, sign=sign)
2021-11-25T16:48:53.268694117Z stdout F 1306408 ERROR nova.compute.manager File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1400, in _update_usage
2021-11-25T16:48:53.268698192Z stdout F 1306408 ERROR nova.compute.manager host_numa_topology, instance_numa_topology, free)._to_json()
2021-11-25T16:48:53.268702169Z stdout F 1306408 ERROR nova.compute.manager File "/var/lib/openstack/lib/python3.6/site-packages/nova/virt/hardware.py", line 2397, in numa_usage_from_instance_numa
2021-11-25T16:48:53.268708832Z stdout F 1306408 ERROR nova.compute.manager new_cell.pin_cpus(pinned_cpus)
2021-11-25T16:48:53.268712874Z stdout F 1306408 ERROR nova.compute.manager File "/var/lib/openstack/lib/python3.6/site-packages/nova/objects/numa.py", line 87, in pin_cpus
2021-11-25T16:48:53.268717207Z stdout F 1306408 ERROR nova.compute.manager available=list(self.pcpuset))
2021-11-25T16:48:53.268721519Z stdout F 1306408 ERROR nova.compute.manager nova.exception.CPUPinningUnknown: CPU set to pin [15] must be a subset of known CPU set []
[...]
2021-11-25T16:48:53.365606052Z stdout F 1306408 WARNING nova.compute.manager ====== Set instance old_flavor=flavor, flavor=new_flavor and added _new ======

==========
Second exception (on nova_compute for compute-1):

2021-11-25T16:49:03.934849386Z stdout F 1306408 INFO nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] Revert resize.
[...]
2021-11-25T16:49:06.686180122Z stdout F 1306408 WARNING nova.compute.resource_tracker ====== Migration status reverted, dropping move claim ======
[...]
2021-11-25T16:49:06.71669438Z stdout F 1306408 WARNING nova.virt.hardware ===== Unpinning CPUs {8} from CoercedSet() ======
2021-11-25T16:49:06.724405196Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] Setting instance vm_state to ERROR: nova.exception.CPUUnpinningInvalid: CPU set to unpin [8] must be a subset of pinned CPU set []
2021-11-25T16:49:06.724419611Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] Traceback (most recent call last):
2021-11-25T16:49:06.724425341Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/manager.py", line 10198, in _error_out_instance_on_exception
2021-11-25T16:49:06.724430797Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] yield
2021-11-25T16:49:06.724435067Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/manager.py", line 4985, in revert_resize
2021-11-25T16:49:06.724441603Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] self.rt.drop_move_claim_at_dest(context, instance, migration)
2021-11-25T16:49:06.72444586Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] File "/var/lib/openstack/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 359, in inner
2021-11-25T16:49:06.724449845Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] return f(*args, **kwargs)
2021-11-25T16:49:06.724453921Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 619, in drop_move_claim_at_dest
2021-11-25T16:49:06.724457832Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] prefix='new_')
2021-11-25T16:49:06.724479578Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 682, in _drop_move_claim
2021-11-25T16:49:06.724484245Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] self._update_usage(usage, nodename, sign=-1)
2021-11-25T16:49:06.724488243Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1400, in _update_usage
2021-11-25T16:49:06.724492348Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] host_numa_topology, instance_numa_topology, free)._to_json()
2021-11-25T16:49:06.72449618Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] File "/var/lib/openstack/lib/python3.6/site-packages/nova/virt/hardware.py", line 2390, in numa_usage_from_instance_numa
2021-11-25T16:49:06.724505037Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] new_cell.unpin_cpus(pinned_cpus)
2021-11-25T16:49:06.724509637Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] File "/var/lib/openstack/lib/python3.6/site-packages/nova/objects/numa.py", line 104, in unpin_cpus
2021-11-25T16:49:06.724513661Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] self.pinned_cpus))
2021-11-25T16:49:06.724517574Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] nova.exception.CPUUnpinningInvalid: CPU set to unpin [8] must be a subset of pinned CPU set []
2021-11-25T16:49:06.724521412Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]
2021-11-25T16:49:06.729548497Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] Setting instance vm_state to ERROR: nova.exception.CPUUnpinningInvalid: CPU set to unpin [8] must be a subset of pinned CPU set []

Workaround
----------
For migration: migrate the instance back to the source host after the failure. The original source host can be retrieved from the migration list, and must be specified in the migrate request (admin-only, API microversion >= 2.56). After the revert failure, the instance should return to ACTIVE state on the destination host after about 3-4 minutes.

For resize: recreate the instance using the original flavor. The original source host may also be passed as a parameter (admin-only, API version >= 2.74).
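Assuming admin credentials, the migration workaround above can be sketched as (the host name and instance ID are placeholders):

```shell
# Retrieve the original source host from the migration list (admin-only).
openstack server migration list --server $INSTANCE_ID

# Move the instance back; targeting a specific host in a cold migration
# requires compute API microversion >= 2.56.
openstack --os-compute-api-version 2.56 \
    server migrate --host compute-0 $INSTANCE_ID
```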

Revision history for this message
Gabriel Silva Trevisan (g-trevisan) wrote :

Given the logs in the description, we believe that this issue happens when the update_available_resource task runs between two events during the migration:
    1. The instance being saved with its host/node set to destination
        - https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L5657
    2. The instance configuration (such as flavor and NUMA topology) being saved with the correct values for the destination host
        - https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L5795

During this interval, the task will consider that the instance belongs to the destination host, but since the instance still carries its source configuration, the task will attempt to pin the CPU numbers that were pinned on the source host. When this happens, one of the following scenarios may occur:
    1. The CPUs selected for the destination host coincide with the original ones on the source host
    2. The CPUs selected differ from the original ones, but are still free PCPUs on the destination
    3. The original CPUs are not part of the PCPU set on the destination host, or are already allocated

If scenario 1 occurs, the migration will be successful, and a subsequent revert will also work, since it will try to unpin the correct CPUs. In scenarios 2 and 3 the migration also completes successfully, but the PCPU inventory in resource_tracker will be temporarily incorrect, with scenario 3 also throwing a CPUPinningUnknown or CPUPinningInvalid exception during the task.

If the revert is issued after the periodic task has run again, it will succeed in all scenarios, since the PCPUs will be correctly tracked by then. However, if the revert runs before the next periodic task, the tracking may still be incorrect, which causes the observed CPUUnpinningInvalid exception to be thrown and the operation to fail.

We believe scenario 3 best explains the logs given in the description. As can be observed, the original CPU pinning for the instance on "compute-0" is "{0: 15}", while the selected pinning for "compute-1" is "{0: 8}". When the periodic task runs, it tries to pin CPU 15 for "compute-1", which is not part of its PCPU set, causing the task to fail. Afterwards, the migration corrects the instance NUMA topology and starts the instance as expected, pinning CPU 8 to it. However, since the internal tracking is not updated, the revert operation fails to unpin CPU 8, as it is not listed in the resource_tracker's inventory for "compute-1".
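The accounting failure described above can be illustrated with a small toy model of the pin/unpin bookkeeping (a hypothetical simplification for illustration; the real logic lives in nova/objects/numa.py and nova/virt/hardware.py):

```python
class CPUPinningError(Exception):
    pass


class ToyNUMACell:
    """Minimal stand-in for nova's per-cell pinned-CPU accounting."""

    def __init__(self, pcpuset):
        self.pcpuset = set(pcpuset)  # host CPUs available for pinning
        self.pinned_cpus = set()     # host CPUs currently pinned

    def pin_cpus(self, cpus):
        cpus = set(cpus)
        if not cpus <= self.pcpuset:
            raise CPUPinningError(
                "CPU set to pin %s must be a subset of known CPU set %s"
                % (sorted(cpus), sorted(self.pcpuset)))
        self.pinned_cpus |= cpus

    def unpin_cpus(self, cpus):
        cpus = set(cpus)
        if not cpus <= self.pinned_cpus:
            raise CPUPinningError(
                "CPU set to unpin %s must be a subset of pinned CPU set %s"
                % (sorted(cpus), sorted(self.pinned_cpus)))
        self.pinned_cpus -= cpus


# Scenario 3: the destination host exposes only PCPUs {8, 9}, but the
# periodic task still sees the source pinning {15} because the migration
# context has not been applied to the instance yet.
dest = ToyNUMACell(pcpuset={8, 9})
try:
    dest.pin_cpus({15})  # periodic task fails: CPU 15 unknown on dest
except CPUPinningError as exc:
    print(exc)

# finish_resize later pins CPU 8 on the hypervisor, but the tracker's
# inventory was never updated, so the revert's unpin of CPU 8 also fails.
try:
    dest.unpin_cpus({8})
except CPUPinningError as exc:
    print(exc)
```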

Revision history for this message
Gabriel Silva Trevisan (g-trevisan) wrote :

One way to make this issue more easily reproducible is to make the following changes:

1. Have both the source and destination hosts with mutually exclusive free PCPUs (e.g.: configure different PCPU sets, or allocate different PCPUs on both)

2. Either change CONF.update_resources_interval to a high value, such as "999", or modify the "spacing" parameter from https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L9840 to a high value.

3. After https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L5898, add:
    try:
        LOG.info("====== Calling update_available_resource ======")
        self.update_available_resource(context)
    except Exception:
        # Swallow any error from the forced periodic run so the resize
        # can continue; the exception itself is the symptom under study.
        pass

This way, the periodic task will always run while the instance is in the intermediate state. If both hosts select different CPU numbers for pinning, then the observed exception should occur on revert.

description: updated
Changed in nova:
assignee: nobody → Gabriel Silva Trevisan (g-trevisan)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/820381

Changed in nova:
status: New → In Progress
Revision history for this message
melanie witt (melwitt) wrote :

It looks like:

https://bugs.launchpad.net/nova/+bug/1953359

might be a duplicate of this issue?

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I think the root cause of the two bugs are the same.

As https://bugs.launchpad.net/nova/+bug/1953359 is the newer I marked that as duplicate of this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/820856

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/820859

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/victoria)

Change abandoned by "Balazs Gibizer <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/820856
Reason: abandon it for now

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/820859
Committed: https://opendev.org/openstack/nova/commit/9f296d775d8f58fcbd03393c81a023268c7071cb
Submitter: "Zuul (22348)"
Branch: master

commit 9f296d775d8f58fcbd03393c81a023268c7071cb
Author: Balazs Gibizer <email address hidden>
Date: Mon Dec 6 16:36:41 2021 +0100

    Extend the reproducer for 1953359 and 1952915

    This patch extends the original reproduction
    I4be429c56aaa15ee12f448978c38214e741eae63 to cover
    bug 1952915 as well as they have a common root cause.

    Change-Id: I57982131768d87e067d1413012b96f1baa68052b
    Related-Bug: #1953359
    Related-Bug: #1952915

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/820549
Committed: https://opendev.org/openstack/nova/commit/32c1044d86a8d02712c8e3abdf8b3e4cff234a9c
Submitter: "Zuul (22348)"
Branch: master

commit 32c1044d86a8d02712c8e3abdf8b3e4cff234a9c
Author: Balazs Gibizer <email address hidden>
Date: Mon Dec 6 17:06:51 2021 +0100

    [rt] Apply migration context for incoming migrations

    There is a race condition between an incoming resize and an
    update_available_resource periodic in the resource tracker. The race
    window starts when the resize_instance RPC finishes and ends when the
    finish_resize compute RPC finally applies the migration context on the
    instance.

    In the race window, if the update_available_resource periodic is run on
    the destination node, then it will see the instance as being tracked on
    this host as the instance.node is already pointing to the dest. But the
    instance.numa_topology still points to the source host topology as the
    migration context is not applied yet. This leads to CPU pinning error if
    the source topology does not fit to the dest topology. Also it stops the
    periodic task and leaves the tracker in an inconsistent state. The
    inconsistent state only cleanup up after the periodic is run outside of
    the race window.

    This patch applies the migration context temporarily to the specific
    instances during the periodic to keep resource accounting correct.

    Change-Id: Icaad155e22c9e2d86e464a0deb741c73f0dfb28a
    Closes-Bug: #1953359
    Closes-Bug: #1952915

Changed in nova:
status: In Progress → Fix Released
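The approach described in the commit message can be sketched as a context manager that temporarily swaps in the destination-side NUMA view while the periodic accounting runs (a toy model using plain dicts; the real nova implementation operates on Instance and MigrationContext objects in the resource tracker):

```python
from contextlib import contextmanager


@contextmanager
def applied_migration_context(instance, migration_context):
    """Temporarily expose the dest-side NUMA topology, then restore it."""
    original = instance["numa_topology"]
    if migration_context is not None:
        instance["numa_topology"] = migration_context["new_numa_topology"]
    try:
        yield instance
    finally:
        instance["numa_topology"] = original


# An incoming migration: the instance record still holds the source
# pinning, while the migration context already knows the dest pinning.
instance = {"numa_topology": {"cpu_pinning": {0: 15}}}    # source view
mig_ctx = {"new_numa_topology": {"cpu_pinning": {0: 8}}}  # dest view

with applied_migration_context(instance, mig_ctx) as inst:
    # The periodic accounting now sees the destination pinning {0: 8},
    # so it pins CPU 8 instead of the bogus CPU 15.
    print(inst["numa_topology"]["cpu_pinning"])

# Outside the periodic run, the instance is unchanged: {0: 15}.
print(instance["numa_topology"]["cpu_pinning"])
```

This keeps the tracker's inventory consistent during the race window without permanently mutating the instance before finish_resize applies the migration context for real.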
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/821941

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/821943

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/822048

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/822050

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/nova/+/821941
Committed: https://opendev.org/openstack/nova/commit/0411962938ae1de39f8dccb03efe4567f82ad671
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 0411962938ae1de39f8dccb03efe4567f82ad671
Author: Balazs Gibizer <email address hidden>
Date: Mon Dec 6 16:36:41 2021 +0100

    Extend the reproducer for 1953359 and 1952915

    This patch extends the original reproduction
    I4be429c56aaa15ee12f448978c38214e741eae63 to cover
    bug 1952915 as well as they have a common root cause.

    Change-Id: I57982131768d87e067d1413012b96f1baa68052b
    Related-Bug: #1953359
    Related-Bug: #1952915
    (cherry picked from commit 9f296d775d8f58fcbd03393c81a023268c7071cb)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/nova/+/820553
Committed: https://opendev.org/openstack/nova/commit/1235dc324ebc1c6ac6dc94da0f45ffffcc546d2c
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 1235dc324ebc1c6ac6dc94da0f45ffffcc546d2c
Author: Balazs Gibizer <email address hidden>
Date: Mon Dec 6 17:06:51 2021 +0100

    [rt] Apply migration context for incoming migrations

    There is a race condition between an incoming resize and an
    update_available_resource periodic in the resource tracker. The race
    window starts when the resize_instance RPC finishes and ends when the
    finish_resize compute RPC finally applies the migration context on the
    instance.

    In the race window, if the update_available_resource periodic is run on
    the destination node, then it will see the instance as being tracked on
    this host as the instance.node is already pointing to the dest. But the
    instance.numa_topology still points to the source host topology as the
    migration context is not applied yet. This leads to CPU pinning error if
    the source topology does not fit to the dest topology. Also it stops the
    periodic task and leaves the tracker in an inconsistent state. The
    inconsistent state only cleanup up after the periodic is run outside of
    the race window.

    This patch applies the migration context temporarily to the specific
    instances during the periodic to keep resource accounting correct.

    Change-Id: Icaad155e22c9e2d86e464a0deb741c73f0dfb28a
    Closes-Bug: #1953359
    Closes-Bug: #1952915
    (cherry picked from commit 32c1044d86a8d02712c8e3abdf8b3e4cff234a9c)

Revision history for this message
melanie witt (melwitt) wrote :

Hi, the Fix Released status means that the change has been included in a point release of the stable/xena branch, which has not occurred yet. Until then, the change is Fix Committed, i.e. present in the stable/xena branch but not yet included in an official release. When the stable/xena branch is released via the openstack/releases repo [1], all associated bug statuses will be updated to Fix Released automatically.

Changing status back to Fix Committed.

[1] https://releases.openstack.org/reference/using.html#requesting-a-release

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/821943
Committed: https://opendev.org/openstack/nova/commit/94f17be190cce060ba8afcafbade4247b27b86f0
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 94f17be190cce060ba8afcafbade4247b27b86f0
Author: Balazs Gibizer <email address hidden>
Date: Mon Dec 6 16:36:41 2021 +0100

    Extend the reproducer for 1953359 and 1952915

    This patch extends the original reproduction
    I4be429c56aaa15ee12f448978c38214e741eae63 to cover
    bug 1952915 as well as they have a common root cause.

    Change-Id: I57982131768d87e067d1413012b96f1baa68052b
    Related-Bug: #1953359
    Related-Bug: #1952915
    (cherry picked from commit 9f296d775d8f58fcbd03393c81a023268c7071cb)
    (cherry picked from commit 0411962938ae1de39f8dccb03efe4567f82ad671)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/820555
Committed: https://opendev.org/openstack/nova/commit/5f2f283a75243d2e2629d3c5f7e5ef4b3994972d
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 5f2f283a75243d2e2629d3c5f7e5ef4b3994972d
Author: Balazs Gibizer <email address hidden>
Date: Mon Dec 6 17:06:51 2021 +0100

    [rt] Apply migration context for incoming migrations

    There is a race condition between an incoming resize and an
    update_available_resource periodic in the resource tracker. The race
    window starts when the resize_instance RPC finishes and ends when the
    finish_resize compute RPC finally applies the migration context on the
    instance.

    In the race window, if the update_available_resource periodic is run on
    the destination node, then it will see the instance as being tracked on
    this host as the instance.node is already pointing to the dest. But the
    instance.numa_topology still points to the source host topology as the
    migration context is not applied yet. This leads to CPU pinning error if
    the source topology does not fit to the dest topology. Also it stops the
    periodic task and leaves the tracker in an inconsistent state. The
    inconsistent state only cleanup up after the periodic is run outside of
    the race window.

    This patch applies the migration context temporarily to the specific
    instances during the periodic to keep resource accounting correct.

    Change-Id: Icaad155e22c9e2d86e464a0deb741c73f0dfb28a
    Closes-Bug: #1953359
    Closes-Bug: #1952915
    (cherry picked from commit 32c1044d86a8d02712c8e3abdf8b3e4cff234a9c)
    (cherry picked from commit 1235dc324ebc1c6ac6dc94da0f45ffffcc546d2c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Gabriel Silva Trevisan <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/820381
Reason: Abandoning in favor of https://review.opendev.org/c/openstack/nova/+/820549

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/nova/+/820856
Committed: https://opendev.org/openstack/nova/commit/8d4487465b60cd165dc76dea5a9fdb3c4dbf5740
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 8d4487465b60cd165dc76dea5a9fdb3c4dbf5740
Author: Balazs Gibizer <email address hidden>
Date: Mon Dec 6 16:36:41 2021 +0100

    Extend the reproducer for 1953359 and 1952915

    This patch extends the original reproduction
    I4be429c56aaa15ee12f448978c38214e741eae63 to cover
    bug 1952915 as well as they have a common root cause.

    Change-Id: I57982131768d87e067d1413012b96f1baa68052b
    Related-Bug: #1953359
    Related-Bug: #1952915
    (cherry picked from commit 9f296d775d8f58fcbd03393c81a023268c7071cb)
    (cherry picked from commit 0411962938ae1de39f8dccb03efe4567f82ad671)
    (cherry picked from commit 94f17be190cce060ba8afcafbade4247b27b86f0)

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/nova/+/820559
Committed: https://opendev.org/openstack/nova/commit/d54bd316b331d439a26a7318ca68cab5f6280ab2
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit d54bd316b331d439a26a7318ca68cab5f6280ab2
Author: Balazs Gibizer <email address hidden>
Date: Mon Dec 6 17:06:51 2021 +0100

    [rt] Apply migration context for incoming migrations

    There is a race condition between an incoming resize and an
    update_available_resource periodic in the resource tracker. The race
    window starts when the resize_instance RPC finishes and ends when the
    finish_resize compute RPC finally applies the migration context on the
    instance.

    In the race window, if the update_available_resource periodic is run on
    the destination node, then it will see the instance as being tracked on
    this host as the instance.node is already pointing to the dest. But the
    instance.numa_topology still points to the source host topology as the
    migration context is not applied yet. This leads to a CPU pinning error
    if the source topology does not fit the dest topology. It also stops the
    periodic task and leaves the tracker in an inconsistent state. The
    inconsistent state is only cleaned up after the periodic is run outside
    of the race window.

    This patch applies the migration context temporarily to the specific
    instances during the periodic to keep resource accounting correct.

    Change-Id: Icaad155e22c9e2d86e464a0deb741c73f0dfb28a
    Closes-Bug: #1953359
    Closes-Bug: #1952915
    (cherry picked from commit 32c1044d86a8d02712c8e3abdf8b3e4cff234a9c)
    (cherry picked from commit 1235dc324ebc1c6ac6dc94da0f45ffffcc546d2c)
    (cherry picked from commit 5f2f283a75243d2e2629d3c5f7e5ef4b3994972d)
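
The approach described in the commit message above can be sketched as a small context manager. This is a hypothetical, simplified model for illustration only: the class names and attributes (`Instance`, `MigrationContext`, `new_numa_topology`) are stand-ins, not nova's actual objects or API. The idea is that the periodic task views the instance with the destination topology applied, and the original value is restored afterwards so the instance record itself is not mutated outside the window.

```python
import contextlib


class MigrationContext:
    """Stand-in for a migration context carrying the dest-side topology."""
    def __init__(self, new_numa_topology):
        self.new_numa_topology = new_numa_topology


class Instance:
    """Stand-in for an instance record; numa_topology still points at the
    source host until finish_resize applies the migration context."""
    def __init__(self, numa_topology, migration_context=None):
        self.numa_topology = numa_topology
        self.migration_context = migration_context


@contextlib.contextmanager
def applied_migration_context(instance):
    """Temporarily swap in the destination NUMA topology while resource
    accounting runs, restoring the original on exit."""
    if instance.migration_context is None:
        # No in-progress migration: nothing to apply.
        yield instance
        return
    original = instance.numa_topology
    instance.numa_topology = instance.migration_context.new_numa_topology
    try:
        yield instance
    finally:
        # Restore so the instance record is unchanged outside the periodic.
        instance.numa_topology = original


# Periodic-task-like usage: accounting sees the dest view, then it is undone.
inst = Instance(numa_topology="source-topology",
                migration_context=MigrationContext("dest-topology"))
with applied_migration_context(inst) as i:
    assert i.numa_topology == "dest-topology"
assert inst.numa_topology == "source-topology"
```

The try/finally is the important part: even if the accounting step raises, the source-side topology is restored, which avoids leaving the tracker with a half-applied view.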

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 22.4.0

This issue was fixed in the openstack/nova 22.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.2.0

This issue was fixed in the openstack/nova 23.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.1.0

This issue was fixed in the openstack/nova 24.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 25.0.0.0rc1

This issue was fixed in the openstack/nova 25.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/839354

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/839355

Revision history for this message
johjuhyun (juhyun-joh) wrote :

Hello. Do you have any plan to backport it to rocky branch?

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@johjuhyun: Hi! The backport to rocky is not on my TODO list right now. So feel free to propose the backports to stein and rocky if you need them.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/nova/+/822048
Committed: https://opendev.org/openstack/nova/commit/9b8e5cec303a621824366e1794665d6b849fefad
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 9b8e5cec303a621824366e1794665d6b849fefad
Author: Balazs Gibizer <email address hidden>
Date: Mon Dec 6 16:36:41 2021 +0100

    Extend the reproducer for 1953359 and 1952915

    This patch extends the original reproduction
    I4be429c56aaa15ee12f448978c38214e741eae63 to cover
    bug 1952915 as well, since the two bugs have a common root cause.

    Change-Id: I57982131768d87e067d1413012b96f1baa68052b
    Related-Bug: #1953359
    Related-Bug: #1952915
    (cherry picked from commit 9f296d775d8f58fcbd03393c81a023268c7071cb)
    (cherry picked from commit 0411962938ae1de39f8dccb03efe4567f82ad671)
    (cherry picked from commit 94f17be190cce060ba8afcafbade4247b27b86f0)
    (cherry picked from commit 8d4487465b60cd165dc76dea5a9fdb3c4dbf5740)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/nova/+/822050
Committed: https://opendev.org/openstack/nova/commit/1d0b7051da430ed00ae49901a32ec6af46c1a64e
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 1d0b7051da430ed00ae49901a32ec6af46c1a64e
Author: Balazs Gibizer <email address hidden>
Date: Mon Dec 6 17:06:51 2021 +0100

    [rt] Apply migration context for incoming migrations

    There is a race condition between an incoming resize and an
    update_available_resource periodic in the resource tracker. The race
    window starts when the resize_instance RPC finishes and ends when the
    finish_resize compute RPC finally applies the migration context on the
    instance.

    In the race window, if the update_available_resource periodic is run on
    the destination node, then it will see the instance as being tracked on
    this host as the instance.node is already pointing to the dest. But the
    instance.numa_topology still points to the source host topology as the
    migration context is not applied yet. This leads to a CPU pinning error
    if the source topology does not fit the dest topology. It also stops the
    periodic task and leaves the tracker in an inconsistent state. The
    inconsistent state is only cleaned up after the periodic is run outside
    of the race window.

    This patch applies the migration context temporarily to the specific
    instances during the periodic to keep resource accounting correct.

    Conflicts: in resource_tracker, the
    'MigrationList.get_in_progress_and_error' call was changed back to
    'MigrationList.get_in_progress_by_host_and_node', since the former was
    only added by 255b3f2f918843ca5dd9b99e109ecd2189b6b749 and is not
    present in stable/ussuri.

    Change-Id: Icaad155e22c9e2d86e464a0deb741c73f0dfb28a
    Closes-Bug: #1953359
    Closes-Bug: #1952915
    (cherry picked from commit 32c1044d86a8d02712c8e3abdf8b3e4cff234a9c)
    (cherry picked from commit 1235dc324ebc1c6ac6dc94da0f45ffffcc546d2c)
    (cherry picked from commit 5f2f283a75243d2e2629d3c5f7e5ef4b3994972d)
    (cherry picked from commit d54bd316b331d439a26a7318ca68cab5f6280ab2)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/train)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/839354
Reason: The stable/train branches of nova projects have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/839355
Reason: The stable/train branches of nova projects have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova ussuri-eol

This issue was fixed in the openstack/nova ussuri-eol release.
