CPUUnpinningUnknown exception thrown after failed Live Migration for instance with dedicated CPUs

Bug #1982497 reported by Balazs Gibizer
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

The instance cannot be deleted after a failed live migration as delete fails with nova.exception.CPUUnpinningInvalid: CPU set to unpin [2, 3] must be a subset of pinned CPU set [0, 1]

Steps to reproduce
------------------
1) Create a multinode devstack with cpu_dedicated_set configured asymmetrically: 0,1 on host_a and 2,3 on host_b.
2) Boot an instance on host_a with two dedicated CPUs. It will occupy CPUs 0,1.
3) Break live migration, i.e. prevent host_a from communicating with host_b.
4) Live migrate the instance. Nova will claim CPUs 2,3 on host_b.
5) Observe that the live migration failed and was rolled back. The instance is running on host_a.
6) Try to delete the instance. It will fail because nova tries to unpin CPUs 2,3 instead of CPUs 0,1 on host_a.

2022-07-21 15:35:32,229 ERROR [nova.compute.manager] Setting instance vm_state to ERROR
Traceback (most recent call last):
  File "/build-bionic/nova/compute/manager.py", line 3060, in do_terminate_instance
    self._delete_instance(context, instance, bdms)
  File "/build-bionic/nova/compute/manager.py", line 3024, in _delete_instance
    self._complete_deletion(context, instance)
  File "/build-bionic/nova/compute/manager.py", line 828, in _complete_deletion
    self._update_resource_tracker(context, instance)
  File "/build-bionic/nova/compute/manager.py", line 596, in _update_resource_tracker
    self.rt.update_usage(context, instance, instance.node)
  File "/build-bionic/.tox/functional-py38/lib/python3.8/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
  File "/build-bionic/nova/compute/resource_tracker.py", line 656, in update_usage
    self._update_usage_from_instance(context, instance, nodename)
  File "/build-bionic/nova/compute/resource_tracker.py", line 1491, in _update_usage_from_instance
    self._update_usage(self._get_usage_dict(instance, instance),
  File "/build-bionic/nova/compute/resource_tracker.py", line 1295, in _update_usage
    cn.numa_topology = hardware.numa_usage_from_instance_numa(
  File "/build-bionic/nova/virt/hardware.py", line 2374, in numa_usage_from_instance_numa
    new_cell.unpin_cpus(pinned_cpus)
  File "/build-bionic/nova/objects/numa.py", line 106, in unpin_cpus
    raise exception.CPUUnpinningInvalid(requested=list(cpus),
nova.exception.CPUUnpinningInvalid: CPU set to unpin [2, 3] must be a subset of pinned CPU set [0, 1]
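
The failing check at the bottom of the traceback can be illustrated with a minimal sketch. The PinnedCell class below is hypothetical and heavily simplified; it is not nova's NUMACell object and it raises a plain ValueError rather than a nova exception. It only assumes that unpinning requires the requested CPUs to be a subset of the currently pinned CPUs, which is what the error message states.

```python
class PinnedCell:
    """Hypothetical, simplified model of a host cell's CPU pinning state."""

    def __init__(self, dedicated_cpus):
        self.dedicated_cpus = set(dedicated_cpus)
        self.pinned_cpus = set()

    def pin(self, cpus):
        cpus = set(cpus)
        # Only free dedicated CPUs can be pinned.
        if not cpus <= self.dedicated_cpus - self.pinned_cpus:
            raise ValueError(f"cannot pin {sorted(cpus)}")
        self.pinned_cpus |= cpus

    def unpin(self, cpus):
        cpus = set(cpus)
        # Unpinning is only valid for CPUs that are currently pinned.
        if not cpus <= self.pinned_cpus:
            raise ValueError(
                f"CPU set to unpin {sorted(cpus)} must be a subset of "
                f"pinned CPU set {sorted(self.pinned_cpus)}"
            )
        self.pinned_cpus -= cpus


# host_a tracks the running instance on CPUs 0 and 1.
host_a = PinnedCell([0, 1])
host_a.pin([0, 1])

# After the failed migration the instance record carries the destination
# pinning (2, 3), so the delete path asks host_a to unpin the wrong CPUs.
stale_instance_cpus = [2, 3]
try:
    host_a.unpin(stale_instance_cpus)
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

In this model the host's view stays at CPUs 0,1 while the instance record asks for 2,3, so the unpin is rejected and the host accounting is left untouched.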

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/850672

tags: added: numa
tags: added: race-condition
Balazs Gibizer (balazs-gibizer) wrote :

This is caused by a race condition between the rollback_live_migration_at_destination and drop_move_claim_at_destination RPC methods, both of which run on the destination during the rollback of the live migration. rollback_live_migration_at_destination is an RPC cast, so it can run _after_ drop_move_claim_at_destination, which is an RPC call, has run. The rollback_live_migration_at_destination RPC temporarily applies the migration context [1] and calls instance.save during libvirt/driver._cleanup() [2][3]. If this happens as the last step of the rollback, the instance NUMA topology will point to the dest host even though the instance is running on the source host.

[1] https://github.com/openstack/nova/blob/bcb96f362ab12e297f125daa5189fb66345b4976/nova/compute/manager.py#L9400-L9403
[2] https://github.com/openstack/nova/blob/bcb96f362ab12e297f125daa5189fb66345b4976/nova/virt/libvirt/driver.py#L10449
[3] https://github.com/openstack/nova/blob/bcb96f362ab12e297f125daa5189fb66345b4976/nova/virt/libvirt/driver.py#L1674
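
The ordering problem described above can be sketched as a small, deterministic simulation. Everything here is an assumption made for illustration: FakeInstance, the dict used as a database, and finish_rollback_on_source (a stand-in for whatever step persists the source-side topology) are hypothetical and do not correspond to nova's objects or RPC methods. The sketch only shows that a save made while a temporary context is applied persists the temporary values, so whichever save runs last wins.

```python
from contextlib import contextmanager

SOURCE_CPUS = {0, 1}  # pinned on host_a
DEST_CPUS = {2, 3}    # claimed on host_b during the migration


class FakeInstance:
    """Hypothetical stand-in for an instance record backed by a dict 'DB'."""

    def __init__(self, pinned_cpus, db):
        self.pinned_cpus = set(pinned_cpus)
        self.db = db
        self.save()

    def save(self):
        # Persist whatever pinning is currently set in memory.
        self.db["pinned_cpus"] = set(self.pinned_cpus)


@contextmanager
def temporary_migration_context(instance, dest_cpus):
    # Apply the destination pinning only for the duration of the block.
    original = instance.pinned_cpus
    instance.pinned_cpus = set(dest_cpus)
    try:
        yield
    finally:
        instance.pinned_cpus = original


def rollback_at_destination(instance):
    # Models the cast: cleanup saves the instance while the destination
    # values are temporarily applied.
    with temporary_migration_context(instance, DEST_CPUS):
        instance.save()


def finish_rollback_on_source(instance):
    # Hypothetical step that persists the source-side pinning.
    instance.pinned_cpus = set(SOURCE_CPUS)
    instance.save()


def persisted_after(order):
    db = {}
    instance = FakeInstance(SOURCE_CPUS, db)
    for step in order:
        step(instance)
    return db["pinned_cpus"]


cast_first = persisted_after([rollback_at_destination, finish_rollback_on_source])
cast_last = persisted_after([finish_rollback_on_source, rollback_at_destination])
print(sorted(cast_first), sorted(cast_last))
```

When the cast-like step runs last, the persisted record keeps the destination CPUs even though the in-memory object was restored, which matches the symptom seen on delete.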

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/850746

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/851832

Changed in nova:
status: New → In Progress