Port update exception on nova unshelve for instance with PCI devices (part 2)

Bug #1851545 reported by David Vallee Delisle
Affects                    Status         Importance  Assigned to  Milestone
OpenStack Compute (nova)   Fix Released   Medium      Unassigned
Wallaby                    Fix Committed  Undecided   Unassigned

Bug Description

Description
===========
When unshelving an instance with PCI devices, if another instance is already using the PCI device(s) that the shelved instance was originally allocated, nova-compute raises an exception.

Steps to reproduce
==================
- Create an instance with an SR-IOV port
- Shelve the instance
- Unshelve the instance onto a compute node where the same PCI device(s) are already in use

Expected result
===============
Nova should recalculate the PCI mapping so the instance uses newly allocated, free PCI device(s).

Actual result
=============
Nova compute fails with this traceback [a].

This analysis was done while testing Newton, but the same problem exists on supported upstream releases, at least up to Queens.

- When we have a failure, we see "Updating port 991cbd39-47f7-4cab-bf65-0c19a920a718 with attributes {'binding:host_id': 'xxx'}", which brings us here [1]
- When we look below [2], we see that the PCI devices are never recalculated and the profile is never updated with new devices when we unshelve, because this only happens in the case of a migration (see the sketch after the links below)
- That brings us back to this commit [3] and this upstream bug [4]
- I would assume that if we simply remove the "migration is not None" test, we would fail with bug [4], because the pci_mapping is obtained from a migration object

Now I'm not sure how to generate the pci_mapping without a migration object/context.

[1] https://github.com/openstack/nova/blob/newton-eol/nova/network/neutronv2/api.py#L2405-L2411
[2] https://github.com/openstack/nova/blob/newton-eol/nova/network/neutronv2/api.py#L2417-L2418
[3] https://github.com/openstack/nova/commit/70c1eb689ad174b61ad915ae5384778bd536c16c
[4] https://bugs.launchpad.net/nova/+bug/1677621/
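
For reference, here is a heavily simplified paraphrase of the logic at [1][2] (a sketch, not verbatim nova code; instance_ports(), port_has_pci_binding(), remap_profile() and update_port() are hypothetical stand-ins). It shows why the unshelve path never refreshes pci_slot: the PCI mapping is only built when a migration object is passed.

~~~
# Paraphrased sketch of _update_port_binding_for_instance() [1][2].
# NOT verbatim nova code; the helpers are hypothetical stand-ins.
def _update_port_binding_for_instance(context, instance, host,
                                      migration=None):
    pci_mapping = None
    for port in instance_ports(context, instance):
        updates = {'binding:host_id': host}   # the log line quoted above [1]
        if port_has_pci_binding(port):
            # The stale pci_slot is only rewritten for migrations [2]:
            if migration is not None:         # never true on unshelve
                if pci_mapping is None:
                    pci_mapping = get_pci_mapping_for_migration(
                        context, instance, migration)
                updates['binding:profile'] = remap_profile(port, pci_mapping)
        update_port(context, port['id'], updates)
~~~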

Logs & Configs
==============

[a]
~~~
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] Traceback (most recent call last):
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4386, in _unshelve_instance
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] block_device_info=block_device_info)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2742, in spawn
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] destroy_disks_on_failure=True)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5121, in _create_domain_and_network
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] destroy_disks_on_failure)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] self.force_reraise()
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] six.reraise(self.type_, self.value, self.tb)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5093, in _create_domain_and_network
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] post_xml_callback=post_xml_callback)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5011, in _create_domain
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] guest.launch(pause=pause)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 144, in launch
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] self._encoded_xml, errors='ignore')
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] self.force_reraise()
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] six.reraise(self.type_, self.value, self.tb)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 139, in launch
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] return self._domain.createWithFlags(flags)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] result = proxy_call(self._autowrap, f, *args, **kwargs)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] rv = execute(f, *args, **kwargs)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] six.reraise(c, e, tb)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] rv = meth(*args, **kwargs)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
nova-compute.log:2019-10-31 20:31:48.216 680184 ERROR nova.compute.manager [instance: 4fd6c244-238c-4e75-a856-3713163f4d17] libvirtError: Requested operation is not valid: PCI device 0000:5d:17.6 is in use by driver QEMU, domain instance-000024b0
~~~

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Note: this is a long-standing bug that affects Queens, so it is not a new regression;
as a result I am triaging this as medium rather than high.
I looked at this downstream a few months ago when it was first found, and
my first impression is that we have a sequencing issue: either we update the port binding profile
after we have generated the XML on the destination host, or we use the wrong PCI address in the port profile when
we bind the port to the destination host.

When we unshelve with a neutron SR-IOV device, the port binding profile will initially still contain the PCI address of the device that was used on the source host, in the pci_slot key of the neutron port binding profile. When we generate the XML, we use that address for the port's PCI device. When we unshelve, we claim a PCI device in the PCI manager, but we apparently are not updating the port binding correctly. An example of such a stale binding is sketched below.
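
To illustrate, the stale binding on the unshelved port would look roughly like this (a made-up example reusing the port UUID and PCI address from this report; the vendor info and physnet are placeholders):

~~~
# Illustrative only: a neutron SR-IOV port after unshelve. The host
# binding is updated, but binding:profile still carries the source
# host's VF address, which is already consumed on the destination.
stale_port = {
    'id': '991cbd39-47f7-4cab-bf65-0c19a920a718',
    'binding:host_id': 'dest-compute',           # updated on unshelve
    'binding:vnic_type': 'direct',
    'binding:profile': {
        'pci_vendor_info': '8086:154c',          # placeholder vendor:product
        'pci_slot': '0000:5d:17.6',              # stale source-host VF
        'physical_network': 'sriovnet1',         # placeholder physnet
    },
}
~~~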

We have a similar but different bug affecting live and cold migration with neutron direct-physical ports, since we also don't correctly handle the fact that the MAC address changes on migration. I suspect that might also be the case for unshelve, but I have not tried it; if I get time to reproduce this, I will test for that edge case.

Changed in nova:
importance: Undecided → Medium
status: New → Triaged
tags: added: network neutron pci sriov
tags: added: shelve
Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/783084
Committed: https://opendev.org/openstack/nova/commit/bea06123dbba82851fd17a41cc93ab4a519f8bfe
Submitter: "Zuul (22348)"
Branch: master

commit bea06123dbba82851fd17a41cc93ab4a519f8bfe
Author: Artom Lifshitz <email address hidden>
Date: Thu Mar 25 15:21:59 2021 -0400

    Test SRIOV port move operations with PCI conflicts

    This patch tests cold migration, unshelve and evacuate in a situation
    where the existing port binding's pci_slot would cause a conflict on
    the destination compute node. While cold migration and evacuation
    work correctly, in the unshelve case the pci_slot is not updated,
    resulting in two instances attempting to consume the same PCI device.
    This "passed" in the functional tests, but with a real libvirt this
    would obviously explode.

    Related: bug 1851545

    Change-Id: Ib81532dc1e6dd85822e38eb1785ffb7162d2a84d
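
Schematically, the unshelve scenario this commit exercises looks like the following (a sketch with hypothetical helpers, not nova's actual functional test API):

~~~
# Sketch of the conflicting-pci_slot scenario; all helpers hypothetical.
def test_unshelve_with_pci_slot_conflict():
    # Both computes expose a VF at the same PCI address.
    start_compute('host1')
    start_compute('host2')
    server = boot_server_with_sriov_port(host='host1')
    shelve(server)
    # Consume the matching VF on the destination before unshelving.
    other = boot_server_with_sriov_port(host='host2')
    unshelve(server, expected_host='host2')
    port = show_port(server)
    # Without the fix, the stale pci_slot collides with the other
    # instance's VF; a real libvirt fails as in the bug's traceback.
    assert port['binding:profile']['pci_slot'] != pci_slot_of(other)
~~~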

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/784168
Committed: https://opendev.org/openstack/nova/commit/00f1d4757e503bb9807d7a8d7035c061a97db983
Submitter: "Zuul (22348)"
Branch: master

commit 00f1d4757e503bb9807d7a8d7035c061a97db983
Author: Artom Lifshitz <email address hidden>
Date: Wed Mar 31 16:57:35 2021 -0400

    Update SRIOV port pci_slot when unshelving

    There are a few things we need to do to make that work:

    * Always set the PCIRequest's requester_id. Previously, this was only
      done for resource requests. The requester_id is the port UUID, so we
      can use that to correlate which port to update with which pci_slot
      (in the case of multiple SRIOV ports per instance).

      This has the side effect of making the fix work only for instances
      created *after* this patch has been applied. It's not ideal, but
      there does not appear to be a better way.

    * Call setup_networks_on_host() within the instance_claim context.
      This means the instance's pci_devices are updated when we call it,
      allowing us to get the pci_slot information from them.

    With the two previous changes in place, we can figure out the port's
    new pci_slot in _update_port_binding_for_instance().

    Closes: bug 1851545
    Change-Id: Icfa8c1d6e84eab758af6223a2870078685584aaa
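
Schematically, the correlation the commit message describes could be expressed as follows (a sketch, not the literal nova change):

~~~
# Sketch (not the literal nova change): requester_id on each PCI
# request is the port UUID, and each claimed PciDevice records the
# request it satisfied via request_id, so the new address for a given
# port can be looked up after the instance claim.
def new_pci_slot_for_port(instance, port_id):
    for request in instance.pci_requests.requests:
        if request.requester_id == port_id:
            for dev in instance.pci_devices:
                if dev.request_id == request.request_id:
                    return dev.address   # becomes the port's new pci_slot
    return None  # e.g. instance created before the fix: no requester_id
~~~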

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/790710

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/790711

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/790710
Committed: https://opendev.org/openstack/nova/commit/3625d5336a220ead291335eeb06bc1be7e32a21a
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 3625d5336a220ead291335eeb06bc1be7e32a21a
Author: Artom Lifshitz <email address hidden>
Date: Thu Mar 25 15:21:59 2021 -0400

    Test SRIOV port move operations with PCI conflicts

    This patch tests cold migration, unshelve and evacuate in a situation
    where the existing port binding's pci_slot would cause a conflict on
    the destination compute node. While cold migration and evacuation
    work correctly, in the unshelve case the pci_slot is not updated,
    resulting in two instances attempting to consume the same PCI device.
    This "passed" in the functional tests, but with a real libvirt this
    would obviously explode.

    Related: bug 1851545

    Change-Id: Ib81532dc1e6dd85822e38eb1785ffb7162d2a84d
    (cherry picked from commit bea06123dbba82851fd17a41cc93ab4a519f8bfe)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/796908

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/796909

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/790711
Committed: https://opendev.org/openstack/nova/commit/bf7254b794f2296cdb701a21abb9e5708c951542
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit bf7254b794f2296cdb701a21abb9e5708c951542
Author: Artom Lifshitz <email address hidden>
Date: Wed Mar 31 16:57:35 2021 -0400

    Update SRIOV port pci_slot when unshelving

    There are a few things we need to do to make that work:

    * Always set the PCIRequest's requester_id. Previously, this was only
      done for resource requests. The requester_id is the port UUID, so we
      can use that to correlate which port to update with which pci_slot
      (in the case of multiple SRIOV ports per instance).

      This has the side effect of making the fix work only for instances
      created *after* this patch has been applied. It's not ideal, but
      there does not appear to be a better way.

    * Call setup_networks_on_host() within the instance_claim context.
      This means the instance's pci_devices are updated when we call it,
      allowing us to get the pci_slot information from them.

    With the two previous changes in place, we can figure out the port's
    new pci_slot in _update_port_binding_for_instance().

    Closes: bug 1851545
    Change-Id: Icfa8c1d6e84eab758af6223a2870078685584aaa
    (cherry picked from commit 00f1d4757e503bb9807d7a8d7035c061a97db983)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/victoria)

Change abandoned by "Artom Lifshitz <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/796908

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Artom Lifshitz <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/796909
