Nova fails to live-migrate SR-IOV VM with error "libvirtError: Requested operation is not valid: PCI device 0000:5e:05.6 is in use by driver QEMU, domain instance-....."

Bug #2003253 reported by Radoslaw Smigielski
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

On two SR-IOV computes with Mellanox ConnectX-5 NICs, we can create SR-IOV VMs with no problems.
When we create several of these SR-IOV VMs and start live-migrating them, at some point we hit the error below:

2023-01-17 08:09:04.413 7 INFO nova.virt.libvirt.driver [req-f128d0fc-fab7-43e0-b5c3-7d039ed3122c 7280f3f5a7cd430f9ab5310b3e8acb27 6e24c3394ab14ec2823d991ff3bd4371 - default default] Attaching vif 26ab618c-186b-402e-b8d1-0c0f9e57d8cf to instance 37
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [req-f128d0fc-fab7-43e0-b5c3-7d039ed3122c 7280f3f5a7cd430f9ab5310b3e8acb27 6e24c3394ab14ec2823d991ff3bd4371 - default default] [instance: dc84de60-274b-4694-b73b-9aa237d9561b] attaching network adapter failed.: libvirtError: Requested operation is not valid: PCI device 0000:5e:05.6 is in use by driver QEMU, domain instance-0000002e
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] Traceback (most recent call last):
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2139, in attach_interface
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] guest.attach_device(cfg, persistent=True, live=live)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 305, in attach_device
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] self._domain.attachDeviceFlags(device_xml, flags=flags)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 190, in doit
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] result = proxy_call(self._autowrap, f, *args, **kwargs)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 148, in proxy_call
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] rv = execute(f, *args, **kwargs)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 129, in execute
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] six.reraise(c, e, tb)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] rv = meth(*args, **kwargs)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib64/python2.7/site-packages/libvirt.py", line 605, in attachDeviceFlags
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] if ret == -1: raise libvirtError ('virDomainAttachDeviceFlags() failed', dom=self)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] libvirtError: Requested operation is not valid: PCI device 0000:5e:05.6 is in use by driver QEMU, domain instance-0000002e
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b]
2023-01-17 08:09:04.437 7 ERROR nova.compute.manager [req-f128d0fc-fab7-43e0-b5c3-7d039ed3122c 7280f3f5a7cd430f9ab5310b3e8acb27 6e24c3394ab14ec2823d991ff3bd4371 - default default] [instance: dc84de60-274b-4694-b73b-9aa237d9561b] Unexpected error during post live migration at destination host.: InterfaceAttachFailed: Failed to attach network adapter device to dc84de60-274b-4694-b73b-9aa237d9561b

What seems to be happening is that, on two different SR-IOV computes, virtual functions with the same PCI address are in use by two VMs. When we migrate one of the VMs to the second compute, where the same PCI virtual function is already in use, we run into the PCI device conflict above.

The end result is pretty bad. On the target compute:
1. The VM that was originally running there is still running.
2. The migrated VM's libvirt domain is running, but the NIC backed by the virtual function is not attached, so the VM has no connectivity.

In general, the problem seems to be that Nova does not check that PCI devices (SR-IOV virtual functions) are unique across all SR-IOV-capable computes, so more than one VM can be assigned PCI devices with the same, conflicting address.
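To make the suspected failure mode concrete, here is a toy model (not Nova code; all class and instance names are made up). Each host has its own VF pool, so identical PCI addresses on two hosts are normal; the collision only happens if a migration reuses the source address verbatim instead of claiming a free VF on the destination:

```python
class Host:
    """Toy per-host VF pool; PCI addresses are only unique within one host."""

    def __init__(self, name, vf_addresses):
        self.name = name
        self.free = set(vf_addresses)
        self.used = {}  # pci address -> instance name

    def claim(self, instance, address=None):
        # address=None models the correct behaviour: pick any free VF on
        # this host. Passing the source address models the buggy reuse.
        if address is None:
            address = sorted(self.free)[0]
        if address not in self.free:
            raise RuntimeError(
                "PCI device %s is in use by domain %s"
                % (address, self.used.get(address, "unknown")))
        self.free.remove(address)
        self.used[address] = instance
        return address


# Two computes expose VFs with overlapping PCI addresses -- that is normal.
src = Host("compute-1", {"0000:5e:05.6", "0000:5e:05.7"})
dst = Host("compute-2", {"0000:5e:05.6", "0000:5e:05.7"})

src.claim("instance-0000002d", "0000:5e:05.6")  # VM on the source
dst.claim("instance-0000002e", "0000:5e:05.6")  # unrelated VM on the target

# Live migration must re-claim on the destination, not reuse the address.
# Reusing "0000:5e:05.6" here would raise the "in use by domain" error.
new_addr = dst.claim("instance-0000002d")  # picks the free "0000:5e:05.7"
```

In this model the migrated instance lands on a different, free VF address on the destination, which is what the per-host allocation is supposed to guarantee.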

There is another bug with a similar symptom, https://bugs.launchpad.net/nova/+bug/1633120, but it appears to be a different problem.
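For anyone trying to confirm which libvirt domain holds the conflicting VF on a compute, the PCI address appears in the domain XML as a pci-type <hostdev> entry (e.g. in the output of `virsh dumpxml <domain>`). Below is a minimal parsing sketch; the XML fragment is hand-written in the shape libvirt uses, not output captured from the affected hosts:

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the shape libvirt emits for a passed-through VF.
DOMAIN_XML = """
<domain type='kvm'>
  <name>instance-0000002e</name>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x5e' slot='0x05' function='0x6'/>
      </source>
    </hostdev>
  </devices>
</domain>
"""


def hostdev_pci_addresses(domain_xml):
    """Return the PCI addresses of all pci-type hostdevs in a domain XML."""
    root = ET.fromstring(domain_xml)
    addrs = []
    for a in root.findall("./devices/hostdev[@type='pci']/source/address"):
        addrs.append("%04x:%02x:%02x.%x" % (
            int(a.get("domain"), 16), int(a.get("bus"), 16),
            int(a.get("slot"), 16), int(a.get("function"), 16)))
    return addrs


print(hostdev_pci_addresses(DOMAIN_XML))  # ['0000:5e:05.6']
```

Running this across all domains on both computes would show the same VF address, 0000:5e:05.6, claimed by a domain on each host, matching the error in the log.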

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

What OpenStack version are you using?
What are the exact commands you use to i) create the VM ii) execute the live migration?

Nova should allocate a PCI device on the target host independently of which device was used on the source host during live migration.[1]

Marking this as INCOMPLETE until the questions are answered.

[1] https://github.com/openstack/nova/blob/d8b4b7bebdc0f55353cd99f372044b9e30315a6d/nova/compute/manager.py#L8336-L8340

Changed in nova:
status: New → Incomplete
tags: added: live-migration pci
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired