On two SR-IOV computes with Mellanox ConnectX-5 NICs, we can create SR-IOV VMs without problems.
When we create several of these SR-IOV VMs and start live migrating them, at some point we hit the error below:
2023-01-17 08:09:04.413 7 INFO nova.virt.libvirt.driver [req-f128d0fc-fab7-43e0-b5c3-7d039ed3122c 7280f3f5a7cd430f9ab5310b3e8acb27 6e24c3394ab14ec2823d991ff3bd4371 - default default] Attaching vif 26ab618c-186b-402e-b8d1-0c0f9e57d8cf to instance 37
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [req-f128d0fc-fab7-43e0-b5c3-7d039ed3122c 7280f3f5a7cd430f9ab5310b3e8acb27 6e24c3394ab14ec2823d991ff3bd4371 - default default] [instance: dc84de60-274b-4694-b73b-9aa237d9561b] attaching network adapter failed.: libvirtError: Requested operation is not valid: PCI device 0000:5e:05.6 is in use by driver QEMU, domain instance-0000002e
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] Traceback (most recent call last):
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2139, in attach_interface
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] guest.attach_device(cfg, persistent=True, live=live)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 305, in attach_device
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] self._domain.attachDeviceFlags(device_xml, flags=flags)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 190, in doit
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] result = proxy_call(self._autowrap, f, *args, **kwargs)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 148, in proxy_call
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] rv = execute(f, *args, **kwargs)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 129, in execute
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] six.reraise(c, e, tb)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] rv = meth(*args, **kwargs)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] File "/usr/lib64/python2.7/site-packages/libvirt.py", line 605, in attachDeviceFlags
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] if ret == -1: raise libvirtError ('virDomainAttachDeviceFlags() failed', dom=self)
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b] libvirtError: Requested operation is not valid: PCI device 0000:5e:05.6 is in use by driver QEMU, domain instance-0000002e
2023-01-17 08:09:04.433 7 ERROR nova.virt.libvirt.driver [instance: dc84de60-274b-4694-b73b-9aa237d9561b]
2023-01-17 08:09:04.437 7 ERROR nova.compute.manager [req-f128d0fc-fab7-43e0-b5c3-7d039ed3122c 7280f3f5a7cd430f9ab5310b3e8acb27 6e24c3394ab14ec2823d991ff3bd4371 - default default] [instance: dc84de60-274b-4694-b73b-9aa237d9561b] Unexpected error during post live migration at destination host.: InterfaceAttachFailed: Failed to attach network adapter device to dc84de60-274b-4694-b73b-9aa237d9561b
What seems to be happening is that, on two different SR-IOV computes, virtual functions with the same PCI address are in use by two VMs. When we migrate one of those VMs to the second compute, where the same PCI virtual function is already in use, we run into the PCI device conflict above.
The end result is pretty bad. On the target compute:
1. The VM which was originally running there is still running.
2. The migrated VM's libvirt domain is running, but the NIC based on the virtual function is not attached, so the VM has no connectivity.
In general, the problem seems to be that Nova does not check that the PCI devices (SR-IOV virtual functions) it assigns are unique across all SR-IOV capable computes, so more than one VM can end up with PCI devices at the same, conflicting address.
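The conflict can be illustrated with a minimal model. This is hypothetical illustration code, not Nova's implementation; the class and method names are invented. It shows why VF PCI addresses, which are only unique per host, collide when a migration blindly reuses the source host's address:

```python
# Hypothetical model of the failure mode: VF PCI addresses are only unique
# per host, so reusing the source address on the destination can collide.

class Compute:
    def __init__(self, name, vfs):
        self.name = name
        self.vfs = set(vfs)   # VF PCI addresses this host exposes
        self.in_use = {}      # pci_address -> libvirt domain using it

    def attach(self, domain, pci_address):
        # libvirt rejects a VF already claimed by another running domain
        if pci_address in self.in_use:
            raise RuntimeError(
                f"PCI device {pci_address} is in use by driver QEMU, "
                f"domain {self.in_use[pci_address]}")
        self.in_use[pci_address] = domain

# Both computes expose a VF at the same PCI address.
src = Compute("compute-1", {"0000:5e:05.6"})
dst = Compute("compute-2", {"0000:5e:05.6"})

src.attach("instance-0000002d", "0000:5e:05.6")  # VM on the source host
dst.attach("instance-0000002e", "0000:5e:05.6")  # another VM on the destination

# Buggy migration: reuse the source VF address on the destination host.
try:
    dst.attach("instance-0000002d", "0000:5e:05.6")
except RuntimeError as e:
    print(e)  # same conflict as the libvirtError in the log above
```

Under this model the third attach fails exactly like the log: the address is already held by the destination's own running domain.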
There is another bug with a similar symptom, https://bugs.launchpad.net/nova/+bug/1633120, but it appears to be a different problem.
What OpenStack version are you using?
What are the exact commands you use to i) create the VM ii) execute the live migration?
Nova should allocate a PCI device on the target host independently of which device was used on the source host during live migration.[1]
Marking this as INCOMPLETE until the questions are answered.
[1] https://github.com/openstack/nova/blob/d8b4b7bebdc0f55353cd99f372044b9e30315a6d/nova/compute/manager.py#L8336-L8340
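The expected behavior described above can be sketched as follows. This is a hedged sketch with invented names, not Nova's actual allocation code: the destination host picks any of its own free VFs rather than reusing the PCI address that the instance held on the source host.

```python
# Sketch (hypothetical, not Nova code) of host-local VF allocation: the
# destination claims any VF not held by a running domain on *this* host.

def claim_free_vf(host_vfs, in_use):
    """Return a VF PCI address that no local domain currently holds."""
    for addr in sorted(host_vfs):
        if addr not in in_use:
            return addr
    raise RuntimeError("no free VF on destination host")

# Destination exposes two VFs; 0000:5e:05.6 is held by a local VM already.
host_vfs = {"0000:5e:05.6", "0000:5e:05.7"}
in_use = {"0000:5e:05.6": "instance-0000002e"}

addr = claim_free_vf(host_vfs, in_use)
print(addr)  # 0000:5e:05.7 -- a host-local choice, so no conflict
```

Because the choice depends only on the destination's own free pool, the source host's PCI address never enters the decision and the collision from the log cannot occur.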