Looking at the libvirtd logs, I see the net1 device being added shortly before the detach:
2017-06-12 22:19:28.525+0000: 30400: debug : virThreadJobSet:96 : Thread 30400 (virNetServerHandleJob) is now running job remoteDispatchDomainAttachDeviceFlags
...
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:01 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:02 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:03 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:04 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:05 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:810 : Found free PCI slot 0000:00:06
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressReserveAddr:574 : Reserving PCI slot 0000:00:06.0 (multifunction='off')
...
2017-06-12 22:19:28.625+0000: 30400: debug : qemuMonitorAddDeviceWithFd:3063 : device=virtio-net-pci,netdev=hostnet1,id=net1,mac=fa:16:3e:3b:f6:dc,bus=pci.0,addr=0x6 fd=-1 fdname=<null>
...
2017-06-12 22:19:28.625+0000: 30400: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet1","id":"net1","mac":"fa:16:3e:3b:f6:dc","bus":"pci.0","addr":"0x6"},"id":"libvirt-19"}' for write with FD -1
Then we start detaching here:
2017-06-12 22:19:31.201+0000: 30402: debug : virThreadJobSet:96 : Thread 30402 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
2017-06-12 22:19:31.201+0000: 30402: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: modify (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)
...
2017-06-12 22:19:31.246+0000: 30402: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-21"}' for write with FD -1
...
2017-06-12 22:19:36.250+0000: 30402: debug : virThreadJobClear:121 : Thread 30402 (virNetServerHandleJob) finished job remoteDispatchDomainDetachDeviceFlags with ret=0
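Note what the log shows: the detach job finishes with ret=0 a few seconds after the device_del is sent, but that only means QEMU accepted the request. The device is truly gone only when the guest releases it and the DEVICE_DELETED event fires, which can take arbitrarily long. A toy simulation of that race (pure Python, no libvirt involved; timings are milliseconds here, not the real minutes):

```python
import threading
import time

# Stand-in for libvirt's DEVICE_DELETED event. device_del is only a
# request; the guest may take arbitrarily long to actually release
# the device before the event fires.
device_deleted = threading.Event()

def guest_releases_device(delay_s):
    time.sleep(delay_s)    # guest OS is slow to let go of the PCI device
    device_deleted.set()   # ... and only then does DEVICE_DELETED fire

threading.Thread(target=guest_releases_device, args=(0.3,)).start()

first_try = device_deleted.wait(timeout=0.1)   # caller's budget is too short
second_try = device_deleted.wait(timeout=1.0)  # the event does arrive later
print(first_try, second_try)  # False True
```

A caller that polls with a fixed budget (like nova's ~2 minute retry window) gives up even though the detach eventually succeeds, which is exactly the shape of the failure in these logs.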
And then right after that there appears to be another detach job:
2017-06-12 22:19:36.255+0000: 30401: debug : virThreadJobSet:96 : Thread 30401 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
2017-06-12 22:19:36.255+0000: 30401: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: modify (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)
2017-06-12 22:19:36.255+0000: 30401: debug : qemuDomainObjBeginJobInternal:3309 : Started job: modify (async=none vm=0x7fa8cc003350 name=instance-00000077)
...
2017-06-12 22:19:36.257+0000: 30401: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-23"}' for write with FD -1
That could be from when we start the retry loop doing detaches on what we think is the transient domain.
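That retry loop boils down to something like the following. This is a simplified sketch of the pattern, not nova's actual code: the real detach_device_with_retry also distinguishes persistent vs. live domain config and inspects specific libvirt error codes, all elided here.

```python
import time

class DeviceDetachFailed(Exception):
    pass

def detach_device_with_retry(detach, device_gone, max_retries=7, inc_sleep=0.002):
    """Keep re-issuing an (asynchronous) detach until the device is gone.

    `detach` re-sends the detach request; `device_gone` checks whether the
    guest has actually released the device. `inc_sleep` is tiny here so
    the sketch runs fast; the real schedule is in seconds.
    """
    for attempt in range(1, max_retries + 1):
        detach()                         # ask libvirt/qemu to remove the device
        if device_gone():                # did the guest actually release it?
            return attempt
        time.sleep(inc_sleep * attempt)  # incremental backoff: 1x, 2x, 3x, ...
    raise DeviceDetachFailed(
        "device still present after %d attempts" % max_retries)

# Simulate a guest that releases the device on the third check.
checks = []
def device_gone():
    checks.append(1)
    return len(checks) >= 3

attempts = detach_device_with_retry(detach=lambda: None, device_gone=device_gone)
print(attempts)  # 3
```

The key property is that each pass re-sends the detach, so a slow guest sees repeated device_del requests, which matches the duplicate jobs in the libvirtd log above.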
Eventually I finally see this:
2017-06-12 22:24:17.842+0000: 30399: debug : qemuProcessHandleDeviceDeleted:1351 : Device net1 removed from domain 0x7fa8cc003350 instance-00000077
...
2017-06-12 22:24:17.842+0000: 1366: debug : processDeviceDeletedEvent:4074 : Removing device net1 from domain 0x7fa8cc003350 instance-00000077
...
2017-06-12 22:24:17.842+0000: 1366: debug : qemuDomainRemoveNetDevice:3844 : Removing network interface net1 from domain 0x7fa8cc003350 instance-00000077
Shortly after that, the destroy command comes in to shut down and destroy the guest:
2017-06-12 22:24:33.335+0000: 30402: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: destroy (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)
So before the destroy, it looks like it took ~5 minutes for the net1 device to actually be removed, while we timed out our retry loop at 22:21:12.373317.
Our initial detach from nova was at:
Jun 12 22:19:31.200463 ubuntu-xenial-internap-mtl01-9270284 nova-compute[23597]: DEBUG nova.virt.libvirt.guest [None req-fd8ab969-7349-4740-89de-dd5621ee2154 tempest-TestNetworkBasicOps-1239760647 tempest-TestNetworkBasicOps-1239760647] Attempting initial detach for device tapb2bdb0c0-62 {{(pid=23597) detach_device_with_retry /opt/stack/new/nova/nova/virt/libvirt/guest.py:433}}
Which aligns with the libvirtd log:
2017-06-12 22:19:31.201+0000: 30402: debug : virThreadJobSet:96 : Thread 30402 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
And net1 isn't removed until 2017-06-12 22:24:17.842+0000, but we timed out at:
Jun 12 22:21:12.373317 ubuntu-xenial-internap-mtl01-9270284 nova-compute[23597]: WARNING nova.virt.libvirt.driver [None req-fd8ab969-7349-4740-89de-dd5621ee2154 tempest-TestNetworkBasicOps-1239760647 tempest-TestNetworkBasicOps-1239760647] [instance: fb351e95-745f-4c40-bb58-d122353e35f4] Failed to detach interface tapb2bdb0c0-62 after repeated attempts. Final interface xml:
So nova retried and waited for ~2 minutes, but it took ~5 minutes for the device to be gone.
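Those two gaps are easy to confirm with plain datetime arithmetic on the timestamps quoted above:

```python
from datetime import datetime

fmt = "%H:%M:%S.%f"
detach_start = datetime.strptime("22:19:31.201000", fmt)  # first device_del sent
nova_timeout = datetime.strptime("22:21:12.373317", fmt)  # nova gives up retrying
device_gone  = datetime.strptime("22:24:17.842000", fmt)  # DEVICE_DELETED arrives

retry_window = (nova_timeout - detach_start).total_seconds()
removal_time = (device_gone - detach_start).total_seconds()
print(retry_window)  # ~101 seconds of retrying (~2 minutes)
print(removal_time)  # ~287 seconds until the device was gone (~5 minutes)
```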
This is the code nova is using:
https://github.com/openstack/nova/blob/a2dd78580ec5bcc7a73924cde46a53afe2c09348/nova/virt/libvirt/guest.py#L374
Calling the RetryDecorator:
http://git.openstack.org/cgit/openstack/oslo.service/tree/oslo_service/loopingcall.py#n345
Looking at that, it should retry 7 times with an incremental sleep starting at 2 seconds, so 2 + 4 + 6 + 8 + 10 + 12 + 14 = 56 seconds of total sleep time.
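That schedule can be sanity-checked in one line. This is only a sketch of the arithmetic implied above, with illustrative parameter names, not the actual oslo.service implementation:

```python
# Incremental backoff schedule like the one nova's detach_device_with_retry
# gets from oslo.service's RetryDecorator. Parameter names are illustrative.
def incremental_sleeps(max_retries=7, inc_sleep_time=2):
    """Return the sleep (in seconds) taken before each retry."""
    return [inc_sleep_time * (n + 1) for n in range(max_retries)]

sleeps = incremental_sleeps()
print(sleeps)       # [2, 4, 6, 8, 10, 12, 14]
print(sum(sleeps))  # 56
```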
So it really looks like a problem with the underlying device type and/or with libvirt/qemu, given how long it takes for the device to actually be removed from the guest.