Comment 16 for bug 1696125

Matt Riedemann (mriedem) wrote : Re: Detach interface failed - Unable to detach from guest transient domain (pike)

Looking at the libvirtd logs, I see the net1 device being added shortly before the detach:

2017-06-12 22:19:28.525+0000: 30400: debug : virThreadJobSet:96 : Thread 30400 (virNetServerHandleJob) is now running job remoteDispatchDomainAttachDeviceFlags
...
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:01 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:02 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:03 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:04 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:05 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:810 : Found free PCI slot 0000:00:06
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressReserveAddr:574 : Reserving PCI slot 0000:00:06.0 (multifunction='off')
...
2017-06-12 22:19:28.625+0000: 30400: debug : qemuMonitorAddDeviceWithFd:3063 : device=virtio-net-pci,netdev=hostnet1,id=net1,mac=fa:16:3e:3b:f6:dc,bus=pci.0,addr=0x6 fd=-1 fdname=<null>
...
2017-06-12 22:19:28.625+0000: 30400: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet1","id":"net1","mac":"fa:16:3e:3b:f6:dc","bus":"pci.0","addr":"0x6"},"id":"libvirt-19"}' for write with FD -1

--

Then we start detaching here:

2017-06-12 22:19:31.201+0000: 30402: debug : virThreadJobSet:96 : Thread 30402 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
2017-06-12 22:19:31.201+0000: 30402: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: modify (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)
...
2017-06-12 22:19:31.246+0000: 30402: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-21"}' for write with FD -1
...
2017-06-12 22:19:36.250+0000: 30402: debug : virThreadJobClear:121 : Thread 30402 (virNetServerHandleJob) finished job remoteDispatchDomainDetachDeviceFlags with ret=0
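For context, a remoteDispatchDomainDetachDeviceFlags job is libvirtd handling a client-side detachDeviceFlags call. A rough illustration of that call via python-libvirt (the interface XML here is a made-up example, not the actual XML nova built for tapb2bdb0c0-62):

import libvirt

# Example interface XML -- illustrative only, not the real tapb2bdb0c0-62 definition.
interface_xml = """
<interface type='bridge'>
  <mac address='fa:16:3e:3b:f6:dc'/>
  <source bridge='br-example'/>
  <model type='virtio'/>
</interface>
"""

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-00000077')

# Shows up in the libvirtd log as a remoteDispatchDomainDetachDeviceFlags job;
# for a running guest it becomes the 'device_del' monitor command seen above.
# Nova also passes VIR_DOMAIN_AFFECT_CONFIG when the domain has a persistent config.
dom.detachDeviceFlags(interface_xml, libvirt.VIR_DOMAIN_AFFECT_LIVE)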

And then right after that, it looks like there is another detach job:

2017-06-12 22:19:36.255+0000: 30401: debug : virThreadJobSet:96 : Thread 30401 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
2017-06-12 22:19:36.255+0000: 30401: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: modify (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)
2017-06-12 22:19:36.255+0000: 30401: debug : qemuDomainObjBeginJobInternal:3309 : Started job: modify (async=none vm=0x7fa8cc003350 name=instance-00000077)
...
2017-06-12 22:19:36.257+0000: 30401: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-23"}' for write with FD -1

That could be from when we start the retry loop doing detaches on what we think is the transient domain.
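For reference, the detach-with-retry pattern nova uses looks roughly like this (a simplified sketch, not the actual guest.py code; detach_fn and device_still_present are hypothetical stand-ins for the real helpers, which re-read the domain XML to check whether the device is still attached):

import time

def detach_with_retry(detach_fn, device_still_present,
                      max_retry_count=7, inc_sleep_time=2):
    # Initial detach against the domain (persistent config and/or live guest).
    detach_fn()
    # Then keep re-issuing the detach against the live (possibly transient)
    # domain, sleeping a bit longer before each attempt, until the device
    # is gone from the domain XML or we run out of retries.
    for attempt in range(1, max_retry_count + 1):
        if not device_still_present():
            return
        time.sleep(inc_sleep_time * attempt)
        detach_fn()
    if device_still_present():
        raise RuntimeError('Failed to detach device after repeated attempts')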

Eventually I see this:

2017-06-12 22:24:17.842+0000: 30399: debug : qemuProcessHandleDeviceDeleted:1351 : Device net1 removed from domain 0x7fa8cc003350 instance-00000077
...
2017-06-12 22:24:17.842+0000: 1366: debug : processDeviceDeletedEvent:4074 : Removing device net1 from domain 0x7fa8cc003350 instance-00000077
...
2017-06-12 22:24:17.842+0000: 1366: debug : qemuDomainRemoveNetDevice:3844 : Removing network interface net1 from domain 0x7fa8cc003350 instance-00000077

Shortly after that, the destroy command comes in to shut down and destroy the guest:

2017-06-12 22:24:33.335+0000: 30402: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: destroy (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)

So before that, it looks like it took ~5 minutes for the net1 device to be removed. And we timed out our retry loop at 22:21:12.373317.

Our initial detach from nova was at:

Jun 12 22:19:31.200463 ubuntu-xenial-internap-mtl01-9270284 nova-compute[23597]: DEBUG nova.virt.libvirt.guest [None req-fd8ab969-7349-4740-89de-dd5621ee2154 tempest-TestNetworkBasicOps-1239760647 tempest-TestNetworkBasicOps-1239760647] Attempting initial detach for device tapb2bdb0c0-62 {{(pid=23597) detach_device_with_retry /opt/stack/new/nova/nova/virt/libvirt/guest.py:433}}

Which aligns with the libvirtd log:

2017-06-12 22:19:31.201+0000: 30402: debug : virThreadJobSet:96 : Thread 30402 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags

And net1 isn't removed until 2017-06-12 22:24:17.842+0000, but we timed out at:

Jun 12 22:21:12.373317 ubuntu-xenial-internap-mtl01-9270284 nova-compute[23597]: WARNING nova.virt.libvirt.driver [None req-fd8ab969-7349-4740-89de-dd5621ee2154 tempest-TestNetworkBasicOps-1239760647 tempest-TestNetworkBasicOps-1239760647] [instance: fb351e95-745f-4c40-bb58-d122353e35f4] Failed to detach interface tapb2bdb0c0-62 after repeated attempts. Final interface xml:

So nova retried and waited for ~2 minutes, but it took ~5 minutes for the device to be gone.
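Working those deltas out from the timestamps above:

from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S.%f'
detach_start   = datetime.strptime('2017-06-12 22:19:31.201000', fmt)
nova_gave_up   = datetime.strptime('2017-06-12 22:21:12.373317', fmt)
device_removed = datetime.strptime('2017-06-12 22:24:17.842000', fmt)

print(nova_gave_up - detach_start)     # 0:01:41.172317 -> nova gave up after ~2 minutes
print(device_removed - detach_start)   # 0:04:46.641000 -> device actually gone after ~5 minutes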

This is the code nova is using:

https://github.com/openstack/nova/blob/a2dd78580ec5bcc7a73924cde46a53afe2c09348/nova/virt/libvirt/guest.py#L374

Calling the RetryDecorator:

http://git.openstack.org/cgit/openstack/oslo.service/tree/oslo_service/loopingcall.py#n345

Looking at that, it should retry 7 times with an incremental sleep starting at 2, so 2 + 4 + 6 + 8 + 10 + 12 + 14 = 56 seconds of total sleep time.
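Sanity-checking that schedule:

inc_sleep_time = 2   # seconds
max_retry_count = 7

sleeps = [inc_sleep_time * n for n in range(1, max_retry_count + 1)]
print(sleeps)       # [2, 4, 6, 8, 10, 12, 14]
print(sum(sleeps))  # 56 -> roughly a minute of sleep, on top of the time each
                    # detach call itself takes (the first detach job above took ~5 seconds)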

So it really looks like a problem with the underlying device type and/or something in libvirt/qemu, given how long it takes for the device to actually be removed from the guest.