Bug #1696125 “Detach interface failed - timeout waiting to detac...” : Bugs : OpenStack Compute (nova)

Matt Riedemann (mriedem) on 2017-06-06

Changed in nova:
status:	New → Confirmed
importance:	Undecided → Medium

Revision history for this message

Matt Riedemann (mriedem) wrote on 2017-06-06:

#1

Impacting CI runs so bumping to High severity.

Changed in nova:
importance:	Medium → High

OpenStack Infra (hudson-openstack) on 2017-06-06

Changed in nova:
assignee:	nobody → Matt Riedemann (mriedem)
status:	Confirmed → In Progress

Revision history for this message

Matt Riedemann (mriedem) wrote on 2017-06-06:

#2

https://review.openstack.org/#/c/441204/ should address this.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-06-07: Fix merged to nova (master)

#3

Reviewed: https://review.openstack.org/441204
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=563c0927d14d052e3f1fad80df95fe4a7c48d38b
Submitter: Jenkins
Branch: master

commit 563c0927d14d052e3f1fad80df95fe4a7c48d38b
Author: Matt Riedemann <email address hidden>
Date: Tue Jun 6 14:08:37 2017 -0400

libvirt: Check if domain is persistent before detaching devices

    Previously the libvirt driver would always assume that it was only
    detaching devices (volumes or virtual interfaces) from a persistent
    domain however that is not always the case.

    For example when rolling back from a live migration an attempt is made
    to detach volumes from the transient destination domain that is being
    cleaned up. This attempt would fail with the previous assumption of the
    domain being persistent in place.

    This change introduces a simple call to has_persistent_configuration
    within detach_device_with_retry to confirm the state of the domain
    before attempting to detach.

    Closes-Bug: #1669857
    Closes-Bug: #1696125
    Change-Id: I95948721a0119f5f54dbe50d4455fd47d422164b

Changed in nova:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-06-08: Fix proposed to nova (stable/ocata)

#4

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/472309

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-06-08: Fix proposed to nova (stable/newton)

#5

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/472345

Revision history for this message

Matt Riedemann (mriedem) wrote on 2017-06-08: Re: Detach interface failed - Unable to detach from guest transient domain (pike)

#6

Apparently this is still failing on master even though the fix was merged:

http://status.openstack.org/elastic-recheck/#1696125

It's pretty much all in the gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial job and this test:

tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_hotplug_nic[compute,id-c5adff73-e961-41f1-b4a9-343614f18cfa,network]

Ihar pointed out this Neutron change for how linux bridge vifs are plugged:

https://review.openstack.org/#/c/447150/

Maybe there is something there with how nova thinks things are going to be down when unplugging the vif that is the wrong assumption? I'm not sure why libvirt is considering it a transient domain though.

Revision history for this message

Matt Riedemann (mriedem) wrote on 2017-06-08:

#7

https://wiki.libvirt.org/page/VM_lifecycle#Transient_guest_domains_vs_Persistent_guest_domains

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-06-08: Related fix proposed to nova (master)

#8

Related fix proposed to branch: master
Review: https://review.openstack.org/472371

Revision history for this message

Matt Riedemann (mriedem) wrote on 2017-06-08: Re: Detach interface failed - Unable to detach from guest transient domain (pike)

#9

Neutron revert of the LB tap type change:

https://review.openstack.org/#/c/472365/

Just to see if that is part of the issue.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-06-08: Related fix proposed to nova (master)

#10

Related fix proposed to branch: master
Review: https://review.openstack.org/472377

Revision history for this message

Matt Riedemann (mriedem) wrote on 2017-06-08: Re: Detach interface failed - Unable to detach from guest transient domain (pike)

#11

Running debug on the nova side using:

https://review.openstack.org/#/c/471494/

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-06-08: Fix included in openstack/nova 16.0.0.0b2

#12

This issue was fixed in the openstack/nova 16.0.0.0b2 development milestone.

Dan Smith (danms) on 2017-06-09

Changed in nova:
status:	Fix Released → Confirmed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-06-12: Related fix merged to nova (master)

#13

Reviewed: https://review.openstack.org/472371
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fabfb347f51afbe9064a6d0061404316786140cb
Submitter: Jenkins
Branch: master

commit fabfb347f51afbe9064a6d0061404316786140cb
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 8 14:28:50 2017 -0400

libvirt: fix alternative_device_name for detaching interfaces

There are two changes here:

    1. Get the actual vif device name for logging, e.g. tap3e81295f-c1
    2. Log that in detach_device_with_retry so we're not logging an
       LibvirtConfigGuestInterface object.

Change-Id: I7588c17a9936eb765269c662cad4cbedc5f58fbf
Related-Bug: #1696125

Revision history for this message

Matt Riedemann (mriedem) wrote on 2017-06-13: Re: Detach interface failed - Unable to detach from guest transient domain (pike)

#14

Got a hit in the neutron debug patch:

http://logs.openstack.org/94/471494/2/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/e0332b9/logs/screen-n-cpu.txt.gz#_Jun_12_22_21_12_373317

http://paste.openstack.org/show/612460/

So when this fails, we're looking for this:

And within the guest I see:

This is what it's comparing against:

https://github.com/openstack/nova/blob/a2dd78580ec5bcc7a73924cde46a53afe2c09348/nova/virt/libvirt/guest.py#L251-L255

if (interface.mac_addr == cfg.mac_addr and
       interface.net_type == cfg.net_type and
       interface.source_dev == cfg.source_dev and
       interface.target_dev == cfg.target_dev and
       interface.vhostuser_path == cfg.vhostuser_path):
return interface

Which is going to match on net_type, mac_addr and target_dev since source_dev and vhostuser_path are both None.

Revision history for this message

Matt Riedemann (mriedem) wrote on 2017-06-13:

#15

This is the qemu guest log but there isn't much there:

http://logs.openstack.org/94/471494/2/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/e0332b9/logs/libvirt/qemu/instance-00000077.txt.gz

Revision history for this message

Matt Riedemann (mriedem) wrote on 2017-06-13:

#16

Download full text (5.7 KiB)

Looking at the libvirtd logs, I see the net1 device being added shortly before the detach:

2017-06-12 22:19:28.525+0000: 30400: debug : virThreadJobSet:96 : Thread 30400 (virNetServerHandleJob) is now running job remoteDispatchDomainAttachDeviceFlags
...
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:01 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:02 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:03 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:04 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:05 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:810 : Found free PCI slot 0000:00:06
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressReserveAddr:574 : Reserving PCI slot 0000:00:06.0 (multifunction='off')
...
2017-06-12 22:19:28.625+0000: 30400: debug : qemuMonitorAddDeviceWithFd:3063 : device=virtio-net-pci,netdev=hostnet1,id=net1,mac=fa:16:3e:3b:f6:dc,bus=pci.0,addr=0x6 fd=-1 fdname=<null>
...
2017-06-12 22:19:28.625+0000: 30400: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet1","id":"net1","mac":"fa:16:3e:3b:f6:dc","bus":"pci.0","addr":"0x6"},"id":"libvirt-19"}' for write with FD -1

--

Then we start detaching here:

2017-06-12 22:19:31.201+0000: 30402: debug : virThreadJobSet:96 : Thread 30402 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
2017-06-12 22:19:31.201+0000: 30402: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: modify (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)
...
2017-06-12 22:19:31.246+0000: 30402: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-21"}' for write with FD -1
...
2017-06-12 22:19:36.250+0000: 30402: debug : virThreadJobClear:121 : Thread 30402 (virNetServerHandleJob) finished job remoteDispatchDomainDetachDeviceFlags with ret=0

And then right after that there is a another detach job it looks like:

2017-06-12 22:19:36.255+0000: 30401: debug : virThreadJobSet:96 : Thread 30401 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
2017-06-12 22:19:36.255+0000: 30401: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: modify (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)
2017-06-12 22:19:36.255+0000: 30401: debug : qemuDomainObjBeginJobInternal:3309 : Started job: modify (async=none vm=0x7fa8cc003350 name=instance-00000077)
...
2017-06-12 22:19:36.257+0000: 30401: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-23"}' for write with FD -1

That could be from when we start the retry loop doing detaches on what we think is the transient domain.

Eventually I finally see this:

2017-06-...

Looking at the libvirtd logs, I see the net1 device being added shortly before the detach:

2017-06-12 22:19:28.525+0000: 30400: debug : virThreadJobSet:96 : Thread 30400 (virNetServerHandleJob) is now running job remoteDispatchDomainAttachDeviceFlags
...
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:01 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:02 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:03 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:04 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:05 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:810 : Found free PCI slot 0000:00:06
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressReserveAddr:574 : Reserving PCI slot 0000:00:06.0 (multifunction='off')
...
2017-06-12 22:19:28.625+0000: 30400: debug : qemuMonitorAddDeviceWithFd:3063 : device=virtio-net-pci,netdev=hostnet1,id=net1,mac=fa:16:3e:3b:f6:dc,bus=pci.0,addr=0x6 fd=-1 fdname=<null>
...
2017-06-12 22:19:28.625+0000: 30400: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet1","id":"net1","mac":"fa:16:3e:3b:f6:dc","bus":"pci.0","addr":"0x6"},"id":"libvirt-19"}' for write with FD -1

--

Then we start detaching here:

2017-06-12 22:19:31.201+0000: 30402: debug : virThreadJobSet:96 : Thread 30402 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
2017-06-12 22:19:31.201+0000: 30402: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: modify (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)
...
2017-06-12 22:19:31.246+0000: 30402: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-21"}' for write with FD -1
...
2017-06-12 22:19:36.250+0000: 30402: debug : virThreadJobClear:121 : Thread 30402 (virNetServerHandleJob) finished job remoteDispatchDomainDetachDeviceFlags with ret=0

And then right after that there is a another detach job it looks like:

2017-06-12 22:19:36.255+0000: 30401: debug : virThreadJobSet:96 : Thread 30401 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
2017-06-12 22:19:36.255+0000: 30401: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: modify (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)
2017-06-12 22:19:36.255+0000: 30401: debug : qemuDomainObjBeginJobInternal:3309 : Started job: modify (async=none vm=0x7fa8cc003350 name=instance-00000077)
...
2017-06-12 22:19:36.257+0000: 30401: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-23"}' for write with FD -1

That could be from when we start the retry loop doing detaches on what we think is the transient domain.

Eventually I finally see this:

2017-06-12 22:24:17.842+0000: 30399: debug : qemuProcessHandleDeviceDeleted:1351 : Device net1 removed from domain 0x7fa8cc003350 instance-00000077
...
2017-06-12 22:24:17.842+0000: 1366: debug : processDeviceDeletedEvent:4074 : Removing device net1 from domain 0x7fa8cc003350 instance-00000077
...
2017-06-12 22:24:17.842+0000: 1366: debug : qemuDomainRemoveNetDevice:3844 : Removing network interface net1 from domain 0x7fa8cc003350 instance-00000077

Shortly after that, the destroy command comes in to shutdown and destroy the guest:

2017-06-12 22:24:33.335+0000: 30402: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: destroy (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)

So before that, it looks like it took ~5 seconds for the net1 device to be removed. And we timed out our retry loop at 22:21:12.373317.

Our initial detach from nova was at:

Jun 12 22:19:31.200463 ubuntu-xenial-internap-mtl01-9270284 nova-compute[23597]: DEBUG nova.virt.libvirt.guest [None req-fd8ab969-7349-4740-89de-dd5621ee2154 tempest-TestNetworkBasicOps-1239760647 tempest-TestNetworkBasicOps-1239760647] Attempting initial detach for device tapb2bdb0c0-62 {{(pid=23597) detach_device_with_retry /opt/stack/new/nova/nova/virt/libvirt/guest.py:433}}

Which aligns with the libvirtd log:

2017-06-12 22:19:31.201+0000: 30402: debug : virThreadJobSet:96 : Thread 30402 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags

And net1 isn't removed until 2017-06-12 22:24:17.842+0000, but we timed out at:

Jun 12 22:21:12.373317 ubuntu-xenial-internap-mtl01-9270284 nova-compute[23597]: WARNING nova.virt.libvirt.driver [None req-fd8ab969-7349-4740-89de-dd5621ee2154 tempest-TestNetworkBasicOps-1239760647 tempest-TestNetworkBasicOps-1239760647] [instance: fb351e95-745f-4c40-bb58-d122353e35f4] Failed to detach interface tapb2bdb0c0-62 after repeated attempts. Final interface xml:

So nova retried and waited for ~2 minutes, but it took ~5 minutes for the device to be gone.

This is the code nova is using:

https://github.com/openstack/nova/blob/a2dd78580ec5bcc7a73924cde46a53afe2c09348/nova/virt/libvirt/guest.py#L374

Calling the RetryDecorator:

http://git.openstack.org/cgit/openstack/oslo.service/tree/oslo_service/loopingcall.py#n345

Looking at that, it should retry 7 times with an incremental sleep starting at 2, so 2 + 4 + 6 + 8 + 10 + 12 + 14 = ~58 seconds total sleep time.

So it really looks to be a problem with the underlying type of device and/or something with libvirt/qemu that it takes so long for the device to actually be removed from the guest.

Revision history for this message

Matt Riedemann (mriedem) wrote on 2017-06-13:

#17

Looking in the neutron L2 agent logs, I see the tap device is attached/added here:

http://logs.openstack.org/94/471494/2/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/e0332b9/logs/screen-q-agt.txt.gz#_Jun_12_22_19_30_156456

Jun 12 22:19:30.156456 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG neutron.plugins.ml2.drivers.agent._common_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Port tapb2bdb0c0-62 added {{(pid=18735) _process_device_if_exists /opt/stack/new/neutron/neutron/plugins/ml2/drivers/agent/_common_agent.py:236}}
...
Jun 12 22:19:30.240016 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG neutron.plugins.ml2.drivers.linuxbridge.agent.linuxbridge_neutron_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Adding device tapb2bdb0c0-62 to bridge brq8bb2148d-ee {{(pid=18735) _add_tap_interface /opt/stack/new/neutron/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:532}}

Then it's removed here:

Jun 12 22:19:31.745781 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG neutron.plugins.ml2.drivers.agent._common_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Agent loop found changes! {'current': set(['tap3578cb80-7a', 'tape6c02d6c-e4', 'tap8b6dc92f-75', 'tapb7c1b8db-1e', 'tap999bcb19-81', 'tap839063fe-6d', 'tap3a6b0b5c-cb', 'tap14bcf3bb-af', 'tap0f5cf6b7-63']), 'timestamps': {'tap3578cb80-7a': 9, 'tape6c02d6c-e4': 483, 'tap8b6dc92f-75': 6, 'tapb7c1b8db-1e': 10, 'tap999bcb19-81': 478, 'tap839063fe-6d': 484, 'tap3a6b0b5c-cb': 480, 'tap14bcf3bb-af': 479, 'tap0f5cf6b7-63': 11}, 'removed': set(['tapb2bdb0c0-62']), 'added': set([]), 'updated': set([])} {{(pid=18735) daemon_loop /opt/stack/new/neutron/neutron/plugins/ml2/drivers/agent/_common_agent.py:452}}
...
Jun 12 22:19:31.840782 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: INFO neutron.agent.securitygroups_rpc [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Remove device filter for set(['tapb2bdb0c0-62'])
...
Jun 12 22:19:31.901327 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: INFO neutron.plugins.ml2.drivers.agent._common_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Attachment tapb2bdb0c0-62 removed

Looking in the neutron L2 agent logs, I see the tap device is attached/added here:

http://logs.openstack.org/94/471494/2/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/e0332b9/logs/screen-q-agt.txt.gz#_Jun_12_22_19_30_156456

Jun 12 22:19:30.156456 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG neutron.plugins.ml2.drivers.agent._common_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Port tapb2bdb0c0-62 added {{(pid=18735) _process_device_if_exists /opt/stack/new/neutron/neutron/plugins/ml2/drivers/agent/_common_agent.py:236}}
...
Jun 12 22:19:30.240016 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG neutron.plugins.ml2.drivers.linuxbridge.agent.linuxbridge_neutron_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Adding device tapb2bdb0c0-62 to bridge brq8bb2148d-ee {{(pid=18735) _add_tap_interface /opt/stack/new/neutron/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:532}}

Then it's removed here:

Jun 12 22:19:31.745781 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG neutron.plugins.ml2.drivers.agent._common_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Agent loop found changes! {'current': set(['tap3578cb80-7a', 'tape6c02d6c-e4', 'tap8b6dc92f-75', 'tapb7c1b8db-1e', 'tap999bcb19-81', 'tap839063fe-6d', 'tap3a6b0b5c-cb', 'tap14bcf3bb-af', 'tap0f5cf6b7-63']), 'timestamps': {'tap3578cb80-7a': 9, 'tape6c02d6c-e4': 483, 'tap8b6dc92f-75': 6, 'tapb7c1b8db-1e': 10, 'tap999bcb19-81': 478, 'tap839063fe-6d': 484, 'tap3a6b0b5c-cb': 480, 'tap14bcf3bb-af': 479, 'tap0f5cf6b7-63': 11}, 'removed': set(['tapb2bdb0c0-62']), 'added': set([]), 'updated': set([])} {{(pid=18735) daemon_loop /opt/stack/new/neutron/neutron/plugins/ml2/drivers/agent/_common_agent.py:452}}
...
Jun 12 22:19:31.840782 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: INFO neutron.agent.securitygroups_rpc [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Remove device filter for set(['tapb2bdb0c0-62'])
...
Jun 12 22:19:31.901327 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: INFO neutron.plugins.ml2.drivers.agent._common_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Attachment tapb2bdb0c0-62 removed

Revision history for this message

Matt Riedemann (mriedem) wrote on 2017-06-13:

#18

There is a weird 5 minute gap in the neutron L2 agent logs here:

http://logs.openstack.org/94/471494/2/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/e0332b9/logs/screen-q-agt.txt.gz#_Jun_12_22_24_33_893133

Jun 12 22:19:32.086692 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG oslo_concurrency.lockutils [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Lock "ebtables" released by "neutron.plugins.ml2.drivers.linuxbridge.agent.arp_protect.delete_arp_spoofing_protection" :: held 0.033s {{(pid=18735) inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282}}
Jun 12 22:24:33.893133 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG neutron.plugins.ml2.drivers.agent._common_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Agent loop found changes! {'current': set(['tap3578cb80-7a', 'tap8b6dc92f-75', 'tapb7c1b8db-1e', 'tap999bcb19-81', 'tap839063fe-6d', 'tap3a6b0b5c-cb', 'tap14bcf3bb-af', 'tap0f5cf6b7-63']), 'timestamps': {'tap3578cb80-7a': 9, 'tap8b6dc92f-75': 6, 'tapb7c1b8db-1e': 10, 'tap999bcb19-81': 478, 'tap839063fe-6d': 484, 'tap3a6b0b5c-cb': 480, 'tap14bcf3bb-af': 479, 'tap0f5cf6b7-63': 11}, 'removed': set(['tape6c02d6c-e4']), 'added': set([]), 'updated': set([])} {{(pid=18735) daemon_loop /opt/stack/new/neutron/neutron/plugins/ml2/drivers/agent/_common_agent.py:452}}

summary:

- Detach interface failed - Unable to detach from guest transient domain
- (pike)
+ Detach interface failed - timeout waiting to detach tap device in
+ linuxbridge job (pike)

Ihar Hrachyshka (ihar-hrachyshka) on 2017-06-13

tags:

added: neutron-proactive-backport-potential

Revision history for this message

Matt Riedemann (mriedem) wrote on 2017-06-13:

#19

I also wonder if we could be hitting some weird apparmor issues or something, see:

https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1665698

And https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1620407.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-06-13: Related fix merged to nova (master)

#20

Reviewed: https://review.openstack.org/472377
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a4e51d5c69b9dced99db0b0d78ef8b784cb50960
Submitter: Jenkins
Branch: master

commit a4e51d5c69b9dced99db0b0d78ef8b784cb50960
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 8 14:52:08 2017 -0400

libvirt: dump debug info when interface detach times out

    The retry loop for detaching an interface is comparing the
    list of LibvirtConfigGuestInterfaces in the guest to the
    one we're trying to detach, so if we timeout waiting for
    the device to be gone from the guest it would be helpful
    to dump what is currently in the guest xml and the interface
    device we're looking for to compare.

Change-Id: Id20a53812dbb1f888d8debc964418e79dd3e0a30
Related-Bug: #1696125

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2017-06-19:

#21

It seems vif_type='tap' is just broken with libvirt so it's not safe to use. There isn't anything we can do on the Neutron side to fix this other than suggest plugins not use that vif_type until a fix is found in libvirt.

Changed in neutron:
status:	New → Opinion

Sean Dague (sdague) on 2017-06-23

Changed in nova:
status:	Confirmed → In Progress

Revision history for this message

Sean Dague (sdague) wrote on 2017-06-27:

#22

Automatically discovered version pike in description. If this is incorrect, please update the description to include 'nova version: ...'

tags:

added: openstack-version.pike

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-08-08: Related fix proposed to nova (stable/ocata)

#23

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/491843

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-08-08: Related fix proposed to nova (stable/newton)

#24

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/491844

Jakub Libosvar (libosvar) on 2017-08-11

tags:

removed: neutron-proactive-backport-potential

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-08-13: Fix merged to nova (stable/ocata)

#25

Reviewed: https://review.openstack.org/472309
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=55e2c851f15d68e7891f867cc5f32eece9a95034
Submitter: Jenkins
Branch: stable/ocata

commit 55e2c851f15d68e7891f867cc5f32eece9a95034
Author: Lee Yarwood <email address hidden>
Date: Tue Jun 6 14:08:37 2017 -0400

libvirt: Check if domain is persistent before detaching devices

    Previously the libvirt driver would always assume that it was only
    detaching devices (volumes or virtual interfaces) from a persistent
    domain however that is not always the case.

    For example when rolling back from a live migration an attempt is made
    to detach volumes from the transient destination domain that is being
    cleaned up. This attempt would fail with the previous assumption of the
    domain being persistent in place.

    This change introduces a simple call to has_persistent_configuration
    within detach_device_with_retry to confirm the state of the domain
    before attempting to detach.

    Closes-Bug: #1669857
    Closes-Bug: #1696125
    Change-Id: I95948721a0119f5f54dbe50d4455fd47d422164b
    (cherry picked from commit 563c0927d14d052e3f1fad80df95fe4a7c48d38b)

tags:

added: in-stable-ocata

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-08-16: Related fix merged to nova (stable/ocata)

#26

Reviewed: https://review.openstack.org/491843
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c04c5d57cc41deacec53e18abe3ba93a1d5b5ccf
Submitter: Jenkins
Branch: stable/ocata

commit c04c5d57cc41deacec53e18abe3ba93a1d5b5ccf
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 8 14:28:50 2017 -0400

libvirt: fix alternative_device_name for detaching interfaces

There are two changes here:

    1. Get the actual vif device name for logging, e.g. tap3e81295f-c1
    2. Log that in detach_device_with_retry so we're not logging an
       LibvirtConfigGuestInterface object.

    Change-Id: I7588c17a9936eb765269c662cad4cbedc5f58fbf
    Related-Bug: #1696125
    (cherry picked from commit fabfb347f51afbe9064a6d0061404316786140cb)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-08-16: Fix merged to nova (stable/newton)

#27

Reviewed: https://review.openstack.org/472345
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b2bafd09c3cde8cfd782a69aa8aa5dd53809d5bc
Submitter: Jenkins
Branch: stable/newton

commit b2bafd09c3cde8cfd782a69aa8aa5dd53809d5bc
Author: Lee Yarwood <email address hidden>
Date: Tue Jun 6 14:08:37 2017 -0400

libvirt: Check if domain is persistent before detaching devices

    Previously the libvirt driver would always assume that it was only
    detaching devices (volumes or virtual interfaces) from a persistent
    domain however that is not always the case.

    For example when rolling back from a live migration an attempt is made
    to detach volumes from the transient destination domain that is being
    cleaned up. This attempt would fail with the previous assumption of the
    domain being persistent in place.

    This change introduces a simple call to has_persistent_configuration
    within detach_device_with_retry to confirm the state of the domain
    before attempting to detach.

    Closes-Bug: #1669857
    Closes-Bug: #1696125
    Change-Id: I95948721a0119f5f54dbe50d4455fd47d422164b
    (cherry picked from commit 563c0927d14d052e3f1fad80df95fe4a7c48d38b)
    (cherry picked from commit 00b7b714b37dbbaac8a3570541ecc6b75ce3a77f)

tags:

added: in-stable-newton

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-08-16: Related fix merged to nova (stable/newton)

#28

Reviewed: https://review.openstack.org/491844
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2cf2b7ea6441c2eb73574a6e35ef10ae07550c6b
Submitter: Jenkins
Branch: stable/newton

commit 2cf2b7ea6441c2eb73574a6e35ef10ae07550c6b
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 8 14:28:50 2017 -0400

libvirt: fix alternative_device_name for detaching interfaces

There are two changes here:

    1. Get the actual vif device name for logging, e.g. tap3e81295f-c1
    2. Log that in detach_device_with_retry so we're not logging an
       LibvirtConfigGuestInterface object.

    Change-Id: I7588c17a9936eb765269c662cad4cbedc5f58fbf
    Related-Bug: #1696125
    (cherry picked from commit fabfb347f51afbe9064a6d0061404316786140cb)
    (cherry picked from commit c04c5d57cc41deacec53e18abe3ba93a1d5b5ccf)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-08-22: Fix included in openstack/nova 15.0.7

#29

This issue was fixed in the openstack/nova 15.0.7 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-08-28: Fix included in openstack/nova 14.0.8

#30

This issue was fixed in the openstack/nova 14.0.8 release.

Revision history for this message

Steve Noyes (steve-noyes) wrote on 2017-09-21:

#31

While debugging the LiveMigrationTest.test_iscsi_volume tempest test, I am seeing this in my local env as one of the causes of intermittent failures with this test. I can get this to fail maybe 2-3x a day. This is what I run to provoke it:

$ tempest run --regex test_live_migration --concurrency=4

I haven't seen it fail on a single threaded test, but it's hard to say since it fails so infrequently.

The failure in n-cpu.log is:

2017-09-20 14:01:50.192 ^[[01;31mERROR oslo_messaging.rpc.server [^[[01;36mreq-6db9664a-4c6e-4d47-bffc-224b45b34eb0 ^[[00;36mtempest-LiveMigrationTest-1489892041 tempest-LiveMigrationTest-1489892041^[[01;31m] ^[[01;35m^[[01;31mException during message handling^[[00m: DeviceDetachFailed: Device detach failed for vdb: Unable to detach from guest transient domain.

The tempest error is:

tempest.lib.exceptions.TimeoutException: Request timed out
Details: volume 3194b6d9-cb70-4e67-8c28-ebf9ea094297 failed to reach ['available'] status (current in-use) within the required time (196 s).

I am adding more debug code to help debug this.

Revision history for this message

Steve Noyes (steve-noyes) wrote on 2017-09-21:

#32

Version of nova:

$ git show
commit e9daa28bae07fff7e617f03ef6b4956148db17f2
Merge: c1de4c7 10f8a9a
Author: Jenkins <email address hidden>
Date: Thu Aug 31 07:00:41 2017 +0000

Merge "[placement] Require at least one resource class in allocation"

Revision history for this message

Steve Noyes (steve-noyes) wrote on 2017-09-21:

#33

It doesn't look like a timeout issue. When the test is going to fail, you see these messages in the logs:

2017-09-19 14:13:01.172 DEBUG oslo.service.loopingcall [-] Invoking nova.virt.libvirt.guest._do_wait_and_retry_detach; retry count is 6. from (pid=611) _func /usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py:393
...
2017-09-19 14:13:23.765 DEBUG oslo.service.loopingcall [-] Invoking nova.virt.libvirt.guest._do_wait_and_retry_detach; retry count is 7. from (pid=611) _func /usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py:393

When the test passes, you don't see these messages, so if retries start happening, the detach will eventually fail.

Revision history for this message

melanie witt (melwitt) wrote on 2017-09-21:

#34

Something to know about detaching from a guest transient domain is that it's an asynchronous request to the guest OS [1], which the guest OS can choose to ignore (e.g. if it has a file open or is otherwise busy).

"Beware that depending on the hypervisor and device type, detaching a device from a running domain may be asynchronous. That is, calling virDomainDetachDeviceFlags may just request device removal while the device is actually removed later (in cooperation with a guest OS)."

So it seems like what you're seeing is the guest OS ignoring the request to detach for some reason and Nova keeps retrying and eventually gives up. The retry simply requests the detach again and then checks for existence of the device. If the device remains, it tries again, until the max retry count is reached.

[1] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainDetachDeviceFlags

Revision history for this message

Steve Noyes (steve-noyes) wrote on 2017-09-28:

#35

Thanks for the comment. In this case, that shouldn't be the case as I don't think the volume was ever mounted by the instance's OS. From what I've seen cirros doesn't automount volumes that get attached. And if it's not mounted, I can't think of anything that the OS would do to prevent it from be detached. But I certainly could be missing something...

Revision history for this message

Dominic (abhi-02666) wrote on 2019-08-02:

#36

It is a great post. It's been such a wonderful experience while reading your blog. If facing any problem
related to a Linux please visit https://www.europewebsitedevelopment.com/ for any technical
support assistance

Lee Yarwood (lyarwood) on 2021-07-21

Changed in nova:
status:	In Progress → Fix Released

OpenStack Compute (nova)

Detach interface failed - timeout waiting to detach tap device in linuxbridge job (pike)

Bug Description

Duplicates of this bug

Other bug subscribers

Remote bug watches

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Fix Released	High	Matt Riedemann
	neutron	Opinion	Undecided	Unassigned