Detach interface failed - timeout waiting to detach tap device in linuxbridge job (pike)

Bug #1696125 reported by Matt Riedemann on 2017-06-06
This bug affects 5 people

Affects: OpenStack Compute (nova) | Importance: High | Assigned to: Matt Riedemann
Affects: neutron | Importance: Undecided | Assigned to: Unassigned

Bug Description

Seeing this in Tempest runs on master (pike):

http://logs.openstack.org/24/471024/2/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/6b98d38/logs/screen-n-cpu.txt.gz?level=TRACE#_Jun_06_02_16_02_855503

Jun 06 02:16:02.855503 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: WARNING nova.compute.manager [None req-b4a50024-a2fd-4279-b284-340d2074f1c1 tempest-TestNetworkBasicOps-1479445685 tempest-TestNetworkBasicOps-1479445685] [instance: 2668bcb9-b13d-4b5b-8ee5-edbdee3b15a8] Detach interface failed, port_id=3843caa3-ab04-45f1-94d8-f330390e40fe, reason: Device detach failed for fa:16:3e:ab:e3:3f: Unable to detach from guest transient domain.: DeviceDetachFailed: Device detach failed for fa:16:3e:ab:e3:3f: Unable to detach from guest transient domain.
Jun 06 02:16:02.884007 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server [None req-b4a50024-a2fd-4279-b284-340d2074f1c1 tempest-TestNetworkBasicOps-1479445685 tempest-TestNetworkBasicOps-1479445685] Exception during message handling: InterfaceDetachFailed: Failed to detach network adapter device from 2668bcb9-b13d-4b5b-8ee5-edbdee3b15a8
Jun 06 02:16:02.884180 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
Jun 06 02:16:02.884286 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 157, in _process_incoming
Jun 06 02:16:02.884395 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
Jun 06 02:16:02.884538 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
Jun 06 02:16:02.884669 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
Jun 06 02:16:02.884777 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
Jun 06 02:16:02.884869 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
Jun 06 02:16:02.884968 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/exception_wrapper.py", line 77, in wrapped
Jun 06 02:16:02.885069 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server function_name, call_dict, binary)
Jun 06 02:16:02.885171 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Jun 06 02:16:02.885272 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server self.force_reraise()
Jun 06 02:16:02.885367 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Jun 06 02:16:02.885461 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
Jun 06 02:16:02.885554 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/exception_wrapper.py", line 68, in wrapped
Jun 06 02:16:02.885649 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
Jun 06 02:16:02.885755 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 214, in decorated_function
Jun 06 02:16:02.885856 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server kwargs['instance'], e, sys.exc_info())
Jun 06 02:16:02.885950 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Jun 06 02:16:02.886053 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server self.force_reraise()
Jun 06 02:16:02.886143 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Jun 06 02:16:02.886232 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
Jun 06 02:16:02.886322 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 202, in decorated_function
Jun 06 02:16:02.886415 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
Jun 06 02:16:02.886505 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 5224, in detach_interface
Jun 06 02:16:02.886601 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server raise exception.InterfaceDetachFailed(instance_uuid=instance.uuid)
Jun 06 02:16:02.886695 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server InterfaceDetachFailed: Failed to detach network adapter device from 2668bcb9-b13d-4b5b-8ee5-edbdee3b15a8
Jun 06 02:16:02.886794 ubuntu-xenial-ovh-bhs1-9149075 nova-compute[24118]: ERROR oslo_messaging.rpc.server

This probably started with: https://review.openstack.org/#/c/349014/ (merged 5/31)

http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Detach%20interface%20failed%5C%22%20AND%20message:%5C%22reason:%20Device%20detach%20failed%20for%5C%22%20AND%20message:%5C%22Unable%20to%20detach%20from%20guest%20transient%20domain%5C%22%20AND%20tags:%5C%22screen-n-cpu.txt%5C%22&from=10d

Might need something like this: https://review.openstack.org/#/c/441204/

Matt Riedemann (mriedem) on 2017-06-06
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Matt Riedemann (mriedem) wrote :

Impacting CI runs so bumping to High severity.

Changed in nova:
importance: Medium → High
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Matt Riedemann (mriedem) wrote :

Reviewed: https://review.openstack.org/441204
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=563c0927d14d052e3f1fad80df95fe4a7c48d38b
Submitter: Jenkins
Branch: master

commit 563c0927d14d052e3f1fad80df95fe4a7c48d38b
Author: Matt Riedemann <email address hidden>
Date: Tue Jun 6 14:08:37 2017 -0400

    libvirt: Check if domain is persistent before detaching devices

    Previously the libvirt driver would always assume that it was only
    detaching devices (volumes or virtual interfaces) from a persistent
    domain however that is not always the case.

    For example when rolling back from a live migration an attempt is made
    to detach volumes from the transient destination domain that is being
    cleaned up. This attempt would fail with the previous assumption of the
    domain being persistent in place.

    This change introduces a simple call to has_persistent_configuration
    within detach_device_with_retry to confirm the state of the domain
    before attempting to detach.

    Closes-Bug: #1669857
    Closes-Bug: #1696125
    Change-Id: I95948721a0119f5f54dbe50d4455fd47d422164b
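The idea behind the fix above can be sketched roughly as follows. This is a minimal illustration, not nova's actual code: the `Guest` class is a stand-in, and only `has_persistent_configuration` comes from the commit message. The flag values match libvirt's `virDomainDeviceModifyFlags` constants.

```python
# libvirt device-modify flags (values from virDomainDeviceModifyFlags)
VIR_DOMAIN_AFFECT_LIVE = 1    # apply to the running (transient) domain
VIR_DOMAIN_AFFECT_CONFIG = 2  # apply to the persistent definition


class Guest(object):
    """Stand-in for nova's libvirt Guest wrapper (illustrative only)."""

    def __init__(self, persistent, running):
        self._persistent = persistent
        self._running = running

    def has_persistent_configuration(self):
        return self._persistent

    def detach_flags(self):
        # Only request a persistent-config detach when the domain
        # actually has a persistent definition; requesting it against a
        # purely transient domain is what makes libvirt fail with
        # "Unable to detach from guest transient domain."
        flags = 0
        if self.has_persistent_configuration():
            flags |= VIR_DOMAIN_AFFECT_CONFIG
        if self._running:
            flags |= VIR_DOMAIN_AFFECT_LIVE
        return flags
```

For a transient-only domain (no persistent config) this yields just `VIR_DOMAIN_AFFECT_LIVE`, avoiding the error in the traceback above.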

Changed in nova:
status: In Progress → Fix Released

Apparently this is still failing on master even though the fix was merged:

http://status.openstack.org/elastic-recheck/#1696125

It's pretty much all in the gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial job and this test:

tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_hotplug_nic[compute,id-c5adff73-e961-41f1-b4a9-343614f18cfa,network]

Ihar pointed out this Neutron change for how linux bridge vifs are plugged:

https://review.openstack.org/#/c/447150/

Maybe nova is making a wrong assumption there about how things are going to be torn down when unplugging the vif? I'm not sure why libvirt is considering it a transient domain, though.

Neutron revert of the LB tap type change:

https://review.openstack.org/#/c/472365/

Just to see if that is part of the issue.

This issue was fixed in the openstack/nova 16.0.0.0b2 development milestone.

Dan Smith (danms) on 2017-06-09
Changed in nova:
status: Fix Released → Confirmed

Reviewed: https://review.openstack.org/472371
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fabfb347f51afbe9064a6d0061404316786140cb
Submitter: Jenkins
Branch: master

commit fabfb347f51afbe9064a6d0061404316786140cb
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 8 14:28:50 2017 -0400

    libvirt: fix alternative_device_name for detaching interfaces

    There are two changes here:

    1. Get the actual vif device name for logging, e.g. tap3e81295f-c1
    2. Log that in detach_device_with_retry so we're not logging an
       LibvirtConfigGuestInterface object.

    Change-Id: I7588c17a9936eb765269c662cad4cbedc5f58fbf
    Related-Bug: #1696125

Got a hit in the neutron debug patch:

http://logs.openstack.org/94/471494/2/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/e0332b9/logs/screen-n-cpu.txt.gz#_Jun_12_22_21_12_373317

http://paste.openstack.org/show/612460/

So when this fails, we're looking for this:

<interface type="ethernet">
  <mac address="fa:16:3e:3b:f6:dc"/>
  <model type="virtio"/>
  <driver name="qemu"/>
  <target dev="tapb2bdb0c0-62"/>
</interface>

And within the guest I see:

<interface type='ethernet'>
  <mac address='fa:16:3e:3b:f6:dc'/>
  <target dev='tapb2bdb0c0-62'/>
  <model type='virtio'/>
  <driver name='qemu'/>
  <alias name='net1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
</interface>

This is what it's comparing against:

https://github.com/openstack/nova/blob/a2dd78580ec5bcc7a73924cde46a53afe2c09348/nova/virt/libvirt/guest.py#L251-L255

if (interface.mac_addr == cfg.mac_addr and
        interface.net_type == cfg.net_type and
        interface.source_dev == cfg.source_dev and
        interface.target_dev == cfg.target_dev and
        interface.vhostuser_path == cfg.vhostuser_path):
    return interface

Which is going to match on net_type, mac_addr and target_dev since source_dev and vhostuser_path are both None.
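That matching logic can be sketched as a self-contained function (the `Iface` namedtuple is a stand-in for `LibvirtConfigGuestInterface`, carrying only the five compared fields):

```python
from collections import namedtuple

# Stand-in for LibvirtConfigGuestInterface; only the compared fields.
Iface = namedtuple(
    'Iface', 'mac_addr net_type source_dev target_dev vhostuser_path')


def get_interface_by_cfg(guest_interfaces, cfg):
    """Return the guest interface matching cfg on all five compared
    fields, or None. For net_type='ethernet' vifs, source_dev and
    vhostuser_path are both None, so the match effectively reduces to
    mac_addr + net_type + target_dev."""
    for interface in guest_interfaces:
        if (interface.mac_addr == cfg.mac_addr and
                interface.net_type == cfg.net_type and
                interface.source_dev == cfg.source_dev and
                interface.target_dev == cfg.target_dev and
                interface.vhostuser_path == cfg.vhostuser_path):
            return interface
    return None


# The two XML snippets above differ only in alias/address, which are
# not compared, so the in-guest interface matches the one we seek:
wanted = Iface('fa:16:3e:3b:f6:dc', 'ethernet', None, 'tapb2bdb0c0-62', None)
in_guest = Iface('fa:16:3e:3b:f6:dc', 'ethernet', None, 'tapb2bdb0c0-62', None)
assert get_interface_by_cfg([in_guest], wanted) is in_guest
```

So as long as the interface is still present in the guest XML, the retry loop will keep finding it and keep concluding the detach has not completed.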

Matt Riedemann (mriedem) wrote :

Looking at the libvirtd logs, I see the net1 device being added shortly before the detach:

2017-06-12 22:19:28.525+0000: 30400: debug : virThreadJobSet:96 : Thread 30400 (virNetServerHandleJob) is now running job remoteDispatchDomainAttachDeviceFlags
...
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:01 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:02 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:03 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:04 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:761 : PCI slot 0000:00:05 already in use
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressGetNextSlot:810 : Found free PCI slot 0000:00:06
2017-06-12 22:19:28.623+0000: 30400: debug : virDomainPCIAddressReserveAddr:574 : Reserving PCI slot 0000:00:06.0 (multifunction='off')
...
2017-06-12 22:19:28.625+0000: 30400: debug : qemuMonitorAddDeviceWithFd:3063 : device=virtio-net-pci,netdev=hostnet1,id=net1,mac=fa:16:3e:3b:f6:dc,bus=pci.0,addr=0x6 fd=-1 fdname=<null>
...
2017-06-12 22:19:28.625+0000: 30400: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet1","id":"net1","mac":"fa:16:3e:3b:f6:dc","bus":"pci.0","addr":"0x6"},"id":"libvirt-19"}' for write with FD -1

--

Then we start detaching here:

2017-06-12 22:19:31.201+0000: 30402: debug : virThreadJobSet:96 : Thread 30402 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
2017-06-12 22:19:31.201+0000: 30402: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: modify (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)
...
2017-06-12 22:19:31.246+0000: 30402: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-21"}' for write with FD -1
...
2017-06-12 22:19:36.250+0000: 30402: debug : virThreadJobClear:121 : Thread 30402 (virNetServerHandleJob) finished job remoteDispatchDomainDetachDeviceFlags with ret=0

And then right after that it looks like there is another detach job:

2017-06-12 22:19:36.255+0000: 30401: debug : virThreadJobSet:96 : Thread 30401 (virNetServerHandleJob) is now running job remoteDispatchDomainDetachDeviceFlags
2017-06-12 22:19:36.255+0000: 30401: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: modify (vm=0x7fa8cc003350 name=instance-00000077, current job=none async=none)
2017-06-12 22:19:36.255+0000: 30401: debug : qemuDomainObjBeginJobInternal:3309 : Started job: modify (async=none vm=0x7fa8cc003350 name=instance-00000077)
...
2017-06-12 22:19:36.257+0000: 30401: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"device_del","arguments":{"id":"net1"},"id":"libvirt-23"}' for write with FD -1

That could be from when we start the retry loop doing detaches on what we think is the transient domain.

Eventually I finally see this:

2017-06-...

Read more...

Matt Riedemann (mriedem) wrote :

Looking in the neutron L2 agent logs, I see the tap device is attached/added here:

http://logs.openstack.org/94/471494/2/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/e0332b9/logs/screen-q-agt.txt.gz#_Jun_12_22_19_30_156456

Jun 12 22:19:30.156456 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG neutron.plugins.ml2.drivers.agent._common_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Port tapb2bdb0c0-62 added {{(pid=18735) _process_device_if_exists /opt/stack/new/neutron/neutron/plugins/ml2/drivers/agent/_common_agent.py:236}}
...
Jun 12 22:19:30.240016 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG neutron.plugins.ml2.drivers.linuxbridge.agent.linuxbridge_neutron_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Adding device tapb2bdb0c0-62 to bridge brq8bb2148d-ee {{(pid=18735) _add_tap_interface /opt/stack/new/neutron/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:532}}

Then it's removed here:

Jun 12 22:19:31.745781 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG neutron.plugins.ml2.drivers.agent._common_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Agent loop found changes! {'current': set(['tap3578cb80-7a', 'tape6c02d6c-e4', 'tap8b6dc92f-75', 'tapb7c1b8db-1e', 'tap999bcb19-81', 'tap839063fe-6d', 'tap3a6b0b5c-cb', 'tap14bcf3bb-af', 'tap0f5cf6b7-63']), 'timestamps': {'tap3578cb80-7a': 9, 'tape6c02d6c-e4': 483, 'tap8b6dc92f-75': 6, 'tapb7c1b8db-1e': 10, 'tap999bcb19-81': 478, 'tap839063fe-6d': 484, 'tap3a6b0b5c-cb': 480, 'tap14bcf3bb-af': 479, 'tap0f5cf6b7-63': 11}, 'removed': set(['tapb2bdb0c0-62']), 'added': set([]), 'updated': set([])} {{(pid=18735) daemon_loop /opt/stack/new/neutron/neutron/plugins/ml2/drivers/agent/_common_agent.py:452}}
...
Jun 12 22:19:31.840782 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: INFO neutron.agent.securitygroups_rpc [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Remove device filter for set(['tapb2bdb0c0-62'])
...
Jun 12 22:19:31.901327 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: INFO neutron.plugins.ml2.drivers.agent._common_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Attachment tapb2bdb0c0-62 removed

Matt Riedemann (mriedem) wrote :

There is a weird 5 minute gap in the neutron L2 agent logs here:

http://logs.openstack.org/94/471494/2/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/e0332b9/logs/screen-q-agt.txt.gz#_Jun_12_22_24_33_893133

Jun 12 22:19:32.086692 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG oslo_concurrency.lockutils [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Lock "ebtables" released by "neutron.plugins.ml2.drivers.linuxbridge.agent.arp_protect.delete_arp_spoofing_protection" :: held 0.033s {{(pid=18735) inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282}}
Jun 12 22:24:33.893133 ubuntu-xenial-internap-mtl01-9270284 neutron-linuxbridge-agent[18735]: DEBUG neutron.plugins.ml2.drivers.agent._common_agent [None req-2b21118e-0da6-4c4a-a9d5-5d0e2cf4ccf8 None None] Agent loop found changes! {'current': set(['tap3578cb80-7a', 'tap8b6dc92f-75', 'tapb7c1b8db-1e', 'tap999bcb19-81', 'tap839063fe-6d', 'tap3a6b0b5c-cb', 'tap14bcf3bb-af', 'tap0f5cf6b7-63']), 'timestamps': {'tap3578cb80-7a': 9, 'tap8b6dc92f-75': 6, 'tapb7c1b8db-1e': 10, 'tap999bcb19-81': 478, 'tap839063fe-6d': 484, 'tap3a6b0b5c-cb': 480, 'tap14bcf3bb-af': 479, 'tap0f5cf6b7-63': 11}, 'removed': set(['tape6c02d6c-e4']), 'added': set([]), 'updated': set([])} {{(pid=18735) daemon_loop /opt/stack/new/neutron/neutron/plugins/ml2/drivers/agent/_common_agent.py:452}}

summary: - Detach interface failed - Unable to detach from guest transient domain
- (pike)
+ Detach interface failed - timeout waiting to detach tap device in
+ linuxbridge job (pike)
tags: added: neutron-proactive-backport-potential
Matt Riedemann (mriedem) wrote :

I also wonder if we could be hitting some weird apparmor issues or something, see:

https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1665698

And https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1620407.

Reviewed: https://review.openstack.org/472377
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a4e51d5c69b9dced99db0b0d78ef8b784cb50960
Submitter: Jenkins
Branch: master

commit a4e51d5c69b9dced99db0b0d78ef8b784cb50960
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 8 14:52:08 2017 -0400

    libvirt: dump debug info when interface detach times out

    The retry loop for detaching an interface is comparing the
    list of LibvirtConfigGuestInterfaces in the guest to the
    one we're trying to detach, so if we timeout waiting for
    the device to be gone from the guest it would be helpful
    to dump what is currently in the guest xml and the interface
    device we're looking for to compare.

    Change-Id: Id20a53812dbb1f888d8debc964418e79dd3e0a30
    Related-Bug: #1696125

Kevin Benton (kevinbenton) wrote :

It seems vif_type='tap' is just broken with libvirt, so it's not safe to use. There isn't anything we can do on the Neutron side to fix this other than suggest plugins not use that vif_type until a fix is found in libvirt.

Changed in neutron:
status: New → Opinion
Sean Dague (sdague) on 2017-06-23
Changed in nova:
status: Confirmed → In Progress
Sean Dague (sdague) wrote :

Automatically discovered version pike in description. If this is incorrect, please update the description to include 'nova version: ...'

tags: added: openstack-version.pike
tags: removed: neutron-proactive-backport-potential

Reviewed: https://review.openstack.org/472309
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=55e2c851f15d68e7891f867cc5f32eece9a95034
Submitter: Jenkins
Branch: stable/ocata

commit 55e2c851f15d68e7891f867cc5f32eece9a95034
Author: Lee Yarwood <email address hidden>
Date: Tue Jun 6 14:08:37 2017 -0400

    libvirt: Check if domain is persistent before detaching devices

    Previously the libvirt driver would always assume that it was only
    detaching devices (volumes or virtual interfaces) from a persistent
    domain however that is not always the case.

    For example when rolling back from a live migration an attempt is made
    to detach volumes from the transient destination domain that is being
    cleaned up. This attempt would fail with the previous assumption of the
    domain being persistent in place.

    This change introduces a simple call to has_persistent_configuration
    within detach_device_with_retry to confirm the state of the domain
    before attempting to detach.

    Closes-Bug: #1669857
    Closes-Bug: #1696125
    Change-Id: I95948721a0119f5f54dbe50d4455fd47d422164b
    (cherry picked from commit 563c0927d14d052e3f1fad80df95fe4a7c48d38b)

tags: added: in-stable-ocata

Reviewed: https://review.openstack.org/491843
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c04c5d57cc41deacec53e18abe3ba93a1d5b5ccf
Submitter: Jenkins
Branch: stable/ocata

commit c04c5d57cc41deacec53e18abe3ba93a1d5b5ccf
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 8 14:28:50 2017 -0400

    libvirt: fix alternative_device_name for detaching interfaces

    There are two changes here:

    1. Get the actual vif device name for logging, e.g. tap3e81295f-c1
    2. Log that in detach_device_with_retry so we're not logging an
       LibvirtConfigGuestInterface object.

    Change-Id: I7588c17a9936eb765269c662cad4cbedc5f58fbf
    Related-Bug: #1696125
    (cherry picked from commit fabfb347f51afbe9064a6d0061404316786140cb)

Reviewed: https://review.openstack.org/472345
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b2bafd09c3cde8cfd782a69aa8aa5dd53809d5bc
Submitter: Jenkins
Branch: stable/newton

commit b2bafd09c3cde8cfd782a69aa8aa5dd53809d5bc
Author: Lee Yarwood <email address hidden>
Date: Tue Jun 6 14:08:37 2017 -0400

    libvirt: Check if domain is persistent before detaching devices

    Previously the libvirt driver would always assume that it was only
    detaching devices (volumes or virtual interfaces) from a persistent
    domain however that is not always the case.

    For example when rolling back from a live migration an attempt is made
    to detach volumes from the transient destination domain that is being
    cleaned up. This attempt would fail with the previous assumption of the
    domain being persistent in place.

    This change introduces a simple call to has_persistent_configuration
    within detach_device_with_retry to confirm the state of the domain
    before attempting to detach.

    Closes-Bug: #1669857
    Closes-Bug: #1696125
    Change-Id: I95948721a0119f5f54dbe50d4455fd47d422164b
    (cherry picked from commit 563c0927d14d052e3f1fad80df95fe4a7c48d38b)
    (cherry picked from commit 00b7b714b37dbbaac8a3570541ecc6b75ce3a77f)

tags: added: in-stable-newton

Reviewed: https://review.openstack.org/491844
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2cf2b7ea6441c2eb73574a6e35ef10ae07550c6b
Submitter: Jenkins
Branch: stable/newton

commit 2cf2b7ea6441c2eb73574a6e35ef10ae07550c6b
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 8 14:28:50 2017 -0400

    libvirt: fix alternative_device_name for detaching interfaces

    There are two changes here:

    1. Get the actual vif device name for logging, e.g. tap3e81295f-c1
    2. Log that in detach_device_with_retry so we're not logging an
       LibvirtConfigGuestInterface object.

    Change-Id: I7588c17a9936eb765269c662cad4cbedc5f58fbf
    Related-Bug: #1696125
    (cherry picked from commit fabfb347f51afbe9064a6d0061404316786140cb)
    (cherry picked from commit c04c5d57cc41deacec53e18abe3ba93a1d5b5ccf)

This issue was fixed in the openstack/nova 15.0.7 release.

This issue was fixed in the openstack/nova 14.0.8 release.

Steve Noyes (steve-noyes) wrote :

While debugging the LiveMigrationTest.test_iscsi_volume tempest test, I am seeing this in my local env as one of the causes of intermittent failures with this test. I can get this to fail maybe 2-3x a day. This is what I run to provoke it:

$ tempest run --regex test_live_migration --concurrency=4

I haven't seen it fail on a single threaded test, but it's hard to say since it fails so infrequently.

The failure in n-cpu.log is:

2017-09-20 14:01:50.192 ^[[01;31mERROR oslo_messaging.rpc.server [^[[01;36mreq-6db9664a-4c6e-4d47-bffc-224b45b34eb0 ^[[00;36mtempest-LiveMigrationTest-1489892041 tempest-LiveMigrationTest-1489892041^[[01;31m] ^[[01;35m^[[01;31mException during message handling^[[00m: DeviceDetachFailed: Device detach failed for vdb: Unable to detach from guest transient domain.

The tempest error is:

    tempest.lib.exceptions.TimeoutException: Request timed out
    Details: volume 3194b6d9-cb70-4e67-8c28-ebf9ea094297 failed to reach ['available'] status (current in-use) within the required time (196 s).

I am adding more debug code to help debug this.

Steve Noyes (steve-noyes) wrote :

Version of nova:

$ git show
commit e9daa28bae07fff7e617f03ef6b4956148db17f2
Merge: c1de4c7 10f8a9a
Author: Jenkins <email address hidden>
Date: Thu Aug 31 07:00:41 2017 +0000

    Merge "[placement] Require at least one resource class in allocation"

Steve Noyes (steve-noyes) wrote :

It doesn't look like a timeout issue. When the test is going to fail, you see these messages in the logs:

2017-09-19 14:13:01.172 DEBUG oslo.service.loopingcall [-] Invoking nova.virt.libvirt.guest._do_wait_and_retry_detach; retry count is 6. from (pid=611) _func /usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py:393
...
2017-09-19 14:13:23.765 DEBUG oslo.service.loopingcall [-] Invoking nova.virt.libvirt.guest._do_wait_and_retry_detach; retry count is 7. from (pid=611) _func /usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py:393

When the test passes, you don't see these messages, so if retries start happening, the detach will eventually fail.

melanie witt (melwitt) wrote :

Something to know about detaching from a guest transient domain is that it's an asynchronous request to the guest OS [1], which the guest OS can choose to ignore (e.g. if it has a file open or is otherwise busy).

"Beware that depending on the hypervisor and device type, detaching a device from a running domain may be asynchronous. That is, calling virDomainDetachDeviceFlags may just request device removal while the device is actually removed later (in cooperation with a guest OS)."

So it seems like what you're seeing is the guest OS ignoring the request to detach for some reason, and Nova keeps retrying until it eventually gives up. Each retry simply requests the detach again and then checks for the existence of the device. If the device remains, it tries again, until the max retry count is reached.

[1] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainDetachDeviceFlags
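The retry behavior described above can be sketched like this. The function name mirrors `detach_device_with_retry` from the thread, but the signature and polling structure here are hypothetical; nova's real implementation is built on oslo.service looping calls.

```python
import time


def detach_device_with_retry(detach, device_exists, max_retries=7, delay=2):
    """Request the detach, then poll for the device; if it is still
    present, request again, until max_retries is exhausted.

    detach: callable issuing the (asynchronous) detach request, which
        the guest OS may ignore.
    device_exists: callable returning True while the device is still
        present in the guest.
    """
    for attempt in range(max_retries):
        detach()
        time.sleep(delay)
        if not device_exists():
            return True
    # Mirrors the DeviceDetachFailed outcome seen in the logs above.
    raise RuntimeError(
        'DeviceDetachFailed: device still present after %d attempts'
        % max_retries)
```

If the guest OS never acts on the request (as suspected in this bug), every poll finds the device still present and the loop ends in the failure seen in n-cpu.log.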

Steve Noyes (steve-noyes) wrote :

Thanks for the comment. That shouldn't apply here, though, as I don't think the volume was ever mounted by the instance's OS. From what I've seen, cirros doesn't automount volumes that get attached. And if it's not mounted, I can't think of anything the OS would do to prevent it from being detached. But I certainly could be missing something...

