test_reassign_port_between_servers failing with tap device is busy errors in neutron xenial jobs since 7/28

Bug #1607714 reported by Kevin Benton
This bug affects 1 person
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released   High        Matt Riedemann
Newton                    Fix Committed  Medium      Tony Breeds
Ocata                     Fix Committed  Medium      Matt Riedemann

Bug Description

Recent failures showing up in the tempest tests for 'test_reassign_port_between_servers'.

From n-cpu.log (http://logs.openstack.org/43/298443/3/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/5acdcca/logs/screen-n-cpu.txt.gz#_2016-07-29_07_25_47_948):

2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [req-572c16fd-696d-44ab-a633-49e6625b8f9c tempest-AttachInterfacesTestJSON-1497713114 tempest-AttachInterfacesTestJSON-1497713114] [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] attaching network adapter failed.
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] Traceback (most recent call last):
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1389, in attach_interface
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] guest.attach_device(cfg, persistent=True, live=live)
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 295, in attach_device
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] self._domain.attachDeviceFlags(device_xml, flags=flags)
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] result = proxy_call(self._autowrap, f, *args, **kwargs)
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] rv = execute(f, *args, **kwargs)
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] six.reraise(c, e, tb)
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] rv = meth(*args, **kwargs)
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 560, in attachDeviceFlags
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] if ret == -1: raise libvirtError ('virDomainAttachDeviceFlags() failed', dom=self)
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42] libvirtError: Unable to create tap device tap9dd515f0-ac: Device or resource busy
2016-07-29 07:25:47.948 20203 ERROR nova.virt.libvirt.driver [instance: 73323063-7cc3-4645-9a68-662bf80d9e42]

Revision history for this message
Stuart McLaren (stuart-mclaren) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote : Re: interface attach tests failing with tap device is busy errors in neutron xenial jobs since 7/28

logstash shows it starting on 7/28: http://goo.gl/u4kqup

summary: - failure attaching interface
+ interface attach tests failing with tap device is busy errors in neutron
+ xenial jobs since 7/28
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

This change switched the neutron gate jobs over to xenial:

https://review.openstack.org/#/c/348078/

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like it might just be failing in this test:

tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON.test_reassign_port_between_servers

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like possibly an issue with glean?

http://logs.openstack.org/09/347509/4/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/2051841/logs/screen-n-cpu.txt.gz#_2016-07-29_12_59_23_311

attach device xml:
<interface type="bridge">
  <mac address="fa:16:3e:79:df:76"/>
  <model type="virtio"/>
  <driver name="qemu"/>
  <source bridge="qbr953222b9-a8"/>
  <target dev="tap953222b9-a8"/>
</interface>

http://logs.openstack.org/09/347509/4/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/2051841/logs/syslog.txt.gz#_Jul_29_12_59_21

Jul 29 12:59:21 ubuntu-xenial-osic-cloud1-3018328 glean.sh[373]: Traceback (most recent call last):
Jul 29 12:59:21 ubuntu-xenial-osic-cloud1-3018328 glean.sh[373]: File "/usr/local/bin/glean", line 11, in <module>
Jul 29 12:59:21 ubuntu-xenial-osic-cloud1-3018328 glean.sh[373]: sys.exit(main())
Jul 29 12:59:21 ubuntu-xenial-osic-cloud1-3018328 glean.sh[373]: File "/usr/local/lib/python2.7/dist-packages/glean/cmd.py", line 722, in main
Jul 29 12:59:21 ubuntu-xenial-osic-cloud1-3018328 glean.sh[373]: write_network_info_from_config_drive(args)
Jul 29 12:59:21 ubuntu-xenial-osic-cloud1-3018328 glean.sh[373]: File "/usr/local/lib/python2.7/dist-packages/glean/cmd.py", line 562, in write_network_info_from_config_drive
Jul 29 12:59:21 ubuntu-xenial-osic-cloud1-3018328 glean.sh[373]: sys_interfaces = get_sys_interfaces(args.interface, args)
Jul 29 12:59:21 ubuntu-xenial-osic-cloud1-3018328 glean.sh[373]: File "/usr/local/lib/python2.7/dist-packages/glean/cmd.py", line 508, in get_sys_interfaces
Jul 29 12:59:21 ubuntu-xenial-osic-cloud1-3018328 glean.sh[373]: '%s/%s/addr_assign_type' % (sys_root, iface), 'r').read().strip()
Jul 29 12:59:21 ubuntu-xenial-osic-cloud1-3018328 glean.sh[373]: IOError: [Errno 2] No such file or directory: '/sys/class/net/qbr953222b9/a8/addr_assign_type'

Matt Riedemann (mriedem)
summary: - interface attach tests failing with tap device is busy errors in neutron
- xenial jobs since 7/28
+ test_reassign_port_between_servers failing with tap device is busy
+ errors in neutron xenial jobs since 7/28
Revision history for this message
Matt Riedemann (mriedem) wrote :

The device detach from the guest in libvirt is asynchronous, so it must be much slower in libvirt 1.3.1 on xenial nodes. As a result, neutron is telling us that the port is detached (device_id is None on the port), which is what the tempest test is polling on, before the interface is actually detached from the guest.

So we probably need a retry loop in detach_interface in the libvirt driver (like we have for detach_volume) that retries until the interface is gone from the guest, or a timeout is hit, and only then considers the detach successful.

Revision history for this message
Matt Riedemann (mriedem) wrote :

However, detach_interface is a cast from compute API to the compute manager, so the Tempest test doesn't really have a way to poll that the interface is actually detached from the guest (beyond doing something like ssh'ing into the guest to verify the interface with the given mac is gone).
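
For illustration, the test-side wait described here amounts to polling the port until its device_id is cleared. The following is only a minimal sketch under assumptions: `show_port` is a hypothetical callable standing in for whatever neutron client the test uses, and `wait_for_port_unbound` is not a real Tempest helper. Note that it can only observe what neutron reports, not the state of the guest.

```python
import time


def wait_for_port_unbound(show_port, port_id, timeout=60.0, interval=1.0):
    """Poll until the port's device_id is empty, or raise on timeout.

    show_port is a hypothetical callable that returns the port as a dict;
    it is an assumption for this sketch, not a real client API.
    """
    deadline = time.monotonic() + timeout
    while True:
        port = show_port(port_id)
        # An empty or missing device_id means neutron considers the port unbound.
        if not port.get("device_id"):
            return port
        if time.monotonic() >= deadline:
            raise TimeoutError("port %s still bound to %s"
                               % (port_id, port["device_id"]))
        time.sleep(interval)
```

Because the loop returns as soon as device_id is empty, it says nothing about whether the tap device has actually been released on the hypervisor.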

Revision history for this message
Matt Riedemann (mriedem) wrote :

Never mind, it's the compute manager in nova that's telling neutron that the port is no longer bound:

https://github.com/openstack/nova/blob/fdf3328107e53f1c5578c2e4dfbad78d832b01c6/nova/compute/manager.py#L4990

We first call the virt driver to detach the interface (which is async) and then update the port telling neutron that the device_id is '', which is what tempest is waiting for.

So if we add a poll / retry in the libvirt guest module for the detach, we can delay the port update which tempest is waiting for and we should be good.
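
The ordering proposed above can be sketched roughly as follows. This is not the nova implementation: `detach`, `is_attached` and `unbind_port` are hypothetical callables introduced for illustration, and re-issuing the detach request on every attempt is an assumption of this sketch.

```python
import time


class DeviceStillAttached(Exception):
    """Raised when the device is still on the guest after all attempts."""


def detach_and_wait(detach, is_attached, max_attempts=5, interval=2.0):
    """Issue the asynchronous detach and retry until the device is gone.

    detach() requests the detach; is_attached() reports whether the guest
    still has the device. Both are hypothetical stand-ins.
    """
    for _ in range(max_attempts):
        detach()
        time.sleep(interval)
        if not is_attached():
            return
    raise DeviceStillAttached("gave up after %d attempts" % max_attempts)


def detach_interface(detach, is_attached, unbind_port, **retry_kwargs):
    """Unbind the port only after the guest no longer has the interface."""
    detach_and_wait(detach, is_attached, **retry_kwargs)
    # Reached only when the device is gone, so anything polling the port's
    # device_id cannot observe "unbound" while the interface is still attached.
    unbind_port()
```

If the device never goes away, the exception propagates and the port update is skipped, so the port stays bound rather than being reported as free too early.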

Matt Riedemann (mriedem)
Changed in nova:
status: Confirmed → Triaged
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
Matt Riedemann (mriedem) wrote :

Skipping the test in tempest for now: https://review.openstack.org/#/c/348955/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/349014

Changed in nova:
status: Triaged → In Progress
Revision history for this message
Sean Dague (sdague) wrote :

Not seen in the gate any more; the fixing patch is in merge conflict and really old.

Changed in nova:
status: In Progress → Invalid
assignee: Matt Riedemann (mriedem) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/349014
Reason: This review is > 6 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Invalid → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/431778

Revision history for this message
Matt Riedemann (mriedem) wrote :

It's going to be hard getting this back to Newton because it depends on:

https://review.openstack.org/#/c/372243/

Which isn't in Newton. It's in stable/ocata though.

As noted in that patch, its tests are broken too and are being fixed here:

https://review.openstack.org/#/c/431778/

tags: added: libvirt neutron
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/431778
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1ecd71b08d14450e475dc9512d40828da6fcfe15
Submitter: Jenkins
Branch: master

commit 1ecd71b08d14450e475dc9512d40828da6fcfe15
Author: Matt Riedemann <email address hidden>
Date: Thu Feb 9 18:41:11 2017 -0500

    libvirt: fix and break up _test_attach_detach_interface

    The detach_interface flow in this test was broken because
    it wasn't mocking out domain.detachDeviceFlags so the xml
    it was expecting to be passed to that method wasn't actually
    being verified. The same thing is broken in test
    test_detach_interface_device_with_same_mac_address because
    it copies the other broken test code.

    This change breaks apart the monster attach/detach test method
    and converts the detach_interface portion to mock and fixes
    the broken assertion.

    test_detach_interface_device_with_same_mac_address is just
    fixed, not converted to mock.

    Change-Id: I6d9a975876c5652ef544c587f65b1bdd1543848b
    Related-Bug: #1607714
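
As a rough illustration of the testing point in this commit message: if the domain's detachDeviceFlags is replaced with a mock, the test can assert on the XML it receives. The sketch below uses a hypothetical `Guest` stand-in and a made-up XML string; it is not the actual nova test code.

```python
from unittest import mock


class Guest:
    """Hypothetical stand-in for a wrapper around a libvirt domain."""

    def __init__(self, domain):
        self._domain = domain

    def detach_device(self, device_xml, flags=0):
        # Forward the device XML to the (mocked) domain object.
        self._domain.detachDeviceFlags(device_xml, flags=flags)


def check_detach_passes_expected_xml():
    """Verify that the expected XML reaches detachDeviceFlags."""
    domain = mock.Mock()
    guest = Guest(domain)
    # Made-up XML for this sketch only.
    expected_xml = '<interface type="bridge"><target dev="tap0"/></interface>'
    guest.detach_device(expected_xml, flags=1)
    # Fails with AssertionError if the call was never made or the
    # arguments differ, which is the check the broken test was missing.
    domain.detachDeviceFlags.assert_called_once_with(expected_xml, flags=1)
    return domain
```

Without the mock in place, an assertion on the XML has nothing to inspect, which matches the problem the commit describes.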

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/448188

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/448189

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/349014
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a3b3e8d8314b0cedc2604be509f0f4d523a35ed5
Submitter: Jenkins
Branch: master

commit a3b3e8d8314b0cedc2604be509f0f4d523a35ed5
Author: Matt Riedemann <email address hidden>
Date: Thu Feb 9 15:54:41 2017 -0500

    libvirt: wait for interface detach from the guest

    The test_reassign_port_between_servers test in Tempest creates
    a port in neutron and two servers. It attaches the port to the
    first server and then quickly detaches it and waits for the
    port.device_id to be unbound from the server. Then it repeats
    that for the second server.

    The interface detach from the guest is asynchronous and happens
    before nova unbinds the port, so there is a race where the port's
    device_id is unset but the interface is still on the first guest
    when we try to attach to the second guest, which fails.

    This is a latent bug, but apparently has been tickled by the
    move to our neutron CI jobs to use ubuntu xenial nodes.

    The fix is to add a detach and retry loop on the interface detach
    on the guest so that we can wait until the interface is gone
    from the guest before nova unbinds the port in neutron, which is
    what the Tempest test is waiting for. Then the device should be
    available for attaching to the second guest.

    This is similar to what we do with detaching volumes.

    Closes-Bug: #1607714

    Change-Id: Ic04aad8923ea2edf1d16e32c208cd41fdf898834

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/470348

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/470349

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.0.0.0b2

This issue was fixed in the openstack/nova 16.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/448188
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c0820944ea8554f5b1db0d538781caf0e75b6b0e
Submitter: Jenkins
Branch: stable/ocata

commit c0820944ea8554f5b1db0d538781caf0e75b6b0e
Author: Matt Riedemann <email address hidden>
Date: Thu Feb 9 18:41:11 2017 -0500

    libvirt: fix and break up _test_attach_detach_interface

    The detach_interface flow in this test was broken because
    it wasn't mocking out domain.detachDeviceFlags so the xml
    it was expecting to be passed to that method wasn't actually
    being verified. The same thing is broken in test
    test_detach_interface_device_with_same_mac_address because
    it copies the other broken test code.

    This change breaks apart the monster attach/detach test method
    and converts the detach_interface portion to mock and fixes
    the broken assertion.

    test_detach_interface_device_with_same_mac_address is just
    fixed, not converted to mock.

    Change-Id: I6d9a975876c5652ef544c587f65b1bdd1543848b
    Related-Bug: #1607714
    (cherry picked from commit 1ecd71b08d14450e475dc9512d40828da6fcfe15)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/448189
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=02ad4f862a7c5b51100288b6b22f15087788d8d7
Submitter: Jenkins
Branch: stable/ocata

commit 02ad4f862a7c5b51100288b6b22f15087788d8d7
Author: Matt Riedemann <email address hidden>
Date: Thu Feb 9 15:54:41 2017 -0500

    libvirt: wait for interface detach from the guest

    The test_reassign_port_between_servers test in Tempest creates
    a port in neutron and two servers. It attaches the port to the
    first server and then quickly detaches it and waits for the
    port.device_id to be unbound from the server. Then it repeats
    that for the second server.

    The interface detach from the guest is asynchronous and happens
    before nova unbinds the port, so there is a race where the port's
    device_id is unset but the interface is still on the first guest
    when we try to attach to the second guest, which fails.

    This is a latent bug, but apparently has been tickled by the
    move to our neutron CI jobs to use ubuntu xenial nodes.

    The fix is to add a detach and retry loop on the interface detach
    on the guest so that we can wait until the interface is gone
    from the guest before nova unbinds the port in neutron, which is
    what the Tempest test is waiting for. Then the device should be
    available for attaching to the second guest.

    This is similar to what we do with detaching volumes.

    Closes-Bug: #1607714

    Change-Id: Ic04aad8923ea2edf1d16e32c208cd41fdf898834
    (cherry picked from commit a3b3e8d8314b0cedc2604be509f0f4d523a35ed5)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/470348
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=70c44831a5cdf605365a0b60994e2a458907d27f
Submitter: Jenkins
Branch: stable/newton

commit 70c44831a5cdf605365a0b60994e2a458907d27f
Author: Matt Riedemann <email address hidden>
Date: Thu Feb 9 18:41:11 2017 -0500

    libvirt: fix and break up _test_attach_detach_interface

    The detach_interface flow in this test was broken because
    it wasn't mocking out domain.detachDeviceFlags so the xml
    it was expecting to be passed to that method wasn't actually
    being verified. The same thing is broken in test
    test_detach_interface_device_with_same_mac_address because
    it copies the other broken test code.

    This change breaks apart the monster attach/detach test method
    and converts the detach_interface portion to mock and fixes
    the broken assertion.

    test_detach_interface_device_with_same_mac_address is just
    fixed, not converted to mock.

    Conflicts:
          nova/tests/unit/virt/libvirt/test_driver.py

    NOTE(mriedem): The conflict is due to change
    I5c461a8242c51994d12ce9c6774d5f956232f950 not being in Newton.

    Change-Id: I6d9a975876c5652ef544c587f65b1bdd1543848b
    Related-Bug: #1607714
    (cherry picked from commit 1ecd71b08d14450e475dc9512d40828da6fcfe15)
    (cherry picked from commit c0820944ea8554f5b1db0d538781caf0e75b6b0e)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/470349
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1e66b034eb2c7176d9e163d7944b04d1928e148f
Submitter: Jenkins
Branch: stable/newton

commit 1e66b034eb2c7176d9e163d7944b04d1928e148f
Author: Matt Riedemann <email address hidden>
Date: Thu Feb 9 15:54:41 2017 -0500

    libvirt: wait for interface detach from the guest

    The test_reassign_port_between_servers test in Tempest creates
    a port in neutron and two servers. It attaches the port to the
    first server and then quickly detaches it and waits for the
    port.device_id to be unbound from the server. Then it repeats
    that for the second server.

    The interface detach from the guest is asynchronous and happens
    before nova unbinds the port, so there is a race where the port's
    device_id is unset but the interface is still on the first guest
    when we try to attach to the second guest, which fails.

    This is a latent bug, but apparently has been tickled by the
    move to our neutron CI jobs to use ubuntu xenial nodes.

    The fix is to add a detach and retry loop on the interface detach
    on the guest so that we can wait until the interface is gone
    from the guest before nova unbinds the port in neutron, which is
    what the Tempest test is waiting for. Then the device should be
    available for attaching to the second guest.

    This is similar to what we do with detaching volumes.

    Closes-Bug: #1607714

    Conflicts:
          nova/tests/unit/virt/libvirt/test_driver.py

    NOTE(mriedem): The conflict is due to change
    I5c461a8242c51994d12ce9c6774d5f956232f950 not being in Newton.

    Change-Id: Ic04aad8923ea2edf1d16e32c208cd41fdf898834
    (cherry picked from commit a3b3e8d8314b0cedc2604be509f0f4d523a35ed5)
    (cherry picked from commit ca0a46e36615f227f91f92d746916bbf17d1143c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.7

This issue was fixed in the openstack/nova 15.0.7 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.8

This issue was fixed in the openstack/nova 14.0.8 release.
