test_connection_from_diff_address_scope intermittent "Cannot find device" errors

Bug #1618987 reported by Assaf Muller
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Critical
Kevin Benton

Bug Description

Example TRACE:
http://logs.openstack.org/58/360858/4/check/gate-neutron-dsvm-functional/3fb0ba3/console.html#_2016-08-30_23_25_18_854125

It looks like OVSDB adds an internal OVS port, then when we try to set the MTU, the linux network stack cannot find the device.

I'm under the impression that https://review.openstack.org/#/c/344859/ was supposed to solve this class of issues.

Revision history for this message
Brian Haley (brian-haley) wrote :

It's actually trying to set the MAC address, either way the port isn't there.

So I'm pretty good at finding red herrings. Since there was a port added right before this one I found the name in ovs-vswitchd.log. There were only two occurrences:

2016-08-30T23:14:21.077Z|00731|bridge|INFO|bridge test-brf4f80ab5: added interface portcfbbde on port 3
2016-08-30T23:14:21.243Z|00735|netdev_linux|INFO|ioctl(SIOCGIFHWADDR) on portcfbbde device failed: No such device

porta0d188 had two hits as well, but both were adding it:

2016-08-30T23:14:21.231Z|00732|bridge|INFO|bridge test-br103b76e6: added interface porta0d188 on port 5
2016-08-30T23:14:21.350Z|00740|bridge|INFO|bridge test-br103b76e6: added interface porta0d188 on port 5

The log is pretty big, but I couldn't find another occurrence where a port was added twice in a row, seemed odd.

The functional test log seems to show a clean DelPortCommand/AddPortCommand/DbSetCommand for the port though.

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

There are cases where the linked patch won't "fix" the issue--especially if the calling code doesn't ignores the Exception when ovs-vswitchd is down and things time out. Looking at the functional test logs, I don't see the timeout exception logged, so I don't *think* that is what happened here.

I wrote a little script ( https://gist.github.com/otherwiseguy/e1784525de342edc3e361a33407437b0 ) that just does a replace_port call followed by an ip link set address call just like plug() does and let it run several runs of 10k loops, and one of ~50k without running into an issue. The only way I can get the RuntimeError from ip_lib is if I catch and ignore the TimeoutException.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Terry,

Can you take a look at the failure here:

http://logs.openstack.org/95/366495/1/check/gate-neutron-dsvm-functional-ubuntu-trusty/c10f8aa/logs/dsvm-functional-logs/neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_router_lifecycle_without_ha_with_snat_with_fips.txt.gz#_2016-09-07_05_38_49_471

I don't see any error being swallowed and it's all happening in less than a second so I don't think it would be a timeout being ignored.

Changed in neutron:
importance: Low → Critical
importance: Critical → High
Revision history for this message
Kevin Benton (kevinbenton) wrote :

I'm upgrading to High for now because this is breaking other tests now.

Changed in neutron:
milestone: none → newton-rc1
tags: added: gate-failure
Changed in neutron:
importance: High → Critical
Revision history for this message
Terry Wilson (otherwiseguy) wrote :

Whatever the issue is, my test script won't reproduce it. It successfully ran replace_port/set_address over 780,000 times this weekend.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

I wonder if this was because our build script was broken for OVS and it wasn't being compiled so we got the old smelly version...

https://github.com/openstack/neutron/commit/844b799e35ecda297d82d88c83f2df7439827b33

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

Kevin: From the log you posted earlier, the ovs-vswitchd log looks like it is saying 2.5.1: http://logs.openstack.org/95/366495/1/check/gate-neutron-dsvm-functional-ubuntu-trusty/c10f8aa/logs/openvswitch/ovs-vswitchd.txt.gz

Revision history for this message
Kevin Benton (kevinbenton) wrote :
Download full text (3.4 KiB)

Check out the logs from http://logs.openstack.org/58/360858/4/check/gate-neutron-dsvm-functional/3fb0ba3/:

The port briefly appears, disappears, then returns about 10 seconds later. ovs-vswitchd:

2016-08-30T23:14:21.077Z|00731|bridge|INFO|bridge test-brf4f80ab5: added interface portcfbbde on port 3
2016-08-30T23:14:21.231Z|00732|bridge|INFO|bridge test-br103b76e6: added interface porta0d188 on port 5
2016-08-30T23:14:21.233Z|00733|ofproto|WARN|test-brd2a45bde: cannot get STP status on nonexistent port 1
2016-08-30T23:14:21.233Z|00734|ofproto|WARN|test-brd2a45bde: cannot get RSTP status on nonexistent port 1
2016-08-30T23:14:21.243Z|00735|netdev_linux|INFO|ioctl(SIOCGIFHWADDR) on portcfbbde device failed: No such device
2016-08-30T23:14:21.279Z|00736|bridge|WARN|could not open network device test-veth1cc1f5 (No such device)
2016-08-30T23:14:21.281Z|00737|bridge|INFO|bridge test-br13e2a48f: added interface fg-d2460f21-26 on port 1
2016-08-30T23:14:21.330Z|00738|bridge|INFO|bridge test-brd2a45bde: added interface test8bc7fc8a on port 2
2016-08-30T23:14:21.345Z|00739|bridge|WARN|could not open network device test-veth1cc1f5 (No such device)
2016-08-30T23:14:21.350Z|00740|bridge|INFO|bridge test-br103b76e6: added interface porta0d188 on port 5

Corresponding timing on functional test:

2016-08-30 23:14:21.081 28059 DEBUG neutron.agent.linux.utils [req-9b1471ca-d98a-4ae2-a406-0617a68f0d78 - - - - -] Running command: ['ip', '-o', 'link', 'show', 'test-br103b76e6'] create_process neutron/agent/linux/utils.py:83
2016-08-30 23:14:21.104 28059 DEBUG neutron.agent.linux.utils [req-9b1471ca-d98a-4ae2-a406-0617a68f0d78 - - - - -] Exit code: 0 execute neutron/agent/linux/utils.py:140
2016-08-30 23:14:21.106 28059 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): DelPortCommand(bridge=None, if_exists=True, port=porta0d188) do_commit neutron/agent/ovsdb/impl_idl.py:98
2016-08-30 23:14:21.106 28059 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction caused no change do_commit neutron/agent/ovsdb/impl_idl.py:125
2016-08-30 23:14:21.107 28059 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): AddPortCommand(bridge=test-br103b76e6, port=porta0d188, may_exist=False) do_commit neutron/agent/ovsdb/impl_idl.py:98
2016-08-30 23:14:21.108 28059 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=1): DbSetCommand(table=Interface, record=porta0d188, col_values=(('type', 'internal'), ('external_ids', {'iface-status': 'active', 'iface-id': '91ce28ed-5bda-4371-82f9-91ca73d2d4b0', 'attached-mac': 'fa:16:3e:53:78:2d'}))) do_commit neutron/agent/ovsdb/impl_idl.py:98
2016-08-30 23:14:21.247 28059 DEBUG neutron.agent.linux.utils [req-9b1471ca-d98a-4ae2-a406-0617a68f0d78 - - - - -] Running command (rootwrap daemon): ['ip', 'link', 'set', 'porta0d188', 'address', 'fa:16:3e:53:78:2d'] execute_rootwrap_daemon neutron/agent/linux/utils.py:99
2016-08-30 23:14:21.254 28059 ERROR neutron.agent.linux.utils [req-9b1471ca-d98a-4ae2-a406-0617a68f0d78 - - - - -] Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot find device "porta0d188"

So ovs-vswitchd claims it added it at 21.231, but then the link set command issued from the agent at 21.247 s...

Read more...

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I cannot find a good logstash query to look into recent failures. If we can't pull this off in time for RC1, this should probably be demoted.

Changed in neutron:
milestone: newton-rc1 → ocata-1
Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
status: Confirmed → In Progress
Changed in neutron:
milestone: ocata-1 → newton-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/371032
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a75ce6850f3954edafbb0c128750e39b57875743
Submitter: Jenkins
Branch: master

commit a75ce6850f3954edafbb0c128750e39b57875743
Author: Kevin Benton <email address hidden>
Date: Thu Sep 15 01:33:00 2016 -0700

    Retry setting mac address on new OVS port 10 times

    We've seen several times in the gate an OVS add port call
    succeed and then have the mac address set fail to find it
    seconds later. The vswitch log frequently shows that it
    returns milliseconds later. Until we get to the bottom of
    it, we should just retry several times before giving up
    and raising.

    Closes-Bug: #1618987
    Partial-Bug: #1623732
    Change-Id: Ia73a9be047093c02f61e3e9ce13d98dccd49dfeb

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0rc1

This issue was fixed in the openstack/neutron 9.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.