test_legacy_router test cases sometimes fail with: Cannot find device "qg-*" when moving it into a namespace

Bug #1683227 reported by Ihar Hrachyshka
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Expired
High
Unassigned

Bug Description

http://logs.openstack.org/84/355284/2/check/gate-neutron-dsvm-functional-ubuntu-xenial/d7635bd/testr_results.html.gz

Traceback (most recent call last):
  File "neutron/tests/base.py", line 113, in func
    return f(self, *args, **kwargs)
  File "neutron/tests/functional/agent/l3/test_legacy_router.py", line 341, in test_connection_from_same_address_scope
    'scope1', 'scope1')
  File "neutron/tests/functional/agent/l3/test_legacy_router.py", line 318, in _setup_address_scope
    router = self.manage_router(self.agent, router_info)
  File "neutron/tests/functional/agent/l3/framework.py", line 319, in manage_router
    agent._process_added_router(router)
  File "neutron/agent/l3/agent.py", line 455, in _process_added_router
    ri.process()
  File "neutron/common/utils.py", line 186, in call
    self.logger(e)
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "neutron/common/utils.py", line 183, in call
    return func(*args, **kwargs)
  File "neutron/agent/l3/router_info.py", line 1114, in process
    self.process_external()
  File "neutron/agent/l3/router_info.py", line 889, in process_external
    self._process_external_gateway(ex_gw_port)
  File "neutron/agent/l3/router_info.py", line 772, in _process_external_gateway
    self.external_gateway_added(ex_gw_port, interface_name)
  File "neutron/agent/l3/router_info.py", line 722, in external_gateway_added
    ex_gw_port, interface_name, self.ns_name, preserve_ips)
  File "neutron/agent/l3/router_info.py", line 671, in _external_gateway_added
    self._plug_external_gateway(ex_gw_port, interface_name, ns_name)
  File "neutron/agent/l3/router_info.py", line 621, in _plug_external_gateway
    mtu=ex_gw_port.get('mtu'))
  File "neutron/agent/linux/interface.py", line 266, in plug
    bridge, namespace, prefix, mtu)
  File "neutron/agent/linux/interface.py", line 380, in plug_new
    namespace_obj.add_device_to_namespace(ns_dev)
  File "neutron/agent/linux/ip_lib.py", line 231, in add_device_to_namespace
    device.link.set_netns(self.namespace)
  File "neutron/agent/linux/ip_lib.py", line 519, in set_netns
    self._as_root([], ('set', self.name, 'netns', namespace))
  File "neutron/agent/linux/ip_lib.py", line 367, in _as_root
    use_root_namespace=use_root_namespace)
  File "neutron/agent/linux/ip_lib.py", line 99, in _as_root
    log_fail_as_error=self.log_fail_as_error)
  File "neutron/agent/linux/ip_lib.py", line 108, in _execute
    log_fail_as_error=log_fail_as_error)
  File "neutron/agent/linux/utils.py", line 151, in execute
    raise ProcessExecutionError(msg, returncode=returncode)
neutron.agent.linux.utils.ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot find device "qg-74372988-7f"

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Interestingly, the device was present just before the failure, when setting MAC address:

2017-04-17 03:24:57.197 3562 DEBUG neutron.agent.linux.utils [req-e6df0f81-aea0-42d4-a857-5b0514d95952 - - - - -] Running command (rootwrap daemon): ['ip', 'link', 'set', 'qg-74372988-7f', 'address', 'ca:fe:de:ad:be:ee'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
2017-04-17 03:24:57.198 3562 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 19 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:246
2017-04-17 03:24:57.216 3562 DEBUG neutron.agent.linux.utils [req-e6df0f81-aea0-42d4-a857-5b0514d95952 - - - - -] Exit code: 0 execute neutron/agent/linux/utils.py:153

And if we look into vswitchd logs, we see the device added twice:

2017-04-17T03:24:57.134Z|01487|bridge|INFO|bridge test-brd1242ba3: added interface qg-74372988-7f on port 1
[...]
2017-04-17T03:24:57.306Z|01492|bridge|INFO|bridge test-brd1242ba3: added interface qg-74372988-7f on port 1

Note the failure happened between those two.

Changed in neutron:
importance: Undecided → High
status: New → Confirmed
Miguel Lavalle (minsel)
Changed in neutron:
assignee: nobody → Miguel Lavalle (minsel)
Revision history for this message
Miguel Lavalle (minsel) wrote :
Changed in neutron:
status: Confirmed → Incomplete
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :
Changed in neutron:
status: Incomplete → Confirmed
summary: - test_connection_from_same_address_scope failed with: Cannot find device
- "qg-74372988-7f"
+ test_legacy_router test cases sometimes fail with: Cannot find device
+ "qg-*" when moving it into a namespace
Revision history for this message
Miguel Lavalle (minsel) wrote :
Revision history for this message
Miguel Lavalle (minsel) wrote :

Analyzing the failure in note #3, I can see that the sequence of events is exactly the same as described in note #1, with the only difference that in one case the port being being connected is the qg-* versus the qr-*. The failures happen when plugging the ports to the OVS bridge, specifically between adding the port to the bridge and adding it to the namespace here: https://github.com/openstack/neutron/blob/master/neutron/agent/linux/interface.py#L360-L380.

The sequence of events is the following:

1) Interface added to the bridge
2) OVS logs the addition of the interface: 2017-07-04T03:17:42.066Z|02473|bridge|INFO|bridge test-brc89e8c8a: added interface qr-c3caa17a-3e on port 1
3) The interface MAC address is set correctly
4) Moving the interface to the namespace fails
5) OVS logs again the addition of the interface: 2017-07-04T03:17:42.145Z|02478|bridge|INFO|bridge test-brc89e8c8a: added interface qr-c3caa17a-3e on port 1

I checked a successful execution of the test case pointed out by note #3 (neutron.tests.functional.agent.l3.test_legacy_router.L3AgentTestCase.test_legacy_router_gateway_update_to_none) and the addition of the interface is logged by OVS only once. So it seems that we have another case of the "shy port syndrome" that was described here: https://review.openstack.org/#/c/371032/.

Next steps:

1) Follow up with otherwiseguy to see if a fix has been created for what seems to be an underlying OVS problem
2) If step 1 fails, maybe extend the solution in https://review.openstack.org/#/c/371032/ to encompass the addition of the interface to the namespace

Revision history for this message
Miguel Lavalle (minsel) wrote :

otherwiseguy indicated the fix for this was supposed to be: https://review.openstack.org/#/c/344859

Revision history for this message
Miguel Lavalle (minsel) wrote :

Spotted another hit in Kibana: http://logs.openstack.org/31/480431/4/check/gate-neutron-dsvm-functional-ubuntu-xenial/e0a47db/console.html#_2017-07-12_08_15_34_920780. Seems to be happening at a rate of about once every 7 days.

Combining entries from the failed functional test (neutron.tests.functional.agent.l3.test_ha_router.L3HATestFailover.test_ha_router_failover) log and the ovs-vswitchd log, we have the following sequence of events: http://paste.openstack.org/show/615193/

Next step is to file a bug with the OVS folks

Revision history for this message
Miguel Lavalle (minsel) wrote :

14 days without a hit in Kibana using query in #4. Will continue monitoring

Revision history for this message
Miguel Lavalle (minsel) wrote :

This 21st day that no hist are found in Kibana with query in #4. Closing it. Please feel free to reopen and report other occurrences if necessary

Changed in neutron:
status: Confirmed → Incomplete
Changed in neutron:
assignee: Miguel Lavalle (minsel) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.