pyroute2.NetNS don't work properly with concurrency in oslo.privsep

Bug #1811515 reported by Slawek Kaplonski
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Critical
Slawek Kaplonski

Bug Description

It is quite common case that tempest/scenario test jobs are failing because SSH to VM through floating IP fails.
It looks that in many (I'm not sure if all) case there are some errors from pyroute2 library in L3 agent logs and that looks like reason of many of such SSH failures.

Example of such fail in logs is e.g. in http://logs.openstack.org/43/628943/2/check/tempest-full/b955419/controller/logs/screen-q-l3.txt.gz#_Jan_11_23_41_31_955036

Stacktrace looks like below:

Jan 11 23:41:31.954562 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR pyroute2.netns.nslink [-] forced shutdown procedure, clean up netns manually
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: DEBUG oslo.privsep.daemon [-] privsep: Exception during request[139718939505360]: [Errno 9] Bad file descriptor {{(pid=6501) _process_cmd /usr/local/lib/python2.7/dist-packages/oslo_privsep/daemon.py:452}}
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: Traceback (most recent call last):
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: File "/usr/local/lib/python2.7/dist-packages/oslo_privsep/daemon.py", line 447, in _process_cmd
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ret = func(*f_args, **f_kwargs)
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: File "/usr/local/lib/python2.7/dist-packages/oslo_privsep/priv_context.py", line 216, in _wrap
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: return func(*args, **kwargs)
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: File "/opt/stack/neutron/neutron/privileged/agent/linux/ip_lib.py", line 244, in add_ip_address
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: scope=_get_scope_name(scope))
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: File "/opt/stack/neutron/neutron/privileged/agent/linux/ip_lib.py", line 220, in _run_iproute_addr
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: with _get_iproute(namespace) as ip:
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: File "/opt/stack/neutron/neutron/privileged/agent/linux/ip_lib.py", line 169, in _get_iproute
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: return pyroute2.NetNS(namespace, flags=0)
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: File "/usr/local/lib/python2.7/dist-packages/pyroute2/netns/nslink.py", line 170, in __init__
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: super(NetNS, self).__init__(trnsp_in, trnsp_out)
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: File "/usr/local/lib/python2.7/dist-packages/pyroute2/iproute/linux.py", line 127, in __init__
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: super(RTNL_API, self).__init__(*argv, **kwarg)
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: File "/usr/local/lib/python2.7/dist-packages/pyroute2/remote/__init__.py", line 196, in __init__
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: init = self.trnsp_in.recv_cmd()
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: File "/usr/local/lib/python2.7/dist-packages/pyroute2/remote/__init__.py", line 77, in recv_cmd
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ret = self.__recv()
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: File "/usr/local/lib/python2.7/dist-packages/pyroute2/remote/__init__.py", line 51, in __recv
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: dump.write(self.file_obj.read(length - 8))
Jan 11 23:41:31.955036 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: OSError: [Errno 9] Bad file descriptor
Jan 11 23:41:31.957344 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: DEBUG oslo.privsep.daemon [-] privsep: reply[139718939505360]: (5, 'exceptions.OSError', (9, 'Bad file descriptor')) {{(pid=6501) _call_back /usr/local/lib/python2.7/dist-packages/oslo_privsep/daemon.py:473}}
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info [-] [Errno 88] Socket operation on non-socket: error: [Errno 88] Socket operation on non-socket
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info Traceback (most recent call last):
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info File "/opt/stack/neutron/neutron/common/utils.py", line 158, in call
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info return func(*args, **kwargs)
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info File "/opt/stack/neutron/neutron/agent/l3/router_info.py", line 1175, in process
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info self._process_internal_ports()
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info File "/opt/stack/neutron/neutron/agent/l3/router_info.py", line 560, in _process_internal_ports
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info self.internal_network_added(p)
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info File "/opt/stack/neutron/neutron/agent/l3/router_info.py", line 477, in internal_network_added
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info mtu=port.get('mtu'))
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info File "/opt/stack/neutron/neutron/agent/l3/router_info.py", line 452, in _internal_network_added
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info prefix=prefix, mtu=mtu)
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 276, in plug
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info bridge, namespace, prefix, mtu)
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 359, in plug_new
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info self.check_bridge_exists(bridge)
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 233, in check_bridge_exists
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info if not ip_lib.device_exists(bridge):
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 799, in device_exists
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info return IPDevice(device_name, namespace=namespace).exists()
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 305, in exists
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info return privileged.interface_exists(self.name, self.namespace)
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info File "/usr/local/lib/python2.7/dist-packages/oslo_privsep/priv_context.py", line 214, in _wrap
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info return self.channel.remote_call(name, args, kwargs)
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info File "/usr/local/lib/python2.7/dist-packages/oslo_privsep/daemon.py", line 204, in remote_call
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info raise exc_type(*result[2])
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info error: [Errno 88] Socket operation on non-socket
Jan 11 23:41:31.966453 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.router_info
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent [-] Failed to process compatible router: a4f60f08-8daa-4361-8b85-1a9b770ee03c: error: [Errno 88] Socket operation on non-socket
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent Traceback (most recent call last):
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/l3/agent.py", line 623, in _process_routers_if_compatible
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent self._process_router_if_compatible(router)
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/l3/agent.py", line 498, in _process_router_if_compatible
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent self._process_added_router(router)
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/l3/agent.py", line 506, in _process_added_router
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent ri.process()
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/common/utils.py", line 161, in call
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent self.logger(e)
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent self.force_reraise()
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent six.reraise(self.type_, self.value, self.tb)
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/common/utils.py", line 158, in call
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent return func(*args, **kwargs)
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/l3/router_info.py", line 1175, in process
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent self._process_internal_ports()
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/l3/router_info.py", line 560, in _process_internal_ports
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent self.internal_network_added(p)
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/l3/router_info.py", line 477, in internal_network_added
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent mtu=port.get('mtu'))
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/l3/router_info.py", line 452, in _internal_network_added
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent prefix=prefix, mtu=mtu)
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 276, in plug
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent bridge, namespace, prefix, mtu)
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 359, in plug_new
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent self.check_bridge_exists(bridge)
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 233, in check_bridge_exists
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent if not ip_lib.device_exists(bridge):
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 799, in device_exists
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent return IPDevice(device_name, namespace=namespace).exists()
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 305, in exists
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent return privileged.interface_exists(self.name, self.namespace)
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/usr/local/lib/python2.7/dist-packages/oslo_privsep/priv_context.py", line 214, in _wrap
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent return self.channel.remote_call(name, args, kwargs)
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent File "/usr/local/lib/python2.7/dist-packages/oslo_privsep/daemon.py", line 204, in remote_call
Jan 11 23:41:31.973050 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent raise exc_type(*result[2])
Jan 11 23:41:31.977173 ubuntu-bionic-rax-iad-0001661199 neutron-l3-agent[5600]: ERROR neutron.agent.l3.agent error: [Errno 88] Socket operation on non-socket

It don't look like related to recent Oslo.privsep update as according to logstash it happened also before oslo.privsep 1.31.0.
I don't know since when it is like that but looks that recently it started to happens more often than in the past.
It looks for me like some race between threads/workers that one tries to read (or write sometimes) something from opened file descriptor but something else already closed it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/630516

Revision history for this message
Miguel Lavalle (minsel) wrote : Re: SSH to FIP fails in CI jobs

I went carefully over the L3 agent logs (controller and compute) of recent failures of test_trunk_subport_lifecycle in neutron-tempest-plugin-dvr-multinode-scenario. While in each file I find one or two instances of the traceback you report above (for example http://logs.openstack.org/92/628492/2/check/neutron-tempest-plugin-dvr-multinode-scenario/8c51f22/controller/logs/screen-q-l3.txt.gz#_Jan_13_18_30_08_603008), I find hard to see how those few instances can be the cause of the many failed test cases in each execution of the job. I find several other more common errors in the log files such as:

http://logs.openstack.org/92/628492/2/check/neutron-tempest-plugin-dvr-multinode-scenario/8c51f22/controller/logs/screen-q-l3.txt.gz#_Jan_13_18_34_01_226132

http://logs.openstack.org/92/628492/2/check/neutron-tempest-plugin-dvr-multinode-scenario/8c51f22/controller/logs/screen-q-l3.txt.gz#_Jan_13_18_34_46_644426

http://logs.openstack.org/92/628492/2/check/neutron-tempest-plugin-dvr-multinode-scenario/8c51f22/compute1/logs/screen-q-l3.txt.gz#_Jan_13_18_19_45_585422

As a next step, I am going to relate any of these tracebacks to the failure of trunk_subport_lifecycle in neutron-tempest-plugin-dvr-multinode-scenario. In the mean time, I suggest you continue investigating the traceback you report in this bug. I want to point out that some of the instances I found of this bug (for example http://logs.openstack.org/92/628492/2/check/neutron-tempest-plugin-dvr-multinode-scenario/8c51f22/controller/logs/screen-q-l3.txt.gz#_Jan_13_18_30_08_603008) have a slightly different form indicating that it occurred trying to set up a floating IP. See if that gives you useful information

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I think that issue which I mentioned is reason of failed tests when it happens during adding new interface to router, like in this example http://logs.openstack.org/43/628943/2/check/tempest-full/07fe3fe/controller/logs/screen-q-l3.txt.gz?level=ERROR#_Jan_13_17_57_18_302877 or something similar.
If it happens during e.g. removing router or listing devices it not cause failures.

That's why IMO it not happens always.

What I also found is that always just before such error, there is error "forced shutdown procedure, clean up netns manually" which comes from pyroute2: https://github.com/svinota/pyroute2/blob/ad992d14f47d82b39b9de3801210b51386fe42eb/pyroute2/netns/nslink.py#L199 and I suspect that maybe when one call is closing namespace and it fails, pyroute2 is doing something wrong and closes also other file descriptors which cause issue in other worker which wants to do something in namespace.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Is it possible to get an elastic-recheck query for this bug since it seems to be affecting a relatively high number of CI jobs.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Download full text (5.3 KiB)

Looking at some failures in logstash:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22OSError%3A%20%5BErrno%209%5D%20Bad%20file%20descriptor%5C%22%20AND%20build_status%3A%5C%22FAILURE%5C%22

It appears this started around Jan 9. Looking at changes to neutron and upper-constraints for dependent package versions we test with, these stand out given there are privsep errors in the neutron l3 agent logs:

https://github.com/openstack/neutron/commit/c6d358d4c6926638fe9d5194e3da112c2750c6a4

https://github.com/openstack/requirements/commit/6b45c47e53b8820b68ff78eaec8062c4fdf05a56#diff-0bdd949ed8a7fdd4f95240bd951779c8

In fact, the upper-constraints change depends on the neutron change, so I'm guessing those are the roots of the problem and probably require some more debugging. We do see privsep failures in the l3 agent log:

http://logs.openstack.org/43/628943/2/check/tempest-full/07fe3fe/controller/logs/screen-q-l3.txt.gz#_Jan_13_17_57_16_887623

Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: DEBUG oslo.privsep.daemon [-] privsep: Exception during request[139657795965328]: 255 {{(pid=5905) _process_cmd /usr/local/lib/python2.7/dist-packages/oslo_privsep/daemon.py:452}}
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: Traceback (most recent call last):
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: File "/usr/local/lib/python2.7/dist-packages/oslo_privsep/daemon.py", line 447, in _process_cmd
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: ret = func(*f_args, **f_kwargs)
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: File "/usr/local/lib/python2.7/dist-packages/oslo_privsep/priv_context.py", line 216, in _wrap
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: return func(*args, **kwargs)
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: File "/opt/stack/neutron/neutron/privileged/agent/linux/ip_lib.py", line 331, in set_link_attribute
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: return _run_iproute_link("set", device, namespace, **attributes)
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: File "/opt/stack/neutron/neutron/privileged/agent/linux/ip_lib.py", line 194, in _run_iproute_link
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: return ip.link(command, index=idx, **kwargs)
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: File "/usr/local/lib/python2.7/dist-packages/pyroute2/iproute/linux.py", line 1115, in link
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: msg_flags=msg_flags)
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-l3-agent[4880]: File "/usr/local/lib/python2.7/dist-packages/pyroute2/netlink/nlsocket.py", line 352, in nlm_request
Jan 13 17:57:16.887623 ubuntu-bionic-vexxhost-sjc1-0001673055 neutron-...

Read more...

Revision history for this message
Miguel Lavalle (minsel) wrote :
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

@Matt: yes, I think that there is some issue in pyroute2 library when it's working in multithreading environment. And oslo.privsep 1.31.0 which was bumped on 9.01 has such new feature. So IMO this update of oslo.privsep was somehow trigger of issue which we wasn't aware before.

@Miguel:
Yes, I saw that issue also in DHCP client's logs. Basically DHCP agent is using same ip_lib module to do some operations in namespaces and that's why it happens also there.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I think I found some kind of reproducer of this issue. I run script like: http://paste.openstack.org/show/742683/ on Ubuntu 18.04 with kernel 4.15.0-43-generic and from time to time it happened that I got same OSError raised there.

Also Rodolfo was looking into it and he things that reason of issue are lines https://github.com/svinota/pyroute2/blob/master/pyroute2/netns/nslink.py#L146-L147 which closes file descriptors opened by parent process from child process. We have to check if it will work better without that.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/631075

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/631275

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
status: Confirmed → In Progress
Revision history for this message
Ben Nemec (bnemec) wrote : Re: SSH to FIP fails in CI jobs

Hmm, according to what I've read forking in threads is a bad thing to do. I wonder if we are going to need to provide a method in privsep to run calls in the main thread instead of threading. :-/

summary: - SSH to FIP fails in CI jobs
+ pyroute2.NetNS don't work properly with concurrency in oslo.privsep
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/631275
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=284afcf24f2dfd3af20a803ada79c69205249f25
Submitter: Zuul
Branch: master

commit 284afcf24f2dfd3af20a803ada79c69205249f25
Author: Slawek Kaplonski <email address hidden>
Date: Wed Jan 16 17:13:08 2019 +0100

    Lock privileged.agent.linux.ip_lib functions

    It is workaround of bug in pyroute2 library which, when
    running in multithread environment, sometimes have issues
    with NetNS class.
    When NetNS.__init__() is called, it uses os.pipe() function to
    create 2 file descriptors which are used to communicated between
    2 processes.
    In some cases when multiple threads are running it might happen
    that in two instances of NetNS() class there will be same file
    descriptors used and that leads to problems when one thread
    closes file descriptor and second still wants to use it.

    With this patch functions which uses instance of pyroute2.NetNS class
    are locked thus there shouldn't be risk of using same file descriptors
    in 2 separate threads.

    Co-Authored-By: Rodolfo Alonso Hernandez <email address hidden>

    Change-Id: Id5e6f2f8e9c31a7138da9cd6792e9d75845b81c5
    Closes-Bug: #1811515

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/630516

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/631075

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0b2

This issue was fixed in the openstack/neutron 14.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/640499

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/rocky)

Change abandoned by Nate Johnston (<email address hidden>) on branch: stable/rocky
Review: https://review.openstack.org/640499
Reason: Slawek makes a good point that this is not needed. Abandoning.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/657752

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/657752
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d2d57371dc7dd35b1f66eb6390d96dfc512eeca1
Submitter: Zuul
Branch: master

commit d2d57371dc7dd35b1f66eb6390d96dfc512eeca1
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu May 9 13:19:11 2019 +0000

    Bump Pyroute2 version to 0.5.5

    Pyroute2 is modifying the default logger, adding NullHandler to it. If
    the logger is not properly configured, like in any service or agent
    spawn in Neutron, the only handler will be NullHandler. This prevents
    from rendering the message passed, as when StreamHandler is used.

    This issue is present only in version 0.5.4 and 0.5.5. Current Pyroute2
    master version implements a new logger which removes this problem.

    [1]https://github.com/svinota/pyroute2/blob/0.5.5/pyroute2/__init__.py#L78-L79

    Change-Id: Ic89d8503e9d1a7f622f0c3a148bf78766b92ed08
    Related-Bug: #1811515

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/666845

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/666845
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8ecad3ca4b3ebb655440ad926de5029e0d2ccc1b
Submitter: Zuul
Branch: master

commit 8ecad3ca4b3ebb655440ad926de5029e0d2ccc1b
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Fri Jun 21 13:57:26 2019 +0000

    Remove some synchronization decorators from ip_lib

    Removed the synchronization from the following ip_lib functions:
    * create_netns
    * remove_netns
    * list_netns

    The mentioned bug #1811515 and the corresponding patch [1], explain
    the problem of Pyroute2 NetNS class and the file descriptors used
    in a multithread environment. This is why, until a fix is applied to
    this library, a synchronization decorator was applied to all Pyroute2
    commands.

    However, the listed functions do not instantiate this class. These three
    methods only handle the filesystem in order to create, remove or list
    the system namespaces. That means those methods are not affected by the
    mentioned bug in Pyroute2 and it is safe to execute them without any
    synchronization.

    [1] https://review.opendev.org/#/c/631275/

    Change-Id: I71521efbdaf9bb6d7cd9650f77c52bf6e02c9ced
    Closes-Bug: #1833717
    Related-Bug: #1811515

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/668682

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/669199

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/668682
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=eaf89a49fdc8774ea380133ac2da246cb93b3fe5
Submitter: Zuul
Branch: master

commit eaf89a49fdc8774ea380133ac2da246cb93b3fe5
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Tue Jul 2 14:22:19 2019 +0000

    Do not use privsep context when listing the namespaces in functional

    The aim of this patch is to speed up the namespace listing and
    avoid overloading the privsep daemon during the functional tests.

    Change-Id: Icd1de10a3a58d2c46d5400db95c2cd84809d6221
    Related-Bug: #1833717
    Related-Bug: #1811515

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/669199
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9c5f394b73812fb748695eb0834e2b1b569b7876
Submitter: Zuul
Branch: master

commit 9c5f394b73812fb748695eb0834e2b1b569b7876
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Mon Jul 8 11:11:20 2019 +0000

    Do not use privsep context when listing the namespaces in fullstack

    The aim of this patch is to speed up the namespace listing and
    avoid overloading the privsep daemon during the fullstack tests.

    Change-Id: I95f024684c864aa75506b08a5f904a55267d09a8
    Related-Bug: #1833717
    Related-Bug: #1811515

tags: added: neutron-easy-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/679462

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/679462
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dabab08597c0b9726b520a28c177e9c11682b5a5
Submitter: Zuul
Branch: stable/stein

commit dabab08597c0b9726b520a28c177e9c11682b5a5
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Mon Jul 8 11:11:20 2019 +0000

    Do not use privsep context when listing the namespaces in fullstack

    The aim of this patch is to speed up the namespace listing and
    avoid overloading the privsep daemon during the fullstack tests.

    Change-Id: I95f024684c864aa75506b08a5f904a55267d09a8
    Related-Bug: #1833717
    Related-Bug: #1811515
    (cherry picked from commit 9c5f394b73812fb748695eb0834e2b1b569b7876)

tags: added: in-stable-stein
tags: removed: neutron-easy-proactive-backport-potential neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/882376

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/882376
Committed: https://opendev.org/openstack/neutron/commit/6eb70068014c12ac887163463c84c9bf4fe398c4
Submitter: "Zuul (22348)"
Branch: master

commit 6eb70068014c12ac887163463c84c9bf4fe398c4
Author: yatinkarel <email address hidden>
Date: Fri May 5 12:08:16 2023 +0530

    Drop retries in tests for TimeoutException

    It was originally added due to [1] which is
    already reverted[2] as pyroute2>=0.5.5 included
    the fix, so the retries hack can also be removed.

    [1] https://review.opendev.org/c/openstack/neutron/+/631275
    [2] https://review.opendev.org/c/openstack/neutron/+/698854

    Related-Bug: #1811515
    Change-Id: I6d7b1c82c19d93c4a24f253bdc4c5b6164cfbaaa

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.