neutron-functional tests failing with oslo.privsep 1.31

Bug #1810518 reported by Slawek Kaplonski on 2019-01-04
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Critical
Ben Nemec
oslo.privsep
Critical
Ben Nemec

Bug Description

Example of failure: http://logs.openstack.org/99/597599/7/gate/neutron-functional/87a8d1c/logs/testr_results.html.gz

It's similar error in all tests:

DEBUG [stevedore.extension] found extension EntryPoint.parse('kombu = oslo_messaging._drivers.impl_rabbit:RabbitDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('fake = oslo_messaging._drivers.impl_fake:FakeDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('rabbit = oslo_messaging._drivers.impl_rabbit:RabbitDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('amqp = oslo_messaging._drivers.impl_amqp1:ProtonDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('kafka = oslo_messaging._drivers.impl_kafka:KafkaDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('kombu = oslo_messaging._drivers.impl_rabbit:RabbitDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('fake = oslo_messaging._drivers.impl_fake:FakeDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('rabbit = oslo_messaging._drivers.impl_rabbit:RabbitDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('amqp = oslo_messaging._drivers.impl_amqp1:ProtonDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('kafka = oslo_messaging._drivers.impl_kafka:KafkaDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('kombu = oslo_messaging._drivers.impl_rabbit:RabbitDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('fake = oslo_messaging._drivers.impl_fake:FakeDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('rabbit = oslo_messaging._drivers.impl_rabbit:RabbitDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('amqp = oslo_messaging._drivers.impl_amqp1:ProtonDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('kafka = oslo_messaging._drivers.impl_kafka:KafkaDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('kombu = oslo_messaging._drivers.impl_rabbit:RabbitDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('fake = oslo_messaging._drivers.impl_fake:FakeDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('rabbit = oslo_messaging._drivers.impl_rabbit:RabbitDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('amqp = oslo_messaging._drivers.impl_amqp1:ProtonDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('kafka = oslo_messaging._drivers.impl_kafka:KafkaDriver')
   DEBUG [stevedore.extension] found extension EntryPoint.parse('neutron = neutron.conf.policies:list_rules')
   DEBUG [neutron.policy] Loaded default policies from ['neutron'] under neutron.policies entry points
   DEBUG [oslo_policy.policy] The policy file policy.json could not be found.
}}}

traceback-1: {{{
Traceback (most recent call last):
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/fixtures/fixture.py", line 208, in setUp
    raise SetupError(details)
fixtures.fixture.SetupError: {}
}}}

traceback-2: {{{
Traceback (most recent call last):
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/fixtures/fixture.py", line 208, in setUp
    raise SetupError(details)
fixtures.fixture.SetupError: {}
}}}

traceback-3: {{{
Traceback (most recent call last):
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/fixtures/fixture.py", line 208, in setUp
    raise SetupError(details)
fixtures.fixture.SetupError: {}
}}}

Traceback (most recent call last):
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/fixtures/fixture.py", line 197, in setUp
    self._setUp()
  File "neutron/tests/common/net_helpers.py", line 571, in _setUp
    self.ip_wrapper = ip.ensure_namespace(self.name)
  File "neutron/agent/linux/ip_lib.py", line 224, in ensure_namespace
    if not self.netns.exists(name):
  File "neutron/agent/linux/ip_lib.py", line 778, in exists
    return network_namespace_exists(name)
  File "neutron/agent/linux/ip_lib.py", line 982, in network_namespace_exists
    output = list_network_namespaces(**kwargs)
  File "neutron/agent/linux/ip_lib.py", line 971, in list_network_namespaces
    return privileged.list_netns(**kwargs)
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 214, in _wrap
    return self.channel.remote_call(name, args, kwargs)
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 194, in remote_call
    result = self.send_recv((Message.CALL.value, name, args, kwargs))
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_privsep/comm.py", line 163, in send_recv
    self.writer.send((myid, msg))
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_privsep/comm.py", line 55, in send
    self.writesock.sendall(buf)
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/eventlet/greenio/base.py", line 401, in sendall
    tail = self.send(data, flags)
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/eventlet/greenio/base.py", line 395, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/eventlet/greenio/base.py", line 382, in _send_loop
    return send_method(data, *args)
socket.error: [Errno 32] Broken pipe

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
Slawek Kaplonski (slaweq) wrote :

Looks like this is issue with oslo.privsep 1.31.0 which introduced parallel calls. See http://lists.openstack.org/pipermail/openstack-discuss/2019-January/001357.html for details

Ben Nemec (bnemec) wrote :

Just a note for anyone else who might be looking into this too, it looks like you need the dsvm-functional tox target to run these tests. The privsep tests get skipped in the plain functional target. I'm still working on getting a test environment for this since it requires a bit more setup than just running tox.

Ben Nemec (bnemec) on 2019-01-04
Changed in oslo.privsep:
status: New → Confirmed
importance: Undecided → Critical
assignee: nobody → Ben Nemec (bnemec)
Ben Nemec (bnemec) wrote :

The good news is I can reproduce this locally, the bad news is that I can't consistently do it when I run subsets of the functional tests which means it takes a rather long time to try anything. Given that behavior, my guess would be that there's an issue when we recycle threads in the thread pool.

I'm also trying to set up a functional test in oslo.privsep to see if I can reproduce this in a simpler environment by throwing a lot of simultaneous calls at the daemon. Struggling a bit to get the daemon started in a tox venv though.

summary: - neutron-functional tests failing with policy.json file not found
+ neutron-functional tests failing with oslo.privsep 1.31
Slawek Kaplonski (slaweq) wrote :
Download full text (3.4 KiB)

I found that problem is only with tests from neutron.tests.functional.agent.linux.test_netlink_lib.NetlinkLibTestCase module.

Every time first test from this module is run by worker, it fails with error like:

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "neutron/tests/base.py", line 150, in func
        return f(self, *args, **kwargs)
      File "neutron/tests/functional/agent/linux/test_netlink_lib.py", line 96, in test_delete_icmp_entry
        self._delete_entry(icmp_entry, remain_entries, _zone)
      File "neutron/tests/functional/agent/linux/test_netlink_lib.py", line 56, in _delete_entry
        nl_lib.delete_entries(entries=delete_entries)
      File "/opt/stack/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 214, in _wrap
        return self.channel.remote_call(name, args, kwargs)
      File "/opt/stack/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 194, in remote_call
        result = self.send_recv((Message.CALL.value, name, args, kwargs))
      File "/opt/stack/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_privsep/comm.py", line 165, in send_recv
        reply = future.result()
      File "/opt/stack/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_privsep/comm.py", line 108, in result
        raise self.error
    IOError: Premature eof waiting for privileged process

And then all other tests which are using privsep lib are failing with:

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "neutron/tests/base.py", line 150, in func
        return f(self, *args, **kwargs)
      File "neutron/tests/functional/agent/linux/test_netlink_lib.py", line 116, in test_delete_udp_entry
        self._delete_entry(udp_entry, remain_entries, _zone)
      File "neutron/tests/functional/agent/linux/test_netlink_lib.py", line 56, in _delete_entry
        nl_lib.delete_entries(entries=delete_entries)
      File "/opt/stack/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 214, in _wrap
        return self.channel.remote_call(name, args, kwargs)
      File "/opt/stack/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 194, in remote_call
        result = self.send_recv((Message.CALL.value, name, args, kwargs))
      File "/opt/stack/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_privsep/comm.py", line 163, in send_recv
        self.writer.send((myid, msg))
      File "/opt/stack/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/oslo_privsep/comm.py", line 55, in send
        self.writesock.sendall(buf)
      File "/opt/stack/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/eventlet/greenio/base.py", line 401, in sendall
        tail = self.send(data, flags)
      File "/opt/stack/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/eventlet/greenio/base.py", line 395, in send
        return self._send_loop(self.fd.send, data, flags)
      File "/opt/stack/neutron/.tox/dsvm-functional...

Read more...

Slawek Kaplonski (slaweq) wrote :

I checked locally and this issue is caused clearly by https://github.com/openstack/oslo.privsep/commit/f368430f13183c3d09d68c80184d854449da2e87 in oslo.privsep.
When I reverted it, tests passed.

Ben Nemec (bnemec) wrote :

Okay, that's very helpful. It looks like privsep-helper is segfaulting in libnetfilter_conntrack:

privsep-helper[2985]: segfault at 78001b60 ip 00007f97775e70b6 sp 00007f977c98a468 error 6 in libnetfilter_conntrack.so.3.7.0[7f97775de000+1e000]

That kills the privsep daemon and all subsequent calls fail with the broken pipe error.

Ben Nemec (bnemec) wrote :

Looking a bit further, it seems like the C callback being used in the netlink_lib module in Neutron my not play nicely with threads. I've found some suggestions that when a C callback is used with a Python thread that the Python thread needs to be registered somehow. Still digging on that, but wanted to provide an update in case anyone more familiar with this than me has a suggestion.

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/628439
Reason: Already changed in requirements repo

Slawek Kaplonski (slaweq) wrote :

I just found that when I changed method used to start PrivsepDaemon from Method.ROOTWRAP to Method.FORK (in https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/priv_context.py#L218) and run test as root, then tests were passed.

Otherwise test is failing with segfault exactly in libnetfilter_conntrack.so when it's called from: https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/linux/netlink_lib.py#L191

I don't think it's matter of keeping reference to C callback. I know that You were trying something like that and it wasn't successful. I was also trying to e.g. disable gc in this module and I had exactly the same issue still.

Fix proposed to branch: master
Review: https://review.openstack.org/629335

Changed in neutron:
assignee: Slawek Kaplonski (slaweq) → Ben Nemec (bnemec)
status: Confirmed → In Progress
Ben Nemec (bnemec) on 2019-01-09
Changed in oslo.privsep:
status: Confirmed → Invalid

Reviewed: https://review.openstack.org/629335
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c6d358d4c6926638fe9d5194e3da112c2750c6a4
Submitter: Zuul
Branch: master

commit c6d358d4c6926638fe9d5194e3da112c2750c6a4
Author: Ben Nemec <email address hidden>
Date: Tue Jan 8 23:59:10 2019 +0000

    Define types for C calls in netlink_lib

    Previously this was not done, which meant all arguments were
    assumed to be ints. As long as we didn't get any large pointer
    addresses this worked fine, but for some reason the addition of
    threading to oslo.privsep triggered larger addresses that were then
    truncated. This caused segfaults in the underlying C library because
    we were passing it invalid pointers.

    This change sets argument and return types for all of the calls
    that are used in the module.

    Change-Id: I9dbbb01eac8cf7cded7794f8ba69797d6357eeec
    Closes-Bug: 1810518

Changed in neutron:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers