OSError: Premature eof waiting for privileged process error in Ussuri release of Neutron

Bug #1933813 reported by Vadim Ponomarev
This bug affects 1 person
Affects: neutron | Status: Invalid | Importance: Undecided | Assigned to: Unassigned

Bug Description

After updating Neutron from Queens to Ussuri we started getting the error "OSError: Premature eof waiting for privileged process" for all Neutron agents and for different operations (create network, enable/disable DHCP, etc.). All errors are raised inside the oslo.privsep library, which was updated from version 1.27.0 to 2.1.2 [1]. The problem is intermittent and very difficult to debug; it is impossible to understand what exactly is happening. It is consistently reproduced by Tempest tests in our infrastructure.

Environment:
Neutron Ussuri, ML2 plugin + linux-bridge + custom agent HPB.

The examples of tracebacks: http://paste.openstack.org/show/807020/
(amotoki: paste.openstack.org is easier to read than the Launchpad description, so I moved the example traceback to paste.openstack.org.)

[1] https://github.com/openstack/oslo.privsep/compare/1.27.0...2.1.2

Revision history for this message
Herve Beraud (herveberaud) wrote :

Hello

Just out of curiosity, which version of Python are you using?

Ussuri supports two versions of oslo.privsep, depending on the Python version used.

https://opendev.org/openstack/requirements/src/branch/stable/ussuri/upper-constraints.txt#L816-L819
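
For reference, a quick way to check which Python and which oslo.privsep release a given interpreter actually loads (a sketch; run it with the same interpreter the agents use, pkg_resources comes with setuptools):

import sys
import pkg_resources

# Run this with the same interpreter that the agent (and privsep-helper) use.
print(sys.version)
print(pkg_resources.get_distribution('oslo.privsep').version)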

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Vadim:

I was going to ask something similar to Herve. If you updated to Ussuri, maybe you are executing the agent code under Python 3 but the privsep daemon under Python 2.7.

I would locate privsep-helper on your system and check which Python version this script uses:
$ which privsep-helper
/usr/local/bin/privsep-helper
$ cat /usr/local/bin/privsep-helper
#!/usr/bin/python <-- check that this Python version is the same as the client's.
# PBR Generated from 'console_scripts'

import sys

from oslo_privsep.daemon import helper_main

if __name__ == "__main__":
    sys.exit(helper_main())

BTW, we test this oslo.privsep version in the CI and we haven't noticed the problem you are having. Did you change the oslo.privsep configuration? Does your oslo.privsep daemon user have root permissions? What do you see in the oslo.privsep daemon logs?

Regards.

Revision history for this message
Vadim Ponomarev (velizarx) wrote :

We run OpenStack in Kubernetes and use loci images [1] as base images. Inside the image we are using Python 3.6.9. Results of checking the version:

$ which privsep-helper
/var/lib/openstack/bin/privsep-helper

$ cat /var/lib/openstack/bin/privsep-helper
#!/var/lib/openstack/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from oslo_privsep.daemon import helper_main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(helper_main())

$ /var/lib/openstack/bin/python3 --version
Python 3.6.9

I understand that we are using an already old release and that, most likely, if this were a bug someone would have already found it. But it's really strange.

[1] https://github.com/sapcc/loci

Revision history for this message
Vadim Ponomarev (velizarx) wrote :

We are using the default privsep config (the [privsep] section does not exist in the neutron.conf file). In the logs we see:

2021-06-29 09:45:18,068.068 1081 INFO oslo.privsep.daemon [-] privsep daemon starting
2021-06-29 09:45:18,072.072 1081 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
2021-06-29 09:45:18,074.074 1081 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_NET_ADMIN|CAP_SYS_ADMIN|CAP_SYS_PTRACE/CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_NET_ADMIN|CAP_SYS_ADMIN|CAP_SYS_PTRACE/none
2021-06-29 09:45:19,469.469 6 INFO oslo.privsep.daemon [req-89fa26f0-4ef1-43fc-82a7-4f840c9efb19 - - - - -] Spawned new privsep daemon via rootwrap
2021-06-29 09:45:19,470.470 6 DEBUG oslo.privsep.daemon [req-89fa26f0-4ef1-43fc-82a7-4f840c9efb19 - - - - -] Accepted privsep connection to /tmp/tmp_2c2c_og/privsep.sock __init__ /var/lib/openstack/lib/python3.6/site-packages/oslo_privsep/daemon.py:371

Nothing related to the errors.

Akihiro Motoki (amotoki)
tags: added: linuxbridge privsep
Akihiro Motoki (amotoki)
description: updated
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Vadim:

What you posted is the log of the client, for example the LB agent (but it could be any other service using privileged methods). What we need are the privsep daemon logs.

Privsep spawns a daemon with root privileges. Then, all privileges except the required ones are dropped. Between the client (the LB agent, for example) and the root daemon there is a bidirectional socket used to request function executions and receive the results.
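
For context, this is roughly how a privileged function is wired up with oslo.privsep on the Neutron side (a simplified sketch, not Neutron's exact code; the real context lives under neutron/privileged/ and uses a longer capability list):

# Simplified sketch of the client/daemon split described above.
from oslo_privsep import capabilities as caps
from oslo_privsep import priv_context

# Defined in the client process; the first decorated call spawns the root
# privsep-helper daemon, which then drops everything but these capabilities.
default = priv_context.PrivContext(
    __name__,
    cfg_section='privsep',
    pypath=__name__ + '.default',
    capabilities=[caps.CAP_NET_ADMIN, caps.CAP_DAC_OVERRIDE],
)

@default.entrypoint
def list_routes():
    # The body runs inside the privileged daemon; arguments and the return
    # value travel over the unix socket between client and daemon.
    ...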

The logs you have submitted are the client logs. So far, the daemon is spawned correctly but it ends prematurely when executing a privileged method. We need to find what the error is and why it is happening.

You can reproduce it by opening a Python console, using the same user that executes the agents. Then you can import "neutron.agent.linux.ip_lib" and execute a command. For example, a trivial one (see the sketch after the list below):
>>> ip_lib.list_ip_routes(None, 4)  # retrieve all IPv4 routes in the kernel namespace

That will:
- Spawn a privsep daemon.
- Send a ping to test that the daemon is communicating well.
- Execute the method [1] --> this is when you can read the daemon logs to see if the execution is correct.
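
Put together, the steps above look like this in a console (a sketch; run it as the same user as the agent, and the route output will of course depend on the host):

>>> from neutron.agent.linux import ip_lib
>>> # The first privileged call spawns the privsep daemon, pings it, then runs
>>> # the function inside the daemon and returns the result over the socket.
>>> ip_lib.list_ip_routes(None, 4)   # all IPv4 routes in the kernel namespace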

You can also try to add new debug lines around [2]. This is where the method is executed (this is calling [1] inside the privsep root context).

So far, what I see in the logs is that the executions are failing, but I don't know the reason.

Ping me if you need more info.

Regards.

[1]https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/linux/ip_lib.py#L719
[2]https://github.com/openstack/oslo.privsep/blob/f7f3349d6a4def52f810ab1728879521c12fe2d0/oslo_privsep/daemon.py#L477

Revision history for this message
Miguel Lavalle (minsel) wrote :

Marking it incomplete for now, until we receive more info from the submitter.

Changed in neutron:
status: New → Incomplete
Revision history for this message
Vadim Ponomarev (velizarx) wrote (last edit):

Rodolfo, thank you for the detailed explanation, it was really helpful. Results of testing:

root@neutron-network-agent-a-0:/# ps aux | tee
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 944 72 ? Ss 13:18 0:00 dumb-init neutron-linuxbridge-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-file /etc/neutron/linux-bridge.ini
root 6 1.6 0.0 325636 142960 ? Ss 13:18 7:35 /var/lib/openstack/bin/python3 /var/lib/openstack/bin/neutron-linuxbridge-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-file /etc/neutron/linux-bridge.ini
root 22 0.0 0.0 312524 28896 ? Sl 13:18 0:07 /var/lib/openstack/bin/python3 /var/lib/openstack/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
root 14576 0.0 0.0 19116 4120 pts/0 Ss 20:51 0:00 bash
root 14674 0.0 0.0 34488 2888 pts/0 R+ 20:54 0:00 ps aux
root 14675 0.0 0.0 4620 884 pts/0 S+ 20:54 0:00 tee

root@neutron-network-agent-a-0:/# /var/lib/openstack/bin/python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from neutron.agent.linux import ip_lib
>>> ip_lib.list_ip_routes(None, 4)
[{'table': 'main', 'source_prefix': None, 'cidr': '0.0.0.0/0', 'scope': 'global', 'device': 'eth0', 'via': '45.65.0.1', 'metric': 0}, {'table': 'main', 'source_prefix': '45.65.0.122', 'cidr': '45.65.0.0/23', 'scope': 'link', 'device': 'eth0', 'via': None, 'metric': 0}]
>>>

Now we have found that neutron-linuxbridge-agent triggers the OOM killer (and every time, the OOM killer kills the privsep daemon), and as a result we see this error. Somehow the neutron-linuxbridge-agents are slower in Ussuri than they were in Queens; as a result, we observe increased memory consumption and long synchronization, and in the end OOM kills.

We are debugging this situation now.

Revision history for this message
Miguel Lavalle (minsel) wrote :

Still waiting for more information from the submitter. I am going to draw attention to this bug in the weekly bug report, so it doesn't fall through the cracks.

Revision history for this message
Vadim Ponomarev (velizarx) wrote :

After a few days of debugging and experimentation, it became clear that the problem was multithreading. More precisely, the default thread pool size (multiprocessing.cpu_count()) is the number of cores on the host. This default behaves unexpectedly in Kubernetes environments because cpu_count() returns all the cores of the host system, while the deployment's CPU limit restricts how much of them can actually be used. In our case, the default value of thread_pool_size was 32 but the Kubernetes limit was set to 500m (0.5 CPU). As a result, 32 threads got stuck on I/O operations and blocked each other.
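
To illustrate the mismatch (a minimal sketch, assuming a cgroup v1 node; the /sys/fs/cgroup paths differ under cgroup v2):

import multiprocessing

# Reports the host's core count (32 in our case), not the Kubernetes CPU limit.
print(multiprocessing.cpu_count())

# The actual limit has to be read from the cgroup; the quota is -1 when unlimited.
with open('/sys/fs/cgroup/cpu/cpu.cfs_quota_us') as f:
    quota = int(f.read())
with open('/sys/fs/cgroup/cpu/cpu.cfs_period_us') as f:
    period = int(f.read())
if quota > 0:
    print(quota / period)   # 0.5 for a 500m limit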

Now we have played with the settings and found optimal values for thread_pool_size and the Kubernetes limits. This bug can be closed because it's not a bug, it's a feature ;-) of how privsep works inside Kubernetes. Thanks for the help and the detailed explanations.
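
For anyone hitting the same issue, the knob is thread_pool_size, read from the [privsep] section of neutron.conf (a sketch; the value below is only an example, pick one that matches your CPU limit):

[privsep]
# Default is multiprocessing.cpu_count(); lower it to match the container's
# CPU limit instead of the host's core count.
thread_pool_size = 4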

Changed in neutron:
status: Incomplete → Invalid