OSError: Premature eof waiting for privileged process error in Ussuri release of Neutron

Bug #1933813 reported by Vadim Ponomarev
This bug affects 1 person
Affects: neutron | Status: Invalid | Importance: Undecided | Assigned to: Unassigned

Bug Description

After updating Neutron from Queens to Ussuri we started getting the error "OSError: Premature eof waiting for privileged process" for all Neutron agents and for different operations (create network, enable/disable DHCP, etc.). All errors are raised inside the oslo.privsep library, which was updated from version 1.27.0 to 2.1.2 [1]. The problem is intermittent and very difficult to debug; it is impossible to understand what exactly is happening. It is consistently reproduced by Tempest tests in our infrastructure.

Environment:
Neutron Ussuri, ML2 plugin + linux-bridge + custom agent HPB.

The examples of tracebacks: http://paste.openstack.org/show/807020/
(amotoki: paste.openstack.org is easier to read than the Launchpad description, so I moved the example traceback to paste.openstack.org.)

[1] https://github.com/openstack/oslo.privsep/compare/1.27.0...2.1.2

Revision history for this message
Herve Beraud (herveberaud) wrote :

Hello

Just out of curiosity, which version of Python are you using?

Ussuri supports two versions of oslo.privsep, depending on the Python version used.

https://opendev.org/openstack/requirements/src/branch/stable/ussuri/upper-constraints.txt#L816-L819
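
For reference, a quick way to check which Python and which oslo.privsep release a given interpreter actually loads (a sketch; run it with the same interpreter the agents use, pkg_resources comes with setuptools):

import sys
import pkg_resources

# Run this with the same interpreter that the agent (and privsep-helper) use.
print(sys.version)
print(pkg_resources.get_distribution('oslo.privsep').version)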

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Vadim:

I was going to ask something similar to Herve. If you updated to Ussuri, maybe you are executing the agent code under Python 3 but the privsep daemon under Python 2.7.

I would locate privsep-helper on your system and check which Python version this script uses:
$ which privsep-helper
/usr/local/bin/privsep-helper
$ cat /usr/local/bin/privsep-helper
#!/usr/bin/python <-- check that this Python version is the same as the client's.
# PBR Generated from 'console_scripts'

import sys

from oslo_privsep.daemon import helper_main

if __name__ == "__main__":
    sys.exit(helper_main())

BTW, we test this oslo.privsep version in the CI and we haven't noticed the problem you are having. Did you change the oslo.privsep configuration? Does your oslo.privsep daemon user have root permissions? What do you see in the oslo.privsep daemon logs?

Regards.

Revision history for this message
Vadim Ponomarev (velizarx) wrote :

We run OpenStack in Kubernetes and use loci images [1] as base images. Inside the image we are using Python 3.6.9. Results of checking the version:

$ which privsep-helper
/var/lib/openstack/bin/privsep-helper

$ cat /var/lib/openstack/bin/privsep-helper
#!/var/lib/openstack/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from oslo_privsep.daemon import helper_main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(helper_main())

$ /var/lib/openstack/bin/python3 --version
Python 3.6.9

I understand that we are using an already old release and that, most likely, if this were a bug someone would have already found it. But it's really strange.

[1] https://github.com/sapcc/loci

Revision history for this message
Vadim Ponomarev (velizarx) wrote :

We are using the default privsep config (the [privsep] section does not exist in the neutron.conf file). In the logs we see:

2021-06-29 09:45:18,068.068 1081 INFO oslo.privsep.daemon [-] privsep daemon starting
2021-06-29 09:45:18,072.072 1081 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
2021-06-29 09:45:18,074.074 1081 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_NET_ADMIN|CAP_SYS_ADMIN|CAP_SYS_PTRACE/CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_NET_ADMIN|CAP_SYS_ADMIN|CAP_SYS_PTRACE/none
2021-06-29 09:45:19,469.469 6 INFO oslo.privsep.daemon [req-89fa26f0-4ef1-43fc-82a7-4f840c9efb19 - - - - -] Spawned new privsep daemon via rootwrap
2021-06-29 09:45:19,470.470 6 DEBUG oslo.privsep.daemon [req-89fa26f0-4ef1-43fc-82a7-4f840c9efb19 - - - - -] Accepted privsep connection to /tmp/tmp_2c2c_og/privsep.sock __init__ /var/lib/openstack/lib/python3.6/site-packages/oslo_privsep/daemon.py:371

Nothing related to the errors.

Akihiro Motoki (amotoki)
tags: added: linuxbridge privsep
Akihiro Motoki (amotoki)
description: updated
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Vadim:

What you posted is the log of the client, for example the LB agent (but it could be any other service using privileged methods). What we need are the privsep daemon logs.

Privsep spawns a daemon with root privileges. Then, all privileges except the required ones are dropped. Between the client (the LB agent, for example) and the root daemon there is a bidirectional socket used to request function executions and receive the results.
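
For context, this is roughly how a privileged function is wired up with oslo.privsep on the Neutron side (a simplified sketch, not Neutron's exact code; the real context lives under neutron/privileged/ and uses a longer capability list):

# Simplified sketch of the client/daemon split described above.
from oslo_privsep import capabilities as caps
from oslo_privsep import priv_context

# Defined in the client process; the first decorated call spawns the root
# privsep-helper daemon, which then drops everything but these capabilities.
default = priv_context.PrivContext(
    __name__,
    cfg_section='privsep',
    pypath=__name__ + '.default',
    capabilities=[caps.CAP_NET_ADMIN, caps.CAP_DAC_OVERRIDE],
)

@default.entrypoint
def list_routes():
    # The body runs inside the privileged daemon; arguments and the return
    # value travel over the unix socket between client and daemon.
    ...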

The logs you have submitted are the client logs. So far, the daemon is spawned correctly but it ends prematurely when executing a privileged method. We need to find what the error is and why it is happening.

You can reproduce it by opening a Python console, using the same user that executes the agents. Then you can import "neutron.agent.linux.ip_lib" and execute a command. For example, a trivial one (see the sketch after the list below):
>>> ip_lib.list_ip_routes(None, 4)  # retrieve all IPv4 routes in the kernel namespace

That will:
- Spawn a privsep daemon.
- Send a ping to test that the daemon is communicating well.
- Execute the method [1] --> this is when you can read the daemon logs to see if the execution is correct.
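
Put together, the steps above look like this in a console (a sketch; run it as the same user as the agent, and the route output will of course depend on the host):

>>> from neutron.agent.linux import ip_lib
>>> # The first privileged call spawns the privsep daemon, pings it, then runs
>>> # the function inside the daemon and returns the result over the socket.
>>> ip_lib.list_ip_routes(None, 4)   # all IPv4 routes in the kernel namespace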

You can also try to add new debug lines around [2]. This is where the method is executed (this is calling [1] inside the privsep root context).

So far, what I see in the logs is that the executions are failing, but I don't know the reason.

Ping me if you need more info.

Regards.

[1]https://github.com/openstack/neutron/blob/master/neutron/privileged/agent/linux/ip_lib.py#L719
[2]https://github.com/openstack/oslo.privsep/blob/f7f3349d6a4def52f810ab1728879521c12fe2d0/oslo_privsep/daemon.py#L477

Revision history for this message
Miguel Lavalle (minsel) wrote :

Marking it incomplete for now, until we receive more info from the submitter.

Changed in neutron:
status: New → Incomplete
Revision history for this message
Vadim Ponomarev (velizarx) wrote (last edit):

Rodolfo, thank you for the detailed explanation, it was really helpful. Results of testing:

root@neutron-network-agent-a-0:/# ps aux | tee
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 944 72 ? Ss 13:18 0:00 dumb-init neutron-linuxbridge-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-file /etc/neutron/linux-bridge.ini
root 6 1.6 0.0 325636 142960 ? Ss 13:18 7:35 /var/lib/openstack/bin/python3 /var/lib/openstack/bin/neutron-linuxbridge-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-file /etc/neutron/linux-bridge.ini
root 22 0.0 0.0 312524 28896 ? Sl 13:18 0:07 /var/lib/openstack/bin/python3 /var/lib/openstack/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
root 14576 0.0 0.0 19116 4120 pts/0 Ss 20:51 0:00 bash
root 14674 0.0 0.0 34488 2888 pts/0 R+ 20:54 0:00 ps aux
root 14675 0.0 0.0 4620 884 pts/0 S+ 20:54 0:00 tee

root@neutron-network-agent-a-0:/# /var/lib/openstack/bin/python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from neutron.agent.linux import ip_lib
>>> ip_lib.list_ip_routes(None, 4)
[{'table': 'main', 'source_prefix': None, 'cidr': '0.0.0.0/0', 'scope': 'global', 'device': 'eth0', 'via': '45.65.0.1', 'metric': 0}, {'table': 'main', 'source_prefix': '45.65.0.122', 'cidr': '45.65.0.0/23', 'scope': 'link', 'device': 'eth0', 'via': None, 'metric': 0}]
>>>

Now we have found that neutron-linuxbridge-agent triggers the OOM killer (and every time, the OOM killer kills the privsep daemon), and as a result we see this error. Somehow the neutron-linuxbridge-agents are slower in Ussuri than they were in Queens; as a result, we observe increased memory consumption and long synchronization, and in the end OOM kills.

We are debugging this situation now.

Revision history for this message
Miguel Lavalle (minsel) wrote :

Still waiting for more information from the submitter. I am going to draw attention to this bug in the weekly bug report, so it doesn't fall through the cracks.

Revision history for this message
Vadim Ponomarev (velizarx) wrote :

After a few days of debugging and experimentation, it became clear that the problem was multithreading. More precisely, the default thread pool size (multiprocessing.cpu_count()) is the number of cores on the host. This default behaves unexpectedly in Kubernetes environments because cpu_count() returns all the cores of the host system, while the deployment's CPU limit restricts how much of them can actually be used. In our case, the default value of thread_pool_size was 32 but the Kubernetes limit was set to 500m (0.5 CPU). As a result, 32 threads got stuck on I/O operations and blocked each other.
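
To illustrate the mismatch (a minimal sketch, assuming a cgroup v1 node; the /sys/fs/cgroup paths differ under cgroup v2):

import multiprocessing

# Reports the host's core count (32 in our case), not the Kubernetes CPU limit.
print(multiprocessing.cpu_count())

# The actual limit has to be read from the cgroup; the quota is -1 when unlimited.
with open('/sys/fs/cgroup/cpu/cpu.cfs_quota_us') as f:
    quota = int(f.read())
with open('/sys/fs/cgroup/cpu/cpu.cfs_period_us') as f:
    period = int(f.read())
if quota > 0:
    print(quota / period)   # 0.5 for a 500m limit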

Now we have played with the settings and found optimal values for thread_pool_size and the Kubernetes limits. This bug can be closed because it's not a bug, it's a feature ;-) of how privsep works inside Kubernetes. Thanks for the help and the detailed explanations.
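
For anyone hitting the same issue, the knob is thread_pool_size, read from the [privsep] section of neutron.conf (a sketch; the value below is only an example, pick one that matches your CPU limit):

[privsep]
# Default is multiprocessing.cpu_count(); lower it to match the container's
# CPU limit instead of the host's core count.
thread_pool_size = 4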

Changed in neutron:
status: Incomplete → Invalid