A privsep daemon spawned by neutron-openvswitch-agent hangs when debug logging is enabled (large number of registered NICs) - an RPC response is too large for msgpack
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Neutron Open vSwitch Charm | Invalid | Undecided | Unassigned |
Ubuntu Cloud Archive | Fix Released | Medium | Unassigned |
Ubuntu Cloud Archive (Ussuri) | Fix Released | Medium | Unassigned |
Ubuntu Cloud Archive (Victoria) | Fix Released | Medium | Unassigned |
neutron | Fix Released | Medium | Rodolfo Alonso |
oslo.privsep | Fix Released | Undecided | Unassigned |
neutron (Ubuntu) | Fix Released | Medium | Unassigned |
neutron (Ubuntu Focal) | Fix Released | Medium | Unassigned |
neutron (Ubuntu Groovy) | Fix Released | Medium | Unassigned |
neutron (Ubuntu Hirsute) | Fix Released | Medium | Unassigned |
neutron (Ubuntu Jammy) | New | Undecided | Unassigned |
python-oslo.privsep (Ubuntu) | New | Undecided | Unassigned |
python-oslo.privsep (Ubuntu Focal) | New | Undecided | Unassigned |
python-oslo.privsep (Ubuntu Jammy) | New | Undecided | Unassigned |
Bug Description
[Impact]
When there is a large amount of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent hangs: the privsep daemon it spawns returns an RPC response that is too large for msgpack, and the agent never receives it.
The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agent on hosts with a large number of registered NICs.
The issue is summarized in detail in comment #5 https:/
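A minimal sketch of the failure mode named in the title, assuming only the standard msgpack library; this is not oslo.privsep's actual code, and the buffer limit and payload size are invented for illustration:

    import msgpack

    MAX_BUFFER = 1024 * 1024  # hypothetical reader-side limit (1 MiB)

    # Stand-in for a privsep reply carrying very verbose output, e.g. the
    # equivalent of 'ip addr show' for hundreds of netdevs with debug enabled.
    big_reply = msgpack.packb({"stdout": "x" * (2 * MAX_BUFFER)})

    unpacker = msgpack.Unpacker(raw=False, max_buffer_size=MAX_BUFFER)
    try:
        unpacker.feed(big_reply)  # the reply does not fit in the unpacker's buffer
        for msg in unpacker:
            print(len(msg["stdout"]))
    except msgpack.exceptions.BufferFull:
        print("reply exceeds the msgpack buffer limit")

In the failure described here the observable symptom is simply a hang: the oversized reply is never successfully consumed, so the caller keeps waiting for a response that never arrives.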
[Test Plan]
* deploy OpenStack Train/Ussuri/
* need at least one compute host
* enable neutron debug logging
* create a large number of interfaces on your compute host so that 'ip addr show' produces a large output
* for ((i=0;i<400;i++)); do ip tuntap add mode tap tap-`uuidgen| cut -c1-11`; done
* create a single vm
* add floating ip
* ping fip
* create 20 ports and attach them to the vm
* for ((i=0;i<20;i++)); do id=`uuidgen`; openstack port create --network private --security-group __SG__ X-$id; openstack server add port __VM__ X-$id; done
* attaching ports should not result in errors
[Where problems could occur]
No problems are anticipated with this patchset.
=======
When there is a large amount of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent hangs: the privsep daemon it spawns returns an RPC response that is too large for msgpack, and the agent never receives it.
The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agent on hosts with a large number of registered NICs.
The issue is summarized in detail in comment #5 https:/
=======
Old Description
While trying to debug a different issue, I encountered a situation where privsep hangs in the process of handling a request from neutron-openvswitch-agent.
https:/
https:/
The issue gets reproduced reliably in the environment where I encountered it, on all units. As a result, neutron-openvswitch-agent does not become functional on those units.
The processes, though, are shown as "active (running)" by systemd, which adds to the confusion since they do indeed start from systemd's perspective (the toy sketch after the status output below illustrates why).
systemctl --no-pager status neutron-openvswitch-agent
● neutron-openvswitch-agent.service
   Loaded: loaded (/lib/systemd/
   Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
 Main PID: 247772 (/usr/bin/python)
    Tasks: 4 (limit: 9830)
   CGroup: /system.slice/neutron-openvswitch-agent.service
           ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent
           └─248272 /usr/bin/python3 /usr/bin/privsep-helper
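To see why a wedged agent still reports as healthy, here is a toy Python sketch (not neutron or oslo.privsep code; all names are illustrative): one thread blocks forever on a socket read while the process as a whole stays alive, which is all systemd checks for.

    import socket
    import threading
    import time

    # Stands in for the unix socket between the agent and its privsep daemon.
    client_sock, daemon_sock = socket.socketpair()

    def wait_for_reply():
        # Mirrors the recvfrom(3, ...) visible in the strace below: the read
        # never returns because the peer never delivers the (oversized) reply.
        client_sock.recv(4096)

    reply_thread = threading.Thread(target=wait_for_reply, daemon=True)
    reply_thread.start()

    # The main thread keeps running, so the process looks "active (running)"
    # from the outside even though the work it waits for never completes.
    for _ in range(3):
        print("process alive; reply thread still blocked:", reply_thread.is_alive())
        time.sleep(1)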
-------
An strace shows that the privsep daemon tries to receive input from fd 3, which is the unix socket it uses to communicate with the client. However, this is just one thread out of many spawned by the privsep daemon, so it is unlikely to be the root cause (there are 65 threads there in total, see https:/
# there is one extra neutron-
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.
860690 100000 /usr/bin/python3 /usr/bin/
248272 root /usr/bin/python3 /usr/bin/
363905 root grep --color=auto -P privsep.
root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a640
[pid 248475] futex(0x7f6a6c0
[pid 248473] futex(0x7f6a746
[pid 248453] recvfrom(3,
root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM
root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3en
u_str ESTAB 0 0 * 356522977
root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3en
description: updated
description: updated
Changed in charm-neutron-openvswitch:
  status: Incomplete → New
tags: added: seg
Changed in neutron:
  importance: Undecided → Medium
  assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Changed in neutron (Ubuntu Focal):
  importance: Undecided → Medium
  status: New → Triaged
Changed in neutron (Ubuntu Groovy):
  importance: Undecided → Medium
  status: New → Triaged
Changed in neutron (Ubuntu Hirsute):
  importance: Undecided → Medium
  status: New → Triaged
description: updated
tags: added: verification-done; removed: verification-needed
Changed in cloud-archive:
  status: Fix Committed → Fix Released
no longer affects: python-oslo.privsep (Ubuntu Groovy)
no longer affects: python-oslo.privsep (Ubuntu Hirsute)
While the bugs below describe similar symptoms, they cover the case where the privsep daemon is forked from the client process:
https://bugs.launchpad.net/oslo.privsep/+bug/1887506
https://bugzilla.redhat.com/show_bug.cgi?id=1862364
while in our case it gets started via the /usr/bin/privsep-helper script in a child process (a minimal sketch of the two start methods follows the CGroup listing below):
CGroup: /system.slice/neutron-openvswitch-agent.service
        ├─108898 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_agent.ini
        └─109233 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --pr
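As a rough illustration of that distinction, here is a hedged sketch against oslo.privsep's public API; the context name, config section and capability list are made up for the example and are not neutron's actual definitions:

    from oslo_privsep import capabilities, priv_context

    # Illustrative privsep context; neutron defines its own contexts elsewhere.
    ctx = priv_context.PrivContext(
        "example",
        cfg_section="example_privsep",
        pypath=__name__ + ".ctx",
        capabilities=[capabilities.CAP_NET_ADMIN],
    )

    # Case covered by the bugs referenced above: the daemon is forked directly
    # from the client process.
    # ctx.start(priv_context.Method.FORK)

    # Case seen here: the daemon is launched as a separate child process via
    # /usr/bin/privsep-helper (through rootwrap/sudo), which is the second
    # entry in the CGroup listing above.
    # ctx.start(priv_context.Method.ROOTWRAP)

Both start() calls are left commented out so the snippet can run without root privileges; in a real service the context is started once at daemon startup and subsequent privileged calls are proxied to the helper over the unix socket discussed earlier.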