2020-09-23 08:58:32 |
Dmitrii Shcherbakov |
bug |
|
|
added bug |
2020-09-23 09:01:52 |
Dmitrii Shcherbakov |
description |
While trying to debug a different issue, I encountered a situation where privsep hangs while handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true):
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12
The issue reproduces reliably on all units in the environment where I encountered it. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and are therefore marked dead by it.
The processes, however, are shown as "active (running)" by systemd, which adds to the confusion since they do indeed start from systemd's perspective.
systemctl --no-pager status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
Main PID: 247772 (/usr/bin/python)
Tasks: 4 (limit: 9830)
CGroup: /system.slice/neutron-openvswitch-agent.service
├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
└─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck
--------------------------------------------------------
strace output shows that the privsep daemon tries to receive input from fd 3, which is the unix socket it uses to communicate with the client:
# there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
363905 root grep --color=auto -P privsep.*openvswitch
root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248453] recvfrom(3,
root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM
root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978
* 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977
* 356522978 users:(("privsep-helper",pid=248453,fd=3))
root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM |
While trying to debug a different issue, I encountered a situation where privsep hangs while handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true):
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12
The issue reproduces reliably on all units in the environment where I encountered it. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and are therefore marked dead by it.
The processes, however, are shown as "active (running)" by systemd, which adds to the confusion since they do indeed start from systemd's perspective.
systemctl --no-pager status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
Main PID: 247772 (/usr/bin/python)
Tasks: 4 (limit: 9830)
CGroup: /system.slice/neutron-openvswitch-agent.service
├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
└─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck
--------------------------------------------------------
strace output shows that the privsep daemon tries to receive input from fd 3, which is the unix socket it uses to communicate with the client. However, this is just one thread out of many spawned by the privsep daemon, so it is unlikely to be the root cause.
# there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
363905 root grep --color=auto -P privsep.*openvswitch
root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248453] recvfrom(3,
root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM
root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978
* 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977
* 356522978 users:(("privsep-helper",pid=248453,fd=3))
root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM |
|
2020-09-23 09:04:24 |
Dmitrii Shcherbakov |
description |
While trying to debug a different issue, I encountered a situation where privsep hangs while handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true):
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12
The issue reproduces reliably on all units in the environment where I encountered it. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and are therefore marked dead by it.
The processes, however, are shown as "active (running)" by systemd, which adds to the confusion since they do indeed start from systemd's perspective.
systemctl --no-pager status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
Main PID: 247772 (/usr/bin/python)
Tasks: 4 (limit: 9830)
CGroup: /system.slice/neutron-openvswitch-agent.service
├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
└─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck
--------------------------------------------------------
strace output shows that the privsep daemon tries to receive input from fd 3, which is the unix socket it uses to communicate with the client. However, this is just one thread out of many spawned by the privsep daemon, so it is unlikely to be the root cause.
# there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
363905 root grep --color=auto -P privsep.*openvswitch
root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248453] recvfrom(3,
root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM
root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978
* 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977
* 356522978 users:(("privsep-helper",pid=248453,fd=3))
root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM |
While trying to debug a different issue, I encountered a situation where privsep hangs while handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true):
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12
The issue reproduces reliably on all units in the environment where I encountered it. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and are therefore marked dead by it.
The processes, however, are shown as "active (running)" by systemd, which adds to the confusion since they do indeed start from systemd's perspective.
systemctl --no-pager status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
Main PID: 247772 (/usr/bin/python)
Tasks: 4 (limit: 9830)
CGroup: /system.slice/neutron-openvswitch-agent.service
├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
└─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck
--------------------------------------------------------
strace output shows that the privsep daemon tries to receive input from fd 3, which is the unix socket it uses to communicate with the client. However, this is just one thread out of many spawned by the privsep daemon, so it is unlikely to be the root cause (there are 65 threads in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/).
# there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
363905 root grep --color=auto -P privsep.*openvswitch
root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248453] recvfrom(3,
root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM
root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978
* 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977
* 356522978 users:(("privsep-helper",pid=248453,fd=3))
root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM |
|
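As an aside, the blocked recvfrom(3, ...) seen in the strace output above is easy to model in isolation. The following is a minimal Python sketch of that symptom only, assuming nothing about oslo.privsep internals: two ends of a unix stream socket where neither side ever writes, so a blocking recv never returns.

# Minimal sketch (illustrative, not oslo.privsep code): both ends of a
# unix stream socket block in recv when neither side writes, mirroring
# the "recvfrom(3," the privsep helper is stuck in above.
import socket

agent_end, helper_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
helper_end.settimeout(2)  # a real daemon would block here indefinitely
try:
    helper_end.recv(4096)
except socket.timeout:
    print("helper side blocked waiting for data that never arrives")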
2020-09-23 12:16:39 |
Dmitrii Shcherbakov |
bug watch added |
|
https://bugzilla.redhat.com/show_bug.cgi?id=1862364 |
|
2020-09-24 05:54:03 |
Dmitrii Shcherbakov |
charm-neutron-openvswitch: status |
New |
Incomplete |
|
2020-09-24 14:29:20 |
Dmitrii Shcherbakov |
charm-neutron-openvswitch: status |
Incomplete |
New |
|
2020-09-24 19:42:55 |
Dmitrii Shcherbakov |
bug task added |
|
python-oslo.privsep (Ubuntu) |
|
2020-09-24 19:45:13 |
Dmitrii Shcherbakov |
summary |
A privsep daemon hangs when debug logging is enabled |
A privsep daemon spawned by neutron-openvswitch-agent hangs when debug logging is enabled (large number of registered NICs) - an RPC response is too large for msgpack |
|
2020-09-24 19:45:57 |
Dmitrii Shcherbakov |
charm-neutron-openvswitch: status |
New |
Invalid |
|
2020-09-24 19:51:57 |
Dmitrii Shcherbakov |
bug task added |
|
oslo.privsep |
|
2020-09-24 19:56:07 |
Dmitrii Shcherbakov |
description |
While trying to debug a different issue, I encountered a situation where privsep hangs while handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true):
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12
The issue reproduces reliably on all units in the environment where I encountered it. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and are therefore marked dead by it.
The processes, however, are shown as "active (running)" by systemd, which adds to the confusion since they do indeed start from systemd's perspective.
systemctl --no-pager status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
Main PID: 247772 (/usr/bin/python)
Tasks: 4 (limit: 9830)
CGroup: /system.slice/neutron-openvswitch-agent.service
├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
└─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck
--------------------------------------------------------
strace output shows that the privsep daemon tries to receive input from fd 3, which is the unix socket it uses to communicate with the client. However, this is just one thread out of many spawned by the privsep daemon, so it is unlikely to be the root cause (there are 65 threads in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/).
# there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
363905 root grep --color=auto -P privsep.*openvswitch
root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248453] recvfrom(3,
root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM
root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978
* 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977
* 356522978 users:(("privsep-helper",pid=248453,fd=3))
root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM |
When there is a large number of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang, since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle.
The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server's perspective.
The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5
========================================================================
Old Description
While trying to debug a different issue, I encountered a situation where privsep hangs while handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true):
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12
The issue reproduces reliably on all units in the environment where I encountered it. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and are therefore marked dead by it.
The processes, however, are shown as "active (running)" by systemd, which adds to the confusion since they do indeed start from systemd's perspective.
systemctl --no-pager status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
Main PID: 247772 (/usr/bin/python)
Tasks: 4 (limit: 9830)
CGroup: /system.slice/neutron-openvswitch-agent.service
├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
└─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck
--------------------------------------------------------
strace output shows that the privsep daemon tries to receive input from fd 3, which is the unix socket it uses to communicate with the client. However, this is just one thread out of many spawned by the privsep daemon, so it is unlikely to be the root cause (there are 65 threads in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/).
# there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
363905 root grep --color=auto -P privsep.*openvswitch
root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248453] recvfrom(3,
root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM
root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978
* 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977
* 356522978 users:(("privsep-helper",pid=248453,fd=3))
root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM |
|
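The msgpack limit referenced above can be demonstrated on its own. Below is a hedged sketch, not the oslo.privsep code path: it serializes a payload comparable in size to a huge RPC response and feeds it to an Unpacker whose buffer limit is smaller, so the library raises instead of yielding a message. The 1 MiB limit is illustrative only.

# Hedged sketch of the failure class: a serialized message larger than
# the receiving msgpack Unpacker's buffer limit is rejected outright.
# The 1 MiB limit below is illustrative, not oslo.privsep's exact value.
import msgpack

# Something comparable in size to a very large serialized RPC response.
big_payload = msgpack.packb(["x" * 1024] * 2048, use_bin_type=True)

unpacker = msgpack.Unpacker(raw=False, max_buffer_size=1024 * 1024)
try:
    unpacker.feed(big_payload)  # larger than max_buffer_size
    for message in unpacker:
        print("unpacked a message with", len(message), "items")
except msgpack.exceptions.BufferFull:
    print("payload exceeds the unpacker's buffer limit")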
2020-10-09 02:19:15 |
Trent Lloyd |
tags |
|
seg |
|
2020-10-09 05:54:03 |
Dominique Poulain |
bug |
|
|
added subscriber Dominique Poulain |
2020-11-25 09:55:53 |
Nobuto Murata |
bug |
|
|
added subscriber Nobuto Murata |
2021-02-24 11:52:01 |
Rodolfo Alonso |
bug task added |
|
neutron |
|
2021-02-24 11:52:13 |
Rodolfo Alonso |
neutron: importance |
Undecided |
Medium |
|
2021-02-24 11:52:17 |
Rodolfo Alonso |
neutron: assignee |
|
Rodolfo Alonso (rodolfo-alonso-hernandez) |
|
2021-03-09 09:43:06 |
Edward Hope-Morley |
neutron: status |
New |
Fix Released |
|
2021-03-09 21:39:04 |
Corey Bryant |
bug task added |
|
neutron (Ubuntu) |
|
2021-03-09 21:39:26 |
Corey Bryant |
nominated for series |
|
Ubuntu Focal |
|
2021-03-09 21:39:26 |
Corey Bryant |
bug task added |
|
neutron (Ubuntu Focal) |
|
2021-03-09 21:39:26 |
Corey Bryant |
bug task added |
|
python-oslo.privsep (Ubuntu Focal) |
|
2021-03-09 21:39:26 |
Corey Bryant |
nominated for series |
|
Ubuntu Hirsute |
|
2021-03-09 21:39:26 |
Corey Bryant |
bug task added |
|
neutron (Ubuntu Hirsute) |
|
2021-03-09 21:39:26 |
Corey Bryant |
bug task added |
|
python-oslo.privsep (Ubuntu Hirsute) |
|
2021-03-09 21:39:26 |
Corey Bryant |
nominated for series |
|
Ubuntu Groovy |
|
2021-03-09 21:39:26 |
Corey Bryant |
bug task added |
|
neutron (Ubuntu Groovy) |
|
2021-03-09 21:39:26 |
Corey Bryant |
bug task added |
|
python-oslo.privsep (Ubuntu Groovy) |
|
2021-03-09 21:40:02 |
Corey Bryant |
neutron (Ubuntu Focal): importance |
Undecided |
Medium |
|
2021-03-09 21:40:02 |
Corey Bryant |
neutron (Ubuntu Focal): status |
New |
Triaged |
|
2021-03-09 21:40:19 |
Corey Bryant |
neutron (Ubuntu Groovy): importance |
Undecided |
Medium |
|
2021-03-09 21:40:19 |
Corey Bryant |
neutron (Ubuntu Groovy): status |
New |
Triaged |
|
2021-03-09 21:40:33 |
Corey Bryant |
neutron (Ubuntu Hirsute): importance |
Undecided |
Medium |
|
2021-03-09 21:40:33 |
Corey Bryant |
neutron (Ubuntu Hirsute): status |
New |
Triaged |
|
2021-03-09 21:40:45 |
Corey Bryant |
bug task added |
|
cloud-archive |
|
2021-03-09 21:40:56 |
Corey Bryant |
nominated for series |
|
cloud-archive/ussuri |
|
2021-03-09 21:40:56 |
Corey Bryant |
bug task added |
|
cloud-archive/ussuri |
|
2021-03-09 21:40:56 |
Corey Bryant |
nominated for series |
|
cloud-archive/victoria |
|
2021-03-09 21:40:56 |
Corey Bryant |
bug task added |
|
cloud-archive/victoria |
|
2021-03-09 21:41:10 |
Corey Bryant |
cloud-archive/ussuri: importance |
Undecided |
Medium |
|
2021-03-09 21:41:10 |
Corey Bryant |
cloud-archive/ussuri: status |
New |
Triaged |
|
2021-03-09 21:41:24 |
Corey Bryant |
cloud-archive/victoria: importance |
Undecided |
Medium |
|
2021-03-09 21:41:24 |
Corey Bryant |
cloud-archive/victoria: status |
New |
Triaged |
|
2021-03-10 20:40:43 |
Launchpad Janitor |
neutron (Ubuntu Hirsute): status |
Triaged |
Fix Released |
|
2021-03-11 12:40:32 |
Łukasz Zemczak |
neutron (Ubuntu Groovy): status |
Triaged |
Fix Committed |
|
2021-03-11 12:40:40 |
Łukasz Zemczak |
bug |
|
|
added subscriber Ubuntu Stable Release Updates Team |
2021-03-11 12:40:43 |
Łukasz Zemczak |
bug |
|
|
added subscriber SRU Verification |
2021-03-11 12:40:48 |
Łukasz Zemczak |
tags |
seg |
seg verification-needed verification-needed-groovy |
|
2021-03-11 13:01:10 |
Łukasz Zemczak |
neutron (Ubuntu Focal): status |
Triaged |
Fix Committed |
|
2021-03-11 13:01:17 |
Łukasz Zemczak |
tags |
seg verification-needed verification-needed-groovy |
seg verification-needed verification-needed-focal verification-needed-groovy |
|
2021-03-11 18:07:53 |
Corey Bryant |
cloud-archive/victoria: status |
Triaged |
Fix Committed |
|
2021-03-11 18:07:56 |
Corey Bryant |
tags |
seg verification-needed verification-needed-focal verification-needed-groovy |
seg verification-needed verification-needed-focal verification-needed-groovy verification-victoria-needed |
|
2021-03-11 18:07:58 |
Corey Bryant |
cloud-archive/ussuri: status |
Triaged |
Fix Committed |
|
2021-03-18 15:34:06 |
Edward Hope-Morley |
description |
When there is a large number of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang, since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle.
The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server's perspective.
The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5
========================================================================
Old Description
While trying to debug a different issue, I encountered a situation where privsep hangs while handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true):
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12
The issue reproduces reliably on all units in the environment where I encountered it. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and are therefore marked dead by it.
The processes, however, are shown as "active (running)" by systemd, which adds to the confusion since they do indeed start from systemd's perspective.
systemctl --no-pager status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
Main PID: 247772 (/usr/bin/python)
Tasks: 4 (limit: 9830)
CGroup: /system.slice/neutron-openvswitch-agent.service
├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
└─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck
--------------------------------------------------------
strace output shows that the privsep daemon tries to receive input from fd 3, which is the unix socket it uses to communicate with the client. However, this is just one thread out of many spawned by the privsep daemon, so it is unlikely to be the root cause (there are 65 threads in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/).
# there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
363905 root grep --color=auto -P privsep.*openvswitch
root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248453] recvfrom(3,
root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM
root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978
* 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977
* 356522978 users:(("privsep-helper",pid=248453,fd=3))
root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM |
[Impact]
When there is a large number of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang, since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle.
The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server's perspective.
The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5
[Test Plan]
* deploy OpenStack Train/Ussuri/Victoria
* need at least one compute host
* enable neutron debug logging
* create a load of interfaces on your compute host to produce a large 'ip addr show' output
* for ((i=0;i<400;i++)); do ip tuntap add mode tap tap-`uuidgen| cut -c1-11`; done
* create a single vm
* add floating ip
* ping fip
* create 20 ports and attach them to the vm
* for ((i=0;i<20;i++)); do id=`uuidgen`; openstack port create --network private --security-group ab698dc1-4b87-4b46-9c38-cd53e3ec7492 X-$id; openstack server add port X-$id; done
* attaching ports should not result in errors
[Where problems could occur]
No problems are anticipated with this patchset.
========================================================================
When there is a large number of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang, since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle.
The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server's perspective.
The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5
========================================================================
Old Description
While trying to debug a different issue, I encountered a situation where privsep hangs while handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true):
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12
The issue reproduces reliably on all units in the environment where I encountered it. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and are therefore marked dead by it.
The processes, however, are shown as "active (running)" by systemd, which adds to the confusion since they do indeed start from systemd's perspective.
systemctl --no-pager status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
Main PID: 247772 (/usr/bin/python)
Tasks: 4 (limit: 9830)
CGroup: /system.slice/neutron-openvswitch-agent.service
├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
└─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck
--------------------------------------------------------
strace output shows that the privsep daemon tries to receive input from fd 3, which is the unix socket it uses to communicate with the client. However, this is just one thread out of many spawned by the privsep daemon, so it is unlikely to be the root cause (there are 65 threads in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/).
# there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
363905 root grep --color=auto -P privsep.*openvswitch
root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248453] recvfrom(3,
root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM
root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978
* 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977
* 356522978 users:(("privsep-helper",pid=248453,fd=3))
root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM |
|
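A quick way to sanity-check the test plan above is to confirm that the tap devices really do inflate 'ip addr show' beyond a megabyte-scale message size. A small sketch, with the 1 MiB threshold chosen for illustration:

# Sketch: verify the ~400 tap devices from the test plan make the
# 'ip addr show' output large enough to have tripped a megabyte-scale
# message limit. The 1 MiB threshold is illustrative.
import subprocess

out = subprocess.run(["ip", "addr", "show"],
                     capture_output=True, check=True).stdout
print(f"'ip addr show' output is {len(out)} bytes")
if len(out) > 1024 * 1024:
    print("large enough to exceed a 1 MiB message limit")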
2021-03-18 15:35:21 |
Edward Hope-Morley |
tags |
seg verification-needed verification-needed-focal verification-needed-groovy verification-victoria-needed |
seg verification-done-groovy verification-needed verification-needed-focal verification-victoria-needed |
|
2021-03-18 19:52:52 |
Edward Hope-Morley |
description |
[Impact]
When there is a large number of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang, since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle.
The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server's perspective.
The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5
[Test Plan]
* deploy OpenStack Train/Ussuri/Victoria
* need at least one compute host
* enable neutron debug logging
* create a load of interfaces on your compute host to produce a large 'ip addr show' output
* for ((i=0;i<400;i++)); do ip tuntap add mode tap tap-`uuidgen| cut -c1-11`; done
* create a single vm
* add floating ip
* ping fip
* create 20 ports and attach them to the vm
* for ((i=0;i<20;i++)); do id=`uuidgen`; openstack port create --network private --security-group ab698dc1-4b87-4b46-9c38-cd53e3ec7492 X-$id; openstack server add port X-$id; done
* attaching ports should not result in errors
[Where problems could occur]
No problems are anticipated with this patchset.
========================================================================
When there is a large number of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang, since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle.
The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server's perspective.
The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5
========================================================================
Old Description
While trying to debug a different issue, I encountered a situation where privsep hangs while handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true):
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12
The issue reproduces reliably on all units in the environment where I encountered it. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and are therefore marked dead by it.
The processes, however, are shown as "active (running)" by systemd, which adds to the confusion since they do indeed start from systemd's perspective.
systemctl --no-pager status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
Main PID: 247772 (/usr/bin/python)
Tasks: 4 (limit: 9830)
CGroup: /system.slice/neutron-openvswitch-agent.service
├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
└─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck
--------------------------------------------------------
strace output shows that the privsep daemon tries to receive input from fd 3, which is the unix socket it uses to communicate with the client. However, this is just one thread out of many spawned by the privsep daemon, so it is unlikely to be the root cause (there are 65 threads in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/).
# there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
363905 root grep --color=auto -P privsep.*openvswitch
root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248453] recvfrom(3,
root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM
root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978
* 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977
* 356522978 users:(("privsep-helper",pid=248453,fd=3))
root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM |
[Impact]
When there is a large number of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang, since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle.
The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server's perspective.
The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5
[Test Plan]
* deploy OpenStack Train/Ussuri/Victoria
* need at least one compute host
* enable neutron debug logging
* create a load of interfaces on your compute host to produce a large 'ip addr show' output
* for ((i=0;i<400;i++)); do ip tuntap add mode tap tap-`uuidgen| cut -c1-11`; done
* create a single vm
* add floating ip
* ping fip
* create 20 ports and attach them to the vm
* for ((i=0;i<20;i++)); do id=`uuidgen`; openstack port create --network private --security-group __SG__ X-$id; openstack server add port __VM__ X-$id; done
* attaching ports should not result in errors
[Where problems could occur]
No problems are anticipated with this patchset.
========================================================================
When there is a large number of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang, since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle.
The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server's perspective.
The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5
========================================================================
Old Description
While trying to debug a different issue, I encountered a situation where privsep hangs while handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true):
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12
The issue reproduces reliably on all units in the environment where I encountered it. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and are therefore marked dead by it.
The processes, however, are shown as "active (running)" by systemd, which adds to the confusion since they do indeed start from systemd's perspective.
systemctl --no-pager status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
Main PID: 247772 (/usr/bin/python)
Tasks: 4 (limit: 9830)
CGroup: /system.slice/neutron-openvswitch-agent.service
├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
└─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck
--------------------------------------------------------
strace output shows that the privsep daemon tries to receive input from fd 3, which is the unix socket it uses to communicate with the client. However, this is just one thread out of many spawned by the privsep daemon, so it is unlikely to be the root cause (there are 65 threads in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/).
# there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
363905 root grep --color=auto -P privsep.*openvswitch
root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248453] recvfrom(3,
root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM
root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978
* 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977
* 356522978 users:(("privsep-helper",pid=248453,fd=3))
root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM |
|
2021-03-18 20:06:34 |
Edward Hope-Morley |
tags |
seg verification-done-groovy verification-needed verification-needed-focal verification-victoria-needed |
seg verification-done-groovy verification-needed verification-needed-focal verification-victoria-done |
|
2021-03-18 22:08:59 |
Edward Hope-Morley |
tags |
seg verification-done-groovy verification-needed verification-needed-focal verification-victoria-done |
seg verification-done-focal verification-done-groovy verification-needed verification-victoria-done |
|
2021-03-22 19:20:53 |
Corey Bryant |
tags |
seg verification-done-focal verification-done-groovy verification-needed verification-victoria-done |
seg verification-done verification-done-focal verification-done-groovy verification-victoria-done |
|
2021-03-23 23:42:22 |
Brian Murray |
removed subscriber Ubuntu Stable Release Updates Team |
|
|
|
2021-03-23 23:43:57 |
Launchpad Janitor |
neutron (Ubuntu Focal): status |
Fix Committed |
Fix Released |
|
2021-03-23 23:52:18 |
Launchpad Janitor |
neutron (Ubuntu Groovy): status |
Fix Committed |
Fix Released |
|
2021-03-24 07:15:42 |
Chris MacNaughton |
cloud-archive/victoria: status |
Fix Committed |
Fix Released |
|
2021-03-29 14:06:12 |
Chris MacNaughton |
tags |
seg verification-done verification-done-focal verification-done-groovy verification-victoria-done |
seg verification-done verification-done-focal verification-done-groovy verification-ussuri-needed verification-victoria-done |
|
2021-03-31 18:52:55 |
Edward Hope-Morley |
tags |
seg verification-done verification-done-focal verification-done-groovy verification-ussuri-needed verification-victoria-done |
seg verification-done verification-done-focal verification-done-groovy verification-ussuri-done verification-victoria-done |
|
2021-04-01 12:55:53 |
Chris MacNaughton |
cloud-archive/ussuri: status |
Fix Committed |
Fix Released |
|
2021-04-06 13:34:50 |
Corey Bryant |
cloud-archive: status |
Fix Committed |
Fix Released |
|
2021-07-21 13:44:50 |
Christian Rohmann |
bug |
|
|
added subscriber Christian Rohmann |
2021-07-28 23:18:09 |
Brian Murray |
python-oslo.privsep (Ubuntu Groovy): status |
New |
Won't Fix |
|
2021-12-01 07:07:19 |
Dr. Jens Harbott |
bug |
|
|
added subscriber Dr. Jens Harbott |
2021-12-01 07:25:55 |
OpenStack Infra |
oslo.privsep: status |
New |
In Progress |
|
2022-03-22 14:10:11 |
OpenStack Infra |
oslo.privsep: status |
In Progress |
Fix Released |
|
2023-08-04 12:37:41 |
Edward Hope-Morley |
nominated for series |
|
Ubuntu Jammy |
|
2023-08-04 12:37:41 |
Edward Hope-Morley |
bug task added |
|
neutron (Ubuntu Jammy) |
|
2023-08-04 12:37:41 |
Edward Hope-Morley |
bug task added |
|
python-oslo.privsep (Ubuntu Jammy) |
|
2023-08-04 12:38:09 |
Edward Hope-Morley |
bug task deleted |
python-oslo.privsep (Ubuntu Groovy) |
|
|
2023-08-04 12:38:14 |
Edward Hope-Morley |
bug task deleted |
python-oslo.privsep (Ubuntu Hirsute) |
|
|