Activity log for bug #1896734

Date Who What changed Old value → New value Message
2020-09-23 08:58:32 Dmitrii Shcherbakov bug added bug
2020-09-23 09:01:52 Dmitrii Shcherbakov description While trying to debug a different issue, I encountered a situation where privsep hangs in the process of handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true): https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11 https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12 The issue gets reproduced reliably in the environment where I encountered it on all units. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and thus are marked dead by it. The processes though are shown as "active (running)" by systemd which adds to the confusion since they do indeed start from the systemd's perspective. systemctl --no-pager status neutron-openvswitch-agent.service ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled) Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago Main PID: 247772 (/usr/bin/python) Tasks: 4 (limit: 9830) CGroup: /system.slice/neutron-openvswitch-agent.service ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck -------------------------------------------------------- An strace shows that the privsep daemon tries to receive input from fd 3 which is the unix socket it uses to communicate with the client: # there is one extra neutron-openvvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate) root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch' 860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock 248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock 363905 root grep --color=auto -P privsep.*openvswitch root@node2:~# strace -f -p 248453 2>&1 [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248453] recvfrom(3, root@node2:~# lsof -p 248453 | grep 3u privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM root@node2:~# ss -pax | grep 356522977 u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978 * 356522977 users:(("/usr/bin/python",pid=247567,fd=16)) u_str ESTAB 0 0 * 356522977 * 356522978 users:(("privsep-helper",pid=248453,fd=3)) root@node2:~# lsof -p 247567 | grep 16u /usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 
/tmp/tmp2afa3enn/privsep.sock type=STREAM While trying to debug a different issue, I encountered a situation where privsep hangs in the process of handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true): https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11 https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12 The issue gets reproduced reliably in the environment where I encountered it on all units. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and thus are marked dead by it. The processes though are shown as "active (running)" by systemd which adds to the confusion since they do indeed start from the systemd's perspective. systemctl --no-pager status neutron-openvswitch-agent.service ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent    Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)    Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago  Main PID: 247772 (/usr/bin/python)     Tasks: 4 (limit: 9830)    CGroup: /system.slice/neutron-openvswitch-agent.service            ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og            └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck -------------------------------------------------------- An strace shows that the privsep daemon tries to receive input from fd 3 which is the unix socket it uses to communicate with the client. However, this is just one tread out of many spawned by the privsep daemon so it is unlikely to be the root cause. 
# there is one extra neutron-openvvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate) root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'  860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock  248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock  363905 root grep --color=auto -P privsep.*openvswitch root@node2:~# strace -f -p 248453 2>&1 [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248453] recvfrom(3, root@node2:~# lsof -p 248453 | grep 3u privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM root@node2:~# ss -pax | grep 356522977 u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978                                   * 356522977 users:(("/usr/bin/python",pid=247567,fd=16)) u_str ESTAB 0 0 * 356522977                                   * 356522978 users:(("privsep-helper",pid=248453,fd=3)) root@node2:~# lsof -p 247567 | grep 16u /usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM
2020-09-23 09:04:24 Dmitrii Shcherbakov description While trying to debug a different issue, I encountered a situation where privsep hangs in the process of handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true): https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11 https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12 The issue gets reproduced reliably in the environment where I encountered it on all units. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and thus are marked dead by it. The processes though are shown as "active (running)" by systemd which adds to the confusion since they do indeed start from the systemd's perspective. systemctl --no-pager status neutron-openvswitch-agent.service ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent    Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)    Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago  Main PID: 247772 (/usr/bin/python)     Tasks: 4 (limit: 9830)    CGroup: /system.slice/neutron-openvswitch-agent.service            ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og            └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck -------------------------------------------------------- An strace shows that the privsep daemon tries to receive input from fd 3 which is the unix socket it uses to communicate with the client. However, this is just one tread out of many spawned by the privsep daemon so it is unlikely to be the root cause. 
# there is one extra neutron-openvvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate) root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'  860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock  248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock  363905 root grep --color=auto -P privsep.*openvswitch root@node2:~# strace -f -p 248453 2>&1 [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248453] recvfrom(3, root@node2:~# lsof -p 248453 | grep 3u privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM root@node2:~# ss -pax | grep 356522977 u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978                                   * 356522977 users:(("/usr/bin/python",pid=247567,fd=16)) u_str ESTAB 0 0 * 356522977                                   * 356522978 users:(("privsep-helper",pid=248453,fd=3)) root@node2:~# lsof -p 247567 | grep 16u /usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM While trying to debug a different issue, I encountered a situation where privsep hangs in the process of handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true): https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11 https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12 The issue gets reproduced reliably in the environment where I encountered it on all units. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and thus are marked dead by it. The processes though are shown as "active (running)" by systemd which adds to the confusion since they do indeed start from the systemd's perspective. 
systemctl --no-pager status neutron-openvswitch-agent.service ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent    Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)    Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago  Main PID: 247772 (/usr/bin/python)     Tasks: 4 (limit: 9830)    CGroup: /system.slice/neutron-openvswitch-agent.service            ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og            └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck -------------------------------------------------------- An strace shows that the privsep daemon tries to receive input from fd 3 which is the unix socket it uses to communicate with the client. However, this is just one tread out of many spawned by the privsep daemon so it is unlikely to be the root cause (there are 65 threads there in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/) # there is one extra neutron-openvvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate) root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'  860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock  248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock  363905 root grep --color=auto -P privsep.*openvswitch root@node2:~# strace -f -p 248453 2>&1 [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248453] recvfrom(3, root@node2:~# lsof -p 248453 | grep 3u privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM root@node2:~# ss -pax | grep 356522977 u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978                                   * 356522977 users:(("/usr/bin/python",pid=247567,fd=16)) u_str ESTAB 0 0 * 356522977                                   * 356522978 users:(("privsep-helper",pid=248453,fd=3)) root@node2:~# lsof -p 247567 | grep 16u /usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM
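As an aside on the thread count mentioned above: a quick way to confirm how many threads a running privsep-helper has spawned, without strace or a paste, is to count its kernel tasks under /proc. The short Python sketch below assumes the privsep-helper PID from the output above (248453) purely as an example; substitute the PID from your own host.

    import os

    pid = 248453  # hypothetical example PID taken from the log output above
    # Each entry under /proc/<pid>/task is one thread of that process.
    threads = os.listdir("/proc/%d/task" % pid)
    print("privsep-helper %d currently has %d threads" % (pid, len(threads)))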
2020-09-23 12:16:39 Dmitrii Shcherbakov bug watch added https://bugzilla.redhat.com/show_bug.cgi?id=1862364
2020-09-24 05:54:03 Dmitrii Shcherbakov charm-neutron-openvswitch: status New → Incomplete
2020-09-24 14:29:20 Dmitrii Shcherbakov charm-neutron-openvswitch: status Incomplete → New
2020-09-24 19:42:55 Dmitrii Shcherbakov bug task added python-oslo.privsep (Ubuntu)
2020-09-24 19:45:13 Dmitrii Shcherbakov summary A privsep daemon hangs when debug logging is enabled → A privsep daemon spawned by neutron-openvswitch-agent hangs when debug logging is enabled (large number of registered NICs) - an RPC response is too large for msgpack
2020-09-24 19:45:57 Dmitrii Shcherbakov charm-neutron-openvswitch: status New → Invalid
2020-09-24 19:51:57 Dmitrii Shcherbakov bug task added oslo.privsep
2020-09-24 19:56:07 Dmitrii Shcherbakov description While trying to debug a different issue, I encountered a situation where privsep hangs in the process of handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true): https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11 https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12 The issue gets reproduced reliably in the environment where I encountered it on all units. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and thus are marked dead by it. The processes though are shown as "active (running)" by systemd which adds to the confusion since they do indeed start from the systemd's perspective. systemctl --no-pager status neutron-openvswitch-agent.service ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent    Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)    Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago  Main PID: 247772 (/usr/bin/python)     Tasks: 4 (limit: 9830)    CGroup: /system.slice/neutron-openvswitch-agent.service            ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og            └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck -------------------------------------------------------- An strace shows that the privsep daemon tries to receive input from fd 3 which is the unix socket it uses to communicate with the client. 
However, this is just one tread out of many spawned by the privsep daemon so it is unlikely to be the root cause (there are 65 threads there in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/) # there is one extra neutron-openvvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate) root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'  860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock  248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock  363905 root grep --color=auto -P privsep.*openvswitch root@node2:~# strace -f -p 248453 2>&1 [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248453] recvfrom(3, root@node2:~# lsof -p 248453 | grep 3u privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM root@node2:~# ss -pax | grep 356522977 u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978                                   * 356522977 users:(("/usr/bin/python",pid=247567,fd=16)) u_str ESTAB 0 0 * 356522977                                   * 356522978 users:(("privsep-helper",pid=248453,fd=3)) root@node2:~# lsof -p 247567 | grep 16u /usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM When there is a large amount of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle. The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server perspective. The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5 ======================================================================== Old Description While trying to debug a different issue, I encountered a situation where privsep hangs in the process of handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true): https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11 https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12 The issue gets reproduced reliably in the environment where I encountered it on all units. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and thus are marked dead by it. The processes though are shown as "active (running)" by systemd which adds to the confusion since they do indeed start from the systemd's perspective. 
systemctl --no-pager status neutron-openvswitch-agent.service ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent    Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)    Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago  Main PID: 247772 (/usr/bin/python)     Tasks: 4 (limit: 9830)    CGroup: /system.slice/neutron-openvswitch-agent.service            ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og            └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck -------------------------------------------------------- An strace shows that the privsep daemon tries to receive input from fd 3 which is the unix socket it uses to communicate with the client. However, this is just one tread out of many spawned by the privsep daemon so it is unlikely to be the root cause (there are 65 threads there in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/) # there is one extra neutron-openvvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate) root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'  860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock  248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock  363905 root grep --color=auto -P privsep.*openvswitch root@node2:~# strace -f -p 248453 2>&1 [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248453] recvfrom(3, root@node2:~# lsof -p 248453 | grep 3u privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM root@node2:~# ss -pax | grep 356522977 u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978                                   * 356522977 users:(("/usr/bin/python",pid=247567,fd=16)) u_str ESTAB 0 0 * 356522977                                   * 356522978 users:(("privsep-helper",pid=248453,fd=3)) root@node2:~# lsof -p 247567 | grep 16u /usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM
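The renamed summary and updated description point at msgpack message-size limits as the trigger. A minimal illustration of that failure mode, independent of Neutron and oslo.privsep, is sketched below in Python: an oversized serialized reply is fed into a msgpack Unpacker whose buffer is capped at 1 MiB, and the feed raises BufferFull. The 1 MiB cap and the synthetic payload are assumptions for illustration only, not the exact values involved in the privsep daemon.

    import msgpack

    # Synthetic stand-in for a large privsep RPC reply, e.g. serialized
    # link/address data for several hundred netdevs (values are made up).
    reply = [("tap-%03d: " % i) + "addr/stats blob " * 300 for i in range(400)]
    payload = msgpack.packb(reply, use_bin_type=True)

    # An Unpacker with a deliberately small buffer, mimicking a receiver that
    # cannot accommodate the whole reply.
    unpacker = msgpack.Unpacker(raw=False, max_buffer_size=1024 * 1024)
    try:
        unpacker.feed(payload)   # raises BufferFull once the cap is exceeded
        print(next(unpacker)[0])
    except msgpack.exceptions.BufferFull:
        print("%d-byte reply exceeds the unpacker's 1 MiB buffer" % len(payload))

In the bug described above, the analogous condition means the oversized reply is never successfully exchanged, and both the agent and the privsep daemon end up waiting indefinitely.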
2020-10-09 02:19:15 Trent Lloyd tags seg
2020-10-09 05:54:03 Dominique Poulain bug added subscriber Dominique Poulain
2020-11-25 09:55:53 Nobuto Murata bug added subscriber Nobuto Murata
2021-02-24 11:52:01 Rodolfo Alonso bug task added neutron
2021-02-24 11:52:13 Rodolfo Alonso neutron: importance Undecided → Medium
2021-02-24 11:52:17 Rodolfo Alonso neutron: assignee Rodolfo Alonso (rodolfo-alonso-hernandez)
2021-03-09 09:43:06 Edward Hope-Morley neutron: status New → Fix Released
2021-03-09 21:39:04 Corey Bryant bug task added neutron (Ubuntu)
2021-03-09 21:39:26 Corey Bryant nominated for series Ubuntu Focal
2021-03-09 21:39:26 Corey Bryant bug task added neutron (Ubuntu Focal)
2021-03-09 21:39:26 Corey Bryant bug task added python-oslo.privsep (Ubuntu Focal)
2021-03-09 21:39:26 Corey Bryant nominated for series Ubuntu Hirsute
2021-03-09 21:39:26 Corey Bryant bug task added neutron (Ubuntu Hirsute)
2021-03-09 21:39:26 Corey Bryant bug task added python-oslo.privsep (Ubuntu Hirsute)
2021-03-09 21:39:26 Corey Bryant nominated for series Ubuntu Groovy
2021-03-09 21:39:26 Corey Bryant bug task added neutron (Ubuntu Groovy)
2021-03-09 21:39:26 Corey Bryant bug task added python-oslo.privsep (Ubuntu Groovy)
2021-03-09 21:40:02 Corey Bryant neutron (Ubuntu Focal): importance Undecided → Medium
2021-03-09 21:40:02 Corey Bryant neutron (Ubuntu Focal): status New → Triaged
2021-03-09 21:40:19 Corey Bryant neutron (Ubuntu Groovy): importance Undecided → Medium
2021-03-09 21:40:19 Corey Bryant neutron (Ubuntu Groovy): status New → Triaged
2021-03-09 21:40:33 Corey Bryant neutron (Ubuntu Hirsute): importance Undecided → Medium
2021-03-09 21:40:33 Corey Bryant neutron (Ubuntu Hirsute): status New → Triaged
2021-03-09 21:40:45 Corey Bryant bug task added cloud-archive
2021-03-09 21:40:56 Corey Bryant nominated for series cloud-archive/ussuri
2021-03-09 21:40:56 Corey Bryant bug task added cloud-archive/ussuri
2021-03-09 21:40:56 Corey Bryant nominated for series cloud-archive/victoria
2021-03-09 21:40:56 Corey Bryant bug task added cloud-archive/victoria
2021-03-09 21:41:10 Corey Bryant cloud-archive/ussuri: importance Undecided → Medium
2021-03-09 21:41:10 Corey Bryant cloud-archive/ussuri: status New → Triaged
2021-03-09 21:41:24 Corey Bryant cloud-archive/victoria: importance Undecided → Medium
2021-03-09 21:41:24 Corey Bryant cloud-archive/victoria: status New → Triaged
2021-03-10 20:40:43 Launchpad Janitor neutron (Ubuntu Hirsute): status Triaged → Fix Released
2021-03-11 12:40:32 Łukasz Zemczak neutron (Ubuntu Groovy): status Triaged → Fix Committed
2021-03-11 12:40:40 Łukasz Zemczak bug added subscriber Ubuntu Stable Release Updates Team
2021-03-11 12:40:43 Łukasz Zemczak bug added subscriber SRU Verification
2021-03-11 12:40:48 Łukasz Zemczak tags seg → seg verification-needed verification-needed-groovy
2021-03-11 13:01:10 Łukasz Zemczak neutron (Ubuntu Focal): status Triaged → Fix Committed
2021-03-11 13:01:17 Łukasz Zemczak tags seg verification-needed verification-needed-groovy → seg verification-needed verification-needed-focal verification-needed-groovy
2021-03-11 18:07:53 Corey Bryant cloud-archive/victoria: status Triaged → Fix Committed
2021-03-11 18:07:56 Corey Bryant tags seg verification-needed verification-needed-focal verification-needed-groovy → seg verification-needed verification-needed-focal verification-needed-groovy verification-victoria-needed
2021-03-11 18:07:58 Corey Bryant cloud-archive/ussuri: status Triaged → Fix Committed
2021-03-18 15:34:06 Edward Hope-Morley description When there is a large amount of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle. The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server perspective. The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5 ======================================================================== Old Description While trying to debug a different issue, I encountered a situation where privsep hangs in the process of handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true): https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11 https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12 The issue gets reproduced reliably in the environment where I encountered it on all units. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and thus are marked dead by it. The processes though are shown as "active (running)" by systemd which adds to the confusion since they do indeed start from the systemd's perspective. systemctl --no-pager status neutron-openvswitch-agent.service ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent    Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)    Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago  Main PID: 247772 (/usr/bin/python)     Tasks: 4 (limit: 9830)    CGroup: /system.slice/neutron-openvswitch-agent.service            ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og            └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck -------------------------------------------------------- An strace shows that the privsep daemon tries to receive input from fd 3 which is the unix socket it uses to communicate with the client. 
However, this is just one tread out of many spawned by the privsep daemon so it is unlikely to be the root cause (there are 65 threads there in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/) # there is one extra neutron-openvvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate) root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'  860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock  248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock  363905 root grep --color=auto -P privsep.*openvswitch root@node2:~# strace -f -p 248453 2>&1 [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248453] recvfrom(3, root@node2:~# lsof -p 248453 | grep 3u privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM root@node2:~# ss -pax | grep 356522977 u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978                                   * 356522977 users:(("/usr/bin/python",pid=247567,fd=16)) u_str ESTAB 0 0 * 356522977                                   * 356522978 users:(("privsep-helper",pid=248453,fd=3)) root@node2:~# lsof -p 247567 | grep 16u /usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM [Impact] When there is a large amount of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle. The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server perspective. The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5 [Test Plan] * deploy Openstack Train/Ussuri/Victoria * need at least one compute host * enable neutron debug logging * create a load of interfaces on your compute host to create a large 'ip addr show' output * for ((i=0;i<400;i++)); do ip tuntap add mode tap tap-`uuidgen| cut -c1-11`; done * create a single vm * add floating ip * ping fip * create 20 ports and attach them to the vm * for ((i=0;i<20;i++)); do id=`uuidgen`; openstack port create --network private --security-group ab698dc1-4b87-4b46-9c38-cd53e3ec7492 X-$id; openstack server add port X-$id; done * attaching ports should not result in errors [Where problems could occur] No problems anticipated this patchset. 
======================================================================== When there is a large amount of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle. The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server perspective. The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5 ======================================================================== Old Description While trying to debug a different issue, I encountered a situation where privsep hangs in the process of handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true): https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11 https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12 The issue gets reproduced reliably in the environment where I encountered it on all units. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and thus are marked dead by it. The processes though are shown as "active (running)" by systemd which adds to the confusion since they do indeed start from the systemd's perspective. systemctl --no-pager status neutron-openvswitch-agent.service ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent    Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)    Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago  Main PID: 247772 (/usr/bin/python)     Tasks: 4 (limit: 9830)    CGroup: /system.slice/neutron-openvswitch-agent.service            ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og            └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck -------------------------------------------------------- An strace shows that the privsep daemon tries to receive input from fd 3 which is the unix socket it uses to communicate with the client. 
However, this is just one tread out of many spawned by the privsep daemon so it is unlikely to be the root cause (there are 65 threads there in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/) # there is one extra neutron-openvvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate) root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'  860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock  248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock  363905 root grep --color=auto -P privsep.*openvswitch root@node2:~# strace -f -p 248453 2>&1 [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248453] recvfrom(3, root@node2:~# lsof -p 248453 | grep 3u privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM root@node2:~# ss -pax | grep 356522977 u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978                                   * 356522977 users:(("/usr/bin/python",pid=247567,fd=16)) u_str ESTAB 0 0 * 356522977                                   * 356522978 users:(("privsep-helper",pid=248453,fd=3)) root@node2:~# lsof -p 247567 | grep 16u /usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM
2021-03-18 15:35:21 Edward Hope-Morley tags seg verification-needed verification-needed-focal verification-needed-groovy verification-victoria-needed → seg verification-done-groovy verification-needed verification-needed-focal verification-victoria-needed
2021-03-18 19:52:52 Edward Hope-Morley description [Impact] When there is a large amount of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle. The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server perspective. The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5 [Test Plan] * deploy Openstack Train/Ussuri/Victoria * need at least one compute host * enable neutron debug logging * create a load of interfaces on your compute host to create a large 'ip addr show' output * for ((i=0;i<400;i++)); do ip tuntap add mode tap tap-`uuidgen| cut -c1-11`; done * create a single vm * add floating ip * ping fip * create 20 ports and attach them to the vm * for ((i=0;i<20;i++)); do id=`uuidgen`; openstack port create --network private --security-group ab698dc1-4b87-4b46-9c38-cd53e3ec7492 X-$id; openstack server add port X-$id; done * attaching ports should not result in errors [Where problems could occur] No problems anticipated this patchset. ======================================================================== When there is a large amount of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle. The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server perspective. The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5 ======================================================================== Old Description While trying to debug a different issue, I encountered a situation where privsep hangs in the process of handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true): https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11 https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12 The issue gets reproduced reliably in the environment where I encountered it on all units. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and thus are marked dead by it. The processes though are shown as "active (running)" by systemd which adds to the confusion since they do indeed start from the systemd's perspective. 
systemctl --no-pager status neutron-openvswitch-agent.service ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent    Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)    Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago  Main PID: 247772 (/usr/bin/python)     Tasks: 4 (limit: 9830)    CGroup: /system.slice/neutron-openvswitch-agent.service            ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og            └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck -------------------------------------------------------- An strace shows that the privsep daemon tries to receive input from fd 3 which is the unix socket it uses to communicate with the client. However, this is just one tread out of many spawned by the privsep daemon so it is unlikely to be the root cause (there are 65 threads there in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/) # there is one extra neutron-openvvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate) root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'  860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock  248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock  363905 root grep --color=auto -P privsep.*openvswitch root@node2:~# strace -f -p 248453 2>&1 [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248453] recvfrom(3, root@node2:~# lsof -p 248453 | grep 3u privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM root@node2:~# ss -pax | grep 356522977 u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978                                   * 356522977 users:(("/usr/bin/python",pid=247567,fd=16)) u_str ESTAB 0 0 * 356522977                                   * 356522978 users:(("privsep-helper",pid=248453,fd=3)) root@node2:~# lsof -p 247567 | grep 16u /usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM [Impact] When there is a large amount of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle. The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server perspective. 
The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5 [Test Plan]   * deploy Openstack Train/Ussuri/Victoria   * need at least one compute host   * enable neutron debug logging   * create a load of interfaces on your compute host to create a large 'ip addr show' output   * for ((i=0;i<400;i++)); do ip tuntap add mode tap tap-`uuidgen| cut -c1-11`; done   * create a single vm   * add floating ip   * ping fip   * create 20 ports and attach them to the vm   * for ((i=0;i<20;i++)); do id=`uuidgen`; openstack port create --network private --security-group __SG__ X-$id; openstack server add port __VM__ X-$id; done   * attaching ports should not result in errors [Where problems could occur] No problems anticipated this patchset. ======================================================================== When there is a large amount of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang since the RPC call result sent by the privsep daemon over a unix socket exceeds the message sizes that the msgpack library can handle. The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server perspective. The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5 ======================================================================== Old Description While trying to debug a different issue, I encountered a situation where privsep hangs in the process of handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true): https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11 https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12 The issue gets reproduced reliably in the environment where I encountered it on all units. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and thus are marked dead by it. The processes though are shown as "active (running)" by systemd which adds to the confusion since they do indeed start from the systemd's perspective. systemctl --no-pager status neutron-openvswitch-agent.service ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent    Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)    Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago  Main PID: 247772 (/usr/bin/python)     Tasks: 4 (limit: 9830)    CGroup: /system.slice/neutron-openvswitch-agent.service            ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og            └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck -------------------------------------------------------- An strace shows that the privsep daemon tries to receive input from fd 3 which is the unix socket it uses to communicate with the client. 
However, this is just one tread out of many spawned by the privsep daemon so it is unlikely to be the root cause (there are 65 threads there in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/) # there is one extra neutron-openvvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate) root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'  860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock  248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock  363905 root grep --color=auto -P privsep.*openvswitch root@node2:~# strace -f -p 248453 2>&1 [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 248453] recvfrom(3, root@node2:~# lsof -p 248453 | grep 3u privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM root@node2:~# ss -pax | grep 356522977 u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978                                   * 356522977 users:(("/usr/bin/python",pid=247567,fd=16)) u_str ESTAB 0 0 * 356522977                                   * 356522978 users:(("privsep-helper",pid=248453,fd=3)) root@node2:~# lsof -p 247567 | grep 16u /usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM
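As a companion to the [Test Plan] above, it can help to confirm up front that the host really produces enough interface data to matter before creating the VM and attaching ports. The Python sketch below serializes the local 'ip addr show' output with msgpack and compares it against an assumed 1 MiB threshold; the threshold is illustrative only, since the effective limit depends on the msgpack and oslo.privsep versions installed.

    import subprocess
    import msgpack

    # Capture the same data the test plan inflates by creating tap interfaces.
    output = subprocess.check_output(["ip", "addr", "show"]).decode()
    payload = msgpack.packb(output, use_bin_type=True)

    assumed_limit = 1024 * 1024  # illustrative cap, not the exact privsep value
    verdict = "large enough to matter" if len(payload) > assumed_limit else "still small"
    print("serialized 'ip addr show' is %d bytes (assumed limit %d): %s"
          % (len(payload), assumed_limit, verdict))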
2021-03-18 20:06:34 Edward Hope-Morley tags seg verification-done-groovy verification-needed verification-needed-focal verification-victoria-needed → seg verification-done-groovy verification-needed verification-needed-focal verification-victoria-done
2021-03-18 22:08:59 Edward Hope-Morley tags seg verification-done-groovy verification-needed verification-needed-focal verification-victoria-done → seg verification-done-focal verification-done-groovy verification-needed verification-victoria-done
2021-03-22 19:20:53 Corey Bryant tags seg verification-done-focal verification-done-groovy verification-needed verification-victoria-done → seg verification-done verification-done-focal verification-done-groovy verification-victoria-done
2021-03-23 23:42:22 Brian Murray removed subscriber Ubuntu Stable Release Updates Team
2021-03-23 23:43:57 Launchpad Janitor neutron (Ubuntu Focal): status Fix Committed → Fix Released
2021-03-23 23:52:18 Launchpad Janitor neutron (Ubuntu Groovy): status Fix Committed → Fix Released
2021-03-24 07:15:42 Chris MacNaughton cloud-archive/victoria: status Fix Committed → Fix Released
2021-03-29 14:06:12 Chris MacNaughton tags seg verification-done verification-done-focal verification-done-groovy verification-victoria-done → seg verification-done verification-done-focal verification-done-groovy verification-ussuri-needed verification-victoria-done
2021-03-31 18:52:55 Edward Hope-Morley tags seg verification-done verification-done-focal verification-done-groovy verification-ussuri-needed verification-victoria-done → seg verification-done verification-done-focal verification-done-groovy verification-ussuri-done verification-victoria-done
2021-04-01 12:55:53 Chris MacNaughton cloud-archive/ussuri: status Fix Committed → Fix Released
2021-04-06 13:34:50 Corey Bryant cloud-archive: status Fix Committed → Fix Released
2021-07-21 13:44:50 Christian Rohmann bug added subscriber Christian Rohmann
2021-07-28 23:18:09 Brian Murray python-oslo.privsep (Ubuntu Groovy): status New → Won't Fix
2021-12-01 07:07:19 Dr. Jens Harbott bug added subscriber Dr. Jens Harbott
2021-12-01 07:25:55 OpenStack Infra oslo.privsep: status New → In Progress
2022-03-22 14:10:11 OpenStack Infra oslo.privsep: status In Progress → Fix Released
2023-08-04 12:37:41 Edward Hope-Morley nominated for series Ubuntu Jammy
2023-08-04 12:37:41 Edward Hope-Morley bug task added neutron (Ubuntu Jammy)
2023-08-04 12:37:41 Edward Hope-Morley bug task added python-oslo.privsep (Ubuntu Jammy)
2023-08-04 12:38:09 Edward Hope-Morley bug task deleted python-oslo.privsep (Ubuntu Groovy)
2023-08-04 12:38:14 Edward Hope-Morley bug task deleted python-oslo.privsep (Ubuntu Hirsute)