~nikolay.vinogradov and I found the root cause after a long debugging session. What we have: * a bionic-ussuri deployment with the HWE (focal 5.4) kernel; * 2 processes communicating over a type=STREAM unix socket (blocking); * Unix sockets have max buffer limits which means the amount of data that can be written there before a blocking occurs is far from infinite: https://elixir.bootlin.com/linux/v5.4.67/source/include/net/sock.h#L2557 * the privsep daemon process is the server which responds to a client (neutron-openvswitch-agent) request "get_devices_with_ip" * File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 1364, in get_devices_with_ip * File "/usr/lib/python3/dist-packages/oslo_privsep/daemon.py", line 195, in remote_call * when debug logging is enabled, the privsep daemon returns a log message to be printed by the caller to its log which includes the full reply; * the client has a separate reader thread which reads messages from the unix socket in a loop and sets results for previously created Futures associated with message ids. The original thread that initiates a request waits for a result associated with the Future to appear; https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L120-L124 (thread creation) https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L130-L155 (the thread's main function) https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L172 (the code that waits for the result of a future to appear). * msgpack is used for serialization of data sent of the unix socket; * the node has hardware offloading enabled and there are many network interfaces registered with the kernel. The `ip addr show` output alone has 764 lines: root@node2:~# ip a s | wc -l 764 * The log message that the privsep daemon attempts to send is 1483120 bytes = 1.414 MiB (not counting the overhead). https://paste.ubuntu.com/p/Sf2wrcpWvB/ (Pdb) where /usr/bin/privsep-helper(10)() -> sys.exit(helper_main()) /usr/lib/python3/dist-packag /usr/lib/python3/dist-packages/oslo_privsep/daemon.py(170)emit() -> self.channel.send((None, (Message.LOG, data))) > /usr/lib/python3/dist-packages/oslo_privsep/comm.py(202)send() -> self.writer.send(msg) (Pdb) len(str(msg)) 1483120 * The client reader thread tries to receive this message over the unix socket but receives an error from msgpack which does not like the message size: (Pdb) l 131 import rpdb ; rpdb.set_trace() 132 with self.lock: 133 self.running = True 134 for msg in reader: 135 msgid, data = msg 136 -> if msgid is None: 137 self.out_of_band(data) 138 else: 139 with self.lock: 140 if msgid not in self.outstanding_msgs: 141 raise AssertionError("msgid should in " (Pdb) n > /usr/lib/python3/dist-packages/oslo_privsep/comm.py(137)_reader_main() -> self.out_of_band(data) (Pdb) n > /usr/lib/python3/dist-packages/oslo_privsep/comm.py(134)_reader_main() -> for msg in reader: (Pdb) n ValueError: ('%s exceeds max_str_len(%s)', 1482520, 1048576) https://git.launchpad.net/ubuntu/+source/python-msgpack/tree/msgpack/unpack.h?h=ubuntu/focal#n229 PyErr_Format(PyExc_ValueError, "%u exceeds max_str_len(%zd)", l, u->max_str_len); https://github.com/msgpack/msgpack-python/blob/v0.6.2/msgpack/unpack.h#L226-L229 * The above exception never gets logged by neutron-openvswitch-agent and the requesting thread is stuck while waiting for a result from a Future object.