* the privsep daemon process is the server which responds to a client (neutron-openvswitch-agent) request "get_devices_with_ip"
* File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 1364, in get_devices_with_ip
* File "/usr/lib/python3/dist-packages/oslo_privsep/daemon.py", line 195, in remote_call
* when debug logging is enabled, the privsep daemon returns a log message to be printed by the caller to its log which includes the full reply;
* the client has a separate reader thread which reads messages from the unix socket in a loop and sets results for previously created Futures associated with message ids. The original thread that initiates a request waits for a result associated with the Future to appear;
* msgpack is used for serialization of data sent of the unix socket;
* the node has hardware offloading enabled and there are many network interfaces registered with the kernel. The `ip addr show` output alone has 764 lines:
root@node2:~# ip a s | wc -l
764
* The log message that the privsep daemon attempts to send is 1483120 bytes = 1.414 MiB (not counting the overhead).
* The client reader thread tries to receive this message over the unix socket but receives an error from msgpack which does not like the message size:
(Pdb) l
131 import rpdb ; rpdb.set_trace()
132 with self.lock:
133 self.running = True
134 for msg in reader:
135 msgid, data = msg
136 -> if msgid is None:
137 self.out_of_band(data)
138 else:
139 with self.lock:
140 if msgid not in self.outstanding_msgs:
141 raise AssertionError("msgid should in "
(Pdb) n
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(137)_reader_main()
-> self.out_of_band(data)
(Pdb) n
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(134)_reader_main()
-> for msg in reader:
(Pdb) n
ValueError: ('%s exceeds max_str_len(%s)', 1482520, 1048576)
* The above exception never gets logged by neutron-openvswitch-agent and the requesting thread is stuck while waiting for a result from a Future object.
~nikolay.vinogradov and I found the root cause after a long debugging session.
What we have:
* a bionic-ussuri deployment with the HWE (focal 5.4) kernel;
* 2 processes communicating over a type=STREAM unix socket (blocking); /elixir. bootlin. com/linux/ v5.4.67/ source/ include/ net/sock. h#L2557
* Unix sockets have max buffer limits which means the amount of data that can be written there before a blocking occurs is far from infinite: https:/
* the privsep daemon process is the server which responds to a client (neutron- openvswitch- agent) request "get_devices_ with_ip"
* File "/usr/lib/ python3/ dist-packages/ neutron/ agent/linux/ ip_lib. py", line 1364, in get_devices_with_ip python3/ dist-packages/ oslo_privsep/ daemon. py", line 195, in remote_call
* File "/usr/lib/
* when debug logging is enabled, the privsep daemon returns a log message to be printed by the caller to its log which includes the full reply;
* the client has a separate reader thread which reads messages from the unix socket in a loop and sets results for previously created Futures associated with message ids. The original thread that initiates a request waits for a result associated with the Future to appear;
https:/ /github. com/openstack/ oslo.privsep/ blob/stable/ ussuri/ oslo_privsep/ comm.py# L120-L124 (thread creation) /github. com/openstack/ oslo.privsep/ blob/stable/ ussuri/ oslo_privsep/ comm.py# L130-L155 (the thread's main function)
https:/
https:/ /github. com/openstack/ oslo.privsep/ blob/stable/ ussuri/ oslo_privsep/ comm.py# L172 (the code that waits for the result of a future to appear).
* msgpack is used for serialization of data sent of the unix socket;
* the node has hardware offloading enabled and there are many network interfaces registered with the kernel. The `ip addr show` output alone has 764 lines:
root@node2:~# ip a s | wc -l
764
* The log message that the privsep daemon attempts to send is 1483120 bytes = 1.414 MiB (not counting the overhead).
https:/ /paste. ubuntu. com/p/Sf2wrcpWv B/ bin/privsep- helper( 10)<module> () helper_ main()) lib/python3/ dist-packag lib/python3/ dist-packages/ oslo_privsep/ daemon. py(170) emit() send((None, (Message.LOG, data))) python3/ dist-packages/ oslo_privsep/ comm.py( 202)send( ) send(msg)
(Pdb) where
/usr/
-> sys.exit(
/usr/
/usr/
-> self.channel.
> /usr/lib/
-> self.writer.
(Pdb) len(str(msg))
1483120
* The client reader thread tries to receive this message over the unix socket but receives an error from msgpack which does not like the message size:
(Pdb) l of_band( data) g_msgs: "msgid should in " python3/ dist-packages/ oslo_privsep/ comm.py( 137)_reader_ main() of_band( data) python3/ dist-packages/ oslo_privsep/ comm.py( 134)_reader_ main()
131 import rpdb ; rpdb.set_trace()
132 with self.lock:
133 self.running = True
134 for msg in reader:
135 msgid, data = msg
136 -> if msgid is None:
137 self.out_
138 else:
139 with self.lock:
140 if msgid not in self.outstandin
141 raise AssertionError(
(Pdb) n
> /usr/lib/
-> self.out_
(Pdb) n
> /usr/lib/
-> for msg in reader:
(Pdb) n
ValueError: ('%s exceeds max_str_len(%s)', 1482520, 1048576)
https:/ /git.launchpad. net/ubuntu/ +source/ python- msgpack/ tree/msgpack/ unpack. h?h=ubuntu/ focal#n229
PyErr_ Format( PyExc_ValueErro r, "%u exceeds max_str_len(%zd)", l, u->max_str_len); /github. com/msgpack/ msgpack- python/ blob/v0. 6.2/msgpack/ unpack. h#L226- L229
https:/
* The above exception never gets logged by neutron- openvswitch- agent and the requesting thread is stuck while waiting for a result from a Future object.