Comment 5 for bug 1896734

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote : Re: A privsep daemon hangs when debug logging is enabled

~nikolay.vinogradov and I found the root cause after a long debugging session.

What we have:

* a bionic-ussuri deployment with the HWE (focal 5.4) kernel;

* 2 processes communicating over a type=STREAM unix socket (blocking);
* Unix sockets have max buffer limits which means the amount of data that can be written there before a blocking occurs is far from infinite: https://elixir.bootlin.com/linux/v5.4.67/source/include/net/sock.h#L2557

* the privsep daemon process is the server which responds to a client (neutron-openvswitch-agent) request "get_devices_with_ip"

   * File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 1364, in get_devices_with_ip
   * File "/usr/lib/python3/dist-packages/oslo_privsep/daemon.py", line 195, in remote_call

* when debug logging is enabled, the privsep daemon returns a log message to be printed by the caller to its log which includes the full reply;

* the client has a separate reader thread which reads messages from the unix socket in a loop and sets results for previously created Futures associated with message ids. The original thread that initiates a request waits for a result associated with the Future to appear;

https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L120-L124 (thread creation)
https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L130-L155 (the thread's main function)

https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L172 (the code that waits for the result of a future to appear).

* msgpack is used for serialization of data sent of the unix socket;

* the node has hardware offloading enabled and there are many network interfaces registered with the kernel. The `ip addr show` output alone has 764 lines:

root@node2:~# ip a s | wc -l
764

* The log message that the privsep daemon attempts to send is 1483120 bytes = 1.414 MiB (not counting the overhead).

https://paste.ubuntu.com/p/Sf2wrcpWvB/
(Pdb) where
  /usr/bin/privsep-helper(10)<module>()
-> sys.exit(helper_main())
  /usr/lib/python3/dist-packag
  /usr/lib/python3/dist-packages/oslo_privsep/daemon.py(170)emit()
-> self.channel.send((None, (Message.LOG, data)))
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(202)send()
-> self.writer.send(msg)

(Pdb) len(str(msg))
1483120

* The client reader thread tries to receive this message over the unix socket but receives an error from msgpack which does not like the message size:

(Pdb) l
131 import rpdb ; rpdb.set_trace()
132 with self.lock:
133 self.running = True
134 for msg in reader:
135 msgid, data = msg
136 -> if msgid is None:
137 self.out_of_band(data)
138 else:
139 with self.lock:
140 if msgid not in self.outstanding_msgs:
141 raise AssertionError("msgid should in "
(Pdb) n
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(137)_reader_main()
-> self.out_of_band(data)
(Pdb) n
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(134)_reader_main()
-> for msg in reader:
(Pdb) n
ValueError: ('%s exceeds max_str_len(%s)', 1482520, 1048576)

https://git.launchpad.net/ubuntu/+source/python-msgpack/tree/msgpack/unpack.h?h=ubuntu/focal#n229
        PyErr_Format(PyExc_ValueError, "%u exceeds max_str_len(%zd)", l, u->max_str_len);
https://github.com/msgpack/msgpack-python/blob/v0.6.2/msgpack/unpack.h#L226-L229

* The above exception never gets logged by neutron-openvswitch-agent and the requesting thread is stuck while waiting for a result from a Future object.