A privsep daemon spawned by neutron-openvswitch-agent hangs when debug logging is enabled (large number of registered NICs) - an RPC response is too large for msgpack

Bug #1896734 reported by Dmitrii Shcherbakov
This bug affects 1 person
Affects                                Status        Importance  Assigned to     Milestone
OpenStack Neutron Open vSwitch Charm   Invalid       Undecided   Unassigned
Ubuntu Cloud Archive                   Fix Released  Medium      Unassigned
  Ussuri                               Fix Released  Medium      Unassigned
  Victoria                             Fix Released  Medium      Unassigned
neutron                                Fix Released  Medium      Rodolfo Alonso
oslo.privsep                           Fix Released  Undecided   Unassigned
neutron (Ubuntu)                       Fix Released  Medium      Unassigned
  Focal                                Fix Released  Medium      Unassigned
  Groovy                               Fix Released  Medium      Unassigned
  Hirsute                              Fix Released  Medium      Unassigned
  Jammy                                New           Undecided   Unassigned
python-oslo.privsep (Ubuntu)           New           Undecided   Unassigned
  Focal                                New           Undecided   Unassigned
  Jammy                                New           Undecided   Unassigned

Bug Description

[Impact]

When there is a large number of netdevs registered in the kernel and debug logging is enabled, neutron-openvswitch-agent and the privsep daemon spawned by it hang, since the RPC call result sent by the privsep daemon over a unix socket exceeds the maximum message size that the msgpack library will accept.

The impact of this is that enabling debug logging on the cloud completely stalls neutron-openvswitch-agents and makes them "dead" from the Neutron server perspective.

The issue is summarized in detail in comment #5 https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5
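
For illustration, a minimal sketch of the failing deserialization (assumes msgpack 0.6.x, as shipped for Ussuri on Focal; the payload size matches the log message measured in comment #5):

  import msgpack

  # ~1.4 MiB string: the size of the debug log message observed in this bug.
  reply = "x" * 1483120
  buf = msgpack.packb(reply, use_bin_type=True)

  # With msgpack 0.6.x defaults, the Unpacker caps max_str_len at 1 MiB,
  # so the client-side reader raises before the message can be delivered:
  unpacker = msgpack.Unpacker(raw=False)
  unpacker.feed(buf)
  next(unpacker)  # ValueError: 1483120 exceeds max_str_len(1048576)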

[Test Plan]

  * deploy Openstack Train/Ussuri/Victoria
  * need at least one compute host
  * enable neutron debug logging
  * create a load of interfaces on your compute host to create a large 'ip addr show' output
  * for ((i=0;i<400;i++)); do ip tuntap add mode tap tap-`uuidgen| cut -c1-11`; done
  * create a single vm
  * add floating ip
  * ping fip
  * create 20 ports and attach them to the vm
  * for ((i=0;i<20;i++)); do id=`uuidgen`; openstack port create --network private --security-group __SG__ X-$id; openstack server add port __VM__ X-$id; done
  * attaching ports should not result in errors

[Where problems could occur]

No problems are anticipated with this patchset.

========================================================================
Old Description

While trying to debug a different issue, I encountered a situation where privsep hangs in the process of handling a request from neutron-openvswitch-agent when debug logging is enabled (juju debug-log neutron-openvswitch=true):

https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12

The issue gets reproduced reliably in the environment where I encountered it on all units. As a result, neutron-openvswitch-agent services hang while waiting for a response from the privsep daemon and do not progress past basic initialization. They never post any state back to the Neutron server and thus are marked dead by it.

The processes, though, are shown as "active (running)" by systemd, which adds to the confusion since they do indeed start from systemd's perspective.

systemctl --no-pager status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
   Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
 Main PID: 247772 (/usr/bin/python)
    Tasks: 4 (limit: 9830)
   CGroup: /system.slice/neutron-openvswitch-agent.service
           ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
           └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck

--------------------------------------------------------

An strace shows that the privsep daemon tries to receive input from fd 3, which is the unix socket it uses to communicate with the client. However, this is just one thread out of many spawned by the privsep daemon, so it is unlikely to be the root cause on its own (there are 65 threads there in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/)

# there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
 860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
 248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
 363905 root grep --color=auto -P privsep.*openvswitch

root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248453] recvfrom(3,

root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM

root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978
                                  * 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977
                                  * 356522978 users:(("privsep-helper",pid=248453,fd=3))

root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM

description: updated
description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

The bugs below describe similar symptoms, but they cover the case where the privsep daemon is forked from the client process

https://bugs.launchpad.net/oslo.privsep/+bug/1887506
https://bugzilla.redhat.com/show_bug.cgi?id=1862364

while in our case it gets started via the /usr/bin/privsep-helper script in a child process:

CGroup: /system.slice/neutron-openvswitch-agent.service
           ├─108898 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_age
           └─109233 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --pr

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Ultimately, things hang when the logger attempts to access log handlers:

(Pdb) where
  /usr/bin/privsep-helper(10)<module>()
-> sys.exit(helper_main())
  /usr/lib/python3/dist-packages/oslo_privsep/daemon.py(558)helper_main()
-> Daemon(channel, context).run()
  /usr/lib/python3/dist-packages/oslo_privsep/daemon.py(370)run()
-> self.loop()
  /usr/lib/python3/dist-packages/oslo_privsep/daemon.py(512)loop()
-> future.add_done_callback(self._create_done_callback(msgid))
  /usr/lib/python3.6/concurrent/futures/_base.py(403)add_done_callback()
-> fn(self)
  /usr/lib/python3/dist-packages/oslo_privsep/daemon.py(476)_call_back()
-> {'msgid': msgid, 'reply': reply})
  /usr/lib/python3.6/logging/__init__.py(1630)debug()
-> self.log(DEBUG, msg, *args, **kwargs)
  /usr/lib/python3.6/logging/__init__.py(1674)log()
-> self.logger.log(level, msg, *args, **kwargs)
  /usr/lib/python3.6/logging/__init__.py(1374)log()
-> self._log(level, msg, args, **kwargs)
  /usr/lib/python3.6/logging/__init__.py(1444)_log()
-> self.handle(record)
  /usr/lib/python3.6/logging/__init__.py(1454)handle()
-> self.callHandlers(record)
> /usr/lib/python3.6/logging/__init__.py(1513)callHandlers()
-> for hdlr in c.handlers:

> /usr/lib/python3.6/logging/__init__.py(1454)handle()
-> self.callHandlers(record)
(Pdb) s
--Call--
> /usr/lib/python3.6/logging/__init__.py(1500)callHandlers()
-> def callHandlers(self, record):
(Pdb) n
> /usr/lib/python3.6/logging/__init__.py(1510)callHandlers()
-> c = self
(Pdb) l
1505 logger hierarchy. If no handler was found, output a one-off error
1506 message to sys.stderr. Stop searching up the hierarchy whenever a
1507 logger with the "propagate" attribute set to zero is found - that
1508 will be the last logger whose handlers are called.
1509 """
1510 -> c = self
1511 found = 0
1512 while c:
1513 for hdlr in c.handlers:
1514 found = found + 1
1515 if record.levelno >= hdlr.level:
(Pdb) n
> /usr/lib/python3.6/logging/__init__.py(1511)callHandlers()
-> found = 0
(Pdb) n
> /usr/lib/python3.6/logging/__init__.py(1512)callHandlers()
-> while c:
(Pdb) n
> /usr/lib/python3.6/logging/__init__.py(1513)callHandlers()
-> for hdlr in c.handlers:
(Pdb) c.handlers
<HANGS HERE>

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

After a redeployment the issue could no longer be reproduced. Let's see if it is a transient condition and whether it reappears in the future.

Changed in charm-neutron-openvswitch:
status: New → Incomplete
Changed in charm-neutron-openvswitch:
status: Incomplete → New
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

The issue got reproduced again.

The problem with accessing c.handlers was something else: in the (r)pdb debugger "c" is interpreted as the "continue" command, so "c.handlers" was treated as "continue"; "!c.handlers" has to be used instead, as shown below.
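
For reference, either of these forms bypasses pdb's command parsing:

  (Pdb) !c.handlers
  (Pdb) p c.handlers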

So the issue happens further down the line, when the PrivsepLogHandler tries to send data over a unix socket back to the client (/tmp/tmpn6wjd_lk/privsep.sock). The socket file no longer exists in the file system, which is normal: it is only needed briefly so that both processes (client & privsep) can obtain a handle to it, after which it is deleted (the kernel keeps the socket alive for as long as it is open).

https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/daemon.py#L190
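
For context, a minimal sketch of the pattern used there: a stdlib logging.Handler whose emit() serializes the record and writes it to a channel (names simplified from the linked daemon.py):

  import logging

  class ForwardingLogHandler(logging.Handler):
      """Sketch: ship log records to a peer process over a channel."""

      def __init__(self, channel):
          super().__init__()
          self.channel = channel  # anything with a blocking send(msg)

      def emit(self, record):
          data = dict(record.__dict__)
          # Serialise the message now so (potentially unserialisable)
          # args can be dropped, mirroring the linked daemon.py code.
          data['msg'] = record.getMessage()
          data['args'] = ()
          data['exc_info'] = None
          # If the peer stops reading, this blocking send is where the
          # daemon hangs - the crux of this bug.
          self.channel.send(('LOG', data))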

I can see that the logging locks get acquired just fine, so this isn't contention between threads (the ThreadPoolExecutor was also modified in-place to have only 1 thread to prove this).

...
862 if rv:
863 -> self.acquire()
864 try:
865 self.emit(record)
866 finally:
867 self.release()
868 return rv
(Pdb) dir(self.lock)
['__class__', '__delattr__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_acquire_restore', '_is_owned', '_release_save', 'acquire', 'release']
(Pdb) self.lock
<unlocked _thread.RLock object owner=0 count=0 at 0x7f9f29aaa120>

> /usr/lib/python3/dist-packages/oslo_privsep/daemon.py(170)emit()
-> self.channel.send((None, (Message.LOG, data)))
(Pdb) l
165
166 # serialise msg now so we can drop (potentially unserialisable) args
167 data['msg'] = record.getMessage()
168 data['args'] = ()
169
170 -> self.channel.send((None, (Message.LOG, data)))
171
172
173 class _ClientChannel(comm.ClientChannel):
174 """Our protocol, layered on the basic primitives in comm.ClientChannel"""
175
(Pdb) self.channel
<oslo_privsep.comm.ServerChannel object at 0x7f9f29a98cf8>

(Pdb) self.channel.wlock
<unlocked _thread.lock object at 0x7f9f29af0800>
(Pdb) self.channel.rlock
<unlocked _thread.lock object at 0x7f9f29af0fd0>

And here is the place where the execution halts:

(Pdb) self.writesock
<socket.socket fd=3, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, raddr=/tmp/tmpn6wjd_lk/privsep.sock>

(Pdb) l
 51 self.writesock = writesock
 52
 53 def send(self, msg):
 54 buf = msgpack.packb(msg, use_bin_type=True,
 55 unicode_errors='surrogateescape')
 56 -> self.writesock.sendall(buf)
 57
 58 def close(self):
 59 # Hilarious. `socket._socketobject.close()` doesn't actually
 60 # call `self._sock.close()`. Oh well, we really wanted a half
 61 # close anyway.

(Pdb) s

<hangs here>

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

~nikolay.vinogradov and I found the root cause after a long debugging session.

What we have:

* a bionic-ussuri deployment with the HWE (focal 5.4) kernel;

* 2 processes communicating over a type=STREAM unix socket (blocking);
* Unix sockets have maximum buffer limits, which means the amount of data that can be written before a write blocks is far from infinite: https://elixir.bootlin.com/linux/v5.4.67/source/include/net/sock.h#L2557

* the privsep daemon process is the server which responds to a client (neutron-openvswitch-agent) request "get_devices_with_ip"

   * File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 1364, in get_devices_with_ip
   * File "/usr/lib/python3/dist-packages/oslo_privsep/daemon.py", line 195, in remote_call

* when debug logging is enabled, the privsep daemon sends back a log message, to be printed by the caller to its own log, which includes the full reply;

* the client has a separate reader thread which reads messages from the unix socket in a loop and sets results for previously created Futures associated with message ids. The original thread that initiates a request waits for a result associated with the Future to appear;

https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L120-L124 (thread creation)
https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L130-L155 (the thread's main function)

https://github.com/openstack/oslo.privsep/blob/stable/ussuri/oslo_privsep/comm.py#L172 (the code that waits for the result of a future to appear).

* msgpack is used for serialization of the data sent over the unix socket;

* the node has hardware offloading enabled and there are many network interfaces registered with the kernel. The `ip addr show` output alone has 764 lines:

root@node2:~# ip a s | wc -l
764

* The log message that the privsep daemon attempts to send is 1483120 bytes = 1.414 MiB (not counting the overhead).

https://paste.ubuntu.com/p/Sf2wrcpWvB/
(Pdb) where
  /usr/bin/privsep-helper(10)<module>()
-> sys.exit(helper_main())
  /usr/lib/python3/dist-packag
  /usr/lib/python3/dist-packages/oslo_privsep/daemon.py(170)emit()
-> self.channel.send((None, (Message.LOG, data)))
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(202)send()
-> self.writer.send(msg)

(Pdb) len(str(msg))
1483120

* The client reader thread tries to receive this message over the unix socket but gets an error from msgpack, which rejects the message size (see the sketch after the pdb session below):

(Pdb) l
131 import rpdb ; rpdb.set_trace()
132 with self.lock:
133 self.running = True
134 for msg in reader:
135 msgid, data = msg
136 -> if msgid is None:
137 self.out_of_band(data)
138 else:
139 with self.lock:
140 if msgid not in self.outstanding_msgs:
141 raise AssertionError("msgid should in "
(Pdb) n
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(137)_reader_main()
-> self.out_of_band(data)
(Pdb) n
> /usr/lib/python3/dist-packages/oslo_privsep/comm.py(134)_reader_main()
-> for msg in reader:
(Pdb) n...
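
Putting the pieces together, here is a minimal self-contained sketch of the deadlock (plain Python, not oslo.privsep code; assumes msgpack 0.6.x defaults):

  import socket
  import threading

  import msgpack

  daemon_sock, client_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

  def reader_main():
      # Stand-in for the client's reader thread: feed bytes from the socket
      # into a single Unpacker and dispatch each decoded message.
      unpacker = msgpack.Unpacker(raw=False)  # max_str_len ~1 MiB in 0.6.x
      try:
          while True:
              chunk = client_sock.recv(4096)
              if not chunk:
                  return
              unpacker.feed(chunk)
              for msg in unpacker:
                  pass  # the real code resolves futures / out-of-band msgs here
      except ValueError as exc:
          # e.g. "1483120 exceeds max_str_len(1048576)": the thread exits
          # and the socket is never drained again.
          print("reader thread died:", exc)

  threading.Thread(target=reader_main, daemon=True).start()

  # Stand-in for the daemon emitting a debug LOG message that embeds the
  # full ~1.4 MiB reply ("LOG" is a placeholder for Message.LOG).
  log_msg = msgpack.packb((None, ("LOG", "x" * 1483120)), use_bin_type=True)

  # The reader dies almost immediately, the unix socket buffer (a few
  # hundred KB) fills up, and sendall() blocks forever - which is exactly
  # where the strace above shows privsep-helper stuck.
  daemon_sock.sendall(log_msg)
  print("never reached")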


summary: - A privsep daemon hangs when debug logging is enabled
+ A privsep daemon spawned by neutron-openvswitch-agent hangs when debug
+ logging is enabled (large number of registered NICs) - an RPC response
+ is too large for msgpack
Changed in charm-neutron-openvswitch:
status: New → Invalid
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

ip address show output on one of the affected nodes (for reference on how large it is):

https://paste.ubuntu.com/p/cRf655Y8kt/

description: updated
Trent Lloyd (lathiat)
tags: added: seg
Changed in neutron:
importance: Undecided → Medium
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Revision history for this message
Edward Hope-Morley (hopem) wrote :

The Neutron fix is released upstream and backported all the way to Train.

Changed in neutron:
status: New → Fix Released
Changed in neutron (Ubuntu Focal):
importance: Undecided → Medium
status: New → Triaged
Changed in neutron (Ubuntu Groovy):
importance: Undecided → Medium
status: New → Triaged
Changed in neutron (Ubuntu Hirsute):
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:17.1.0+git2021012815.0fb63f7297-0ubuntu5

---------------
neutron (2:17.1.0+git2021012815.0fb63f7297-0ubuntu5) hirsute; urgency=medium

  * d/p/improve-get-devices-with-ip-performance.patch: Performance of
    get_devices_with_ip is improved to limit the amount of information
    to be sent and reduce the number of syscalls. (LP: #1896734).

 -- Corey Bryant <email address hidden> Tue, 09 Mar 2021 16:01:26 -0500

Changed in neutron (Ubuntu Hirsute):
status: Triaged → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Dmitrii, or anyone else affected,

Accepted neutron into groovy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:17.1.0-0ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-groovy to verification-done-groovy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-groovy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in neutron (Ubuntu Groovy):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-groovy
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hello Dmitrii, or anyone else affected,

Accepted neutron into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:16.3.0-0ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in neutron (Ubuntu Focal):
status: Triaged → Fix Committed
tags: added: verification-needed-focal
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Dmitrii, or anyone else affected,

Accepted neutron into victoria-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:victoria-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-victoria-needed to verification-victoria-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-victoria-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Dmitrii, or anyone else affected,

Accepted neutron into ussuri-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ussuri-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ussuri-needed to verification-ussuri-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ussuri-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-victoria-needed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Verified Groovy using [Test Plan]:

# apt-cache policy neutron-common
neutron-common:
  Installed: 2:17.1.0-0ubuntu3
  Candidate: 2:17.1.0-0ubuntu3
  Version table:
 *** 2:17.1.0-0ubuntu3 500
        500 http://archive.ubuntu.com/ubuntu groovy-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     2:17.0.0-0ubuntu3 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu groovy-updates/main amd64 Packages
     2:17.0.0-0ubuntu1 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu groovy/main amd64 Packages

description: updated
tags: added: verification-done-groovy
removed: verification-needed-groovy
description: updated
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Verified Focal Victoria (UCA) using [Test Plan]:

# apt-cache policy neutron-common
neutron-common:
  Installed: 2:17.1.0-0ubuntu3~cloud0
  Candidate: 2:17.1.0-0ubuntu3~cloud0
  Version table:
 *** 2:17.1.0-0ubuntu3~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu focal-proposed/victoria/main amd64 Packages
        100 /var/lib/dpkg/status
     2:16.2.0-0ubuntu3 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
     2:16.0.0~b3~git2020041516.5f42488a9a-0ubuntu2 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu focal/main amd64 Packages

tags: added: verification-victoria-done
removed: verification-victoria-needed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Verified Focal Ussuri using [Test Plan]:

# apt-cache policy neutron-common
neutron-common:
  Installed: 2:16.3.0-0ubuntu3
  Candidate: 2:16.3.0-0ubuntu3
  Version table:
 *** 2:16.3.0-0ubuntu3 500
        500 http://archive.ubuntu.com/ubuntu focal-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     2:16.2.0-0ubuntu3 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
     2:16.0.0~b3~git2020041516.5f42488a9a-0ubuntu2 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu focal/main amd64 Packages

tags: added: verification-done-focal
removed: verification-needed-focal
tags: added: verification-done
removed: verification-needed
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for neutron has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:16.3.0-0ubuntu3

---------------
neutron (2:16.3.0-0ubuntu3) focal; urgency=medium

  * d/p/revert-dvr-remove-control-plane-arp-updates.patch: Cherry-picked
    from https://review.opendev.org/c/openstack/neutron/+/777903 to prevent
    permanent arp entries that never get deleted (LP: #1916761).
  * d/p/improve-get-devices-with-ip-performance.patch: Performance of
    get_devices_with_ip is improved to limit the amount of information
    to be sent and reduce the number of syscalls. (LP: #1896734).

neutron (2:16.3.0-0ubuntu2) focal; urgency=medium

  * Backport fix for dvr-snat missig rfp interfaces (LP: #1894843)
    - d/p/0001-Fix-deletion-of-rfp-interfaces-when-router-is-re-ena.patch

neutron (2:16.3.0-0ubuntu1) focal; urgency=medium

  * d/watch: Add trailing slash to Neutron URL.
  * New stable point release for OpenStack Ussuri (LP: #1915786).
  * d/p/fix-removal-of-dvr-src-mac-flows.patch,
    d/p/ovn-fix-inconsistent-igmp-configuration.patch: Removed after patch landed
    upstream.

 -- Corey Bryant <email address hidden> Mon, 08 Mar 2021 13:26:42 -0500

Changed in neutron (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:17.1.0-0ubuntu3

---------------
neutron (2:17.1.0-0ubuntu3) groovy; urgency=medium

  * d/p/revert-dvr-remove-control-plane-arp-updates.patch: Cherry-picked
    from https://review.opendev.org/c/openstack/neutron/+/777903 to prevent
    permanent arp entries that never get deleted (LP: #1916761).
  * d/p/improve-get-devices-with-ip-performance.patch: Performance of
    get_devices_with_ip is improved to limit the amount of information
    to be sent and reduce the number of syscalls. (LP: #1896734).

neutron (2:17.1.0-0ubuntu2) groovy; urgency=medium

  * Backport fix for dvr-snat missig rfp interfaces (LP: #1894843)
    - d/p/0001-Fix-deletion-of-rfp-interfaces-when-router-is-re-ena.patch

neutron (2:17.1.0-0ubuntu1) groovy; urgency=medium

  * d/watch: Fix typo in watch URL, add trailing slash.
  * New stable point release for OpenStack Victoria (LP: #1915785).
  * d/p/fix-removal-of-dvr-src-mac-flows.patch,
    d/p/ovn-fix-inconsistent-igmp-configuration.patch: Removed after fix landed
    upstream.

 -- Corey Bryant <email address hidden> Mon, 08 Mar 2021 13:00:21 -0500

Changed in neutron (Ubuntu Groovy):
status: Fix Committed → Fix Released
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

This bug was fixed in the package neutron - 2:17.1.0-0ubuntu3~cloud0
---------------

 neutron (2:17.1.0-0ubuntu3~cloud0) focal-victoria; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:17.1.0-0ubuntu3) groovy; urgency=medium
 .
   * d/p/revert-dvr-remove-control-plane-arp-updates.patch: Cherry-picked
     from https://review.opendev.org/c/openstack/neutron/+/777903 to prevent
     permanent arp entries that never get deleted (LP: #1916761).
   * d/p/improve-get-devices-with-ip-performance.patch: Performance of
     get_devices_with_ip is improved to limit the amount of information
     to be sent and reduce the number of syscalls. (LP: #1896734).

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

This bug was released into the Victoria cloud archive but a hiccup in the automatic backport process from distro caused this change notification to get lost.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

ah, wrong bug, ignore the last comment :)

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Comment 11 seems to have added incorrect tags for verification for the Ussuri cloud archive.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ussuri-needed to verification-ussuri-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ussuri-failed. In either case, details of your testing will help us make a better decision.

tags: added: verification-ussuri-needed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Verified Bionic Ussuri using [Test Plan]:

# apt-cache policy neutron-common
neutron-common:
  Installed: 2:16.3.0-0ubuntu3~cloud0
  Candidate: 2:16.3.0-0ubuntu3~cloud0
  Version table:
 *** 2:16.3.0-0ubuntu3~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-proposed/ussuri/main amd64 Packages
        100 /var/lib/dpkg/status
     2:12.1.1-0ubuntu3 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
     2:12.0.1-0ubuntu1 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu bionic/main amd64 Packages

tags: added: verification-ussuri-done
removed: verification-ussuri-needed
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

This bug was fixed in the package neutron - 2:16.3.0-0ubuntu3~cloud0
---------------

 neutron (2:16.3.0-0ubuntu3~cloud0) bionic-ussuri; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:16.3.0-0ubuntu3) focal; urgency=medium
 .
   * d/p/revert-dvr-remove-control-plane-arp-updates.patch: Cherry-picked
     from https://review.opendev.org/c/openstack/neutron/+/777903 to prevent
     permanent arp entries that never get deleted (LP: #1916761).
   * d/p/improve-get-devices-with-ip-performance.patch: Performance of
     get_devices_with_ip is improved to limit the amount of information
     to be sent and reduce the number of syscalls. (LP: #1896734).

Changed in cloud-archive:
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package neutron - 2:17.1.0+git2021012815.0fb63f7297-0ubuntu5~cloud0
---------------

 neutron (2:17.1.0+git2021012815.0fb63f7297-0ubuntu5~cloud0) focal-wallaby; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:17.1.0+git2021012815.0fb63f7297-0ubuntu5) hirsute; urgency=medium
 .
   * d/p/improve-get-devices-with-ip-performance.patch: Performance of
     get_devices_with_ip is improved to limit the amount of information
     to be sent and reduce the number of syscalls. (LP: #1896734).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/stein)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: stable/stein
Review: https://review.opendev.org/c/openstack/neutron/+/779659
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Brian Murray (brian-murray) wrote :

The Groovy Gorilla has reached end of life, so this bug will not be fixed for that release.

Changed in python-oslo.privsep (Ubuntu Groovy):
status: New → Won't Fix
Revision history for this message
Mohammed Naser (mnaser) wrote :

Unfortunately, I'm running into this issue right now in an environment that is running `linuxbridge`. I'm going to assume it is the number of `VXLAN` interfaces adding up that is causing this issue to occur, since with linuxbridge there are both `brq` and `VXLAN` interfaces.

I've got around 219 interfaces in this scenario:

root@tctrko1:~# ip link | grep qlen | wc -l
219

Revision history for this message
Mohammed Naser (mnaser) wrote :

The stack trace is the following:

Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: ------ Green Thread ------
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/bin/neutron-linuxbridge-agent:8 in <module>
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `sys.exit(main())`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/cmd/eventlet/plugins/linuxbridge_neutron_agent.py:21 in main
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `agent_main.main()`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:1055 in main
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `manager = LinuxBridgeManager(bridge_mappings, interface_mappings)`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:93 in __init__
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `self.check_vxlan_support()`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:745 in check_vxlan_sup
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `if self.vxlan_ucast_supported():`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:706 in vxlan_ucast_sup
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `ip_lib.vxlan_in_use(seg_id)):`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py:741 in vxlan_in_use
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `interfaces = get_devices_info(namespace)`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py:1379 in get_devices_info
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `devices = privileged.get_link_devices(namespace, **kwargs)`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/oslo_privsep/priv_context.py:247 in _wrap
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `return self.channel.remote_call(name, args, kwargs)`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/oslo_privsep/daemon.py:214 in remote_call
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `result = self.send_recv((Message.CALL.value, name, args, kwargs))`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/oslo_privsep/comm.py:172 in send_recv
D...


Revision history for this message
Mohammed Naser (mnaser) wrote :

I logged it locally, and this is how much data was put out (note that it exceeds msgpack's default max_str_len of 1048576 bytes):

root@tctrko1:~# cat /tmp/debug | wc -c
1179958

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.privsep (master)
Changed in oslo.privsep:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.privsep (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.privsep/+/819996
Committed: https://opendev.org/openstack/oslo.privsep/commit/c223dbced7d5a8d1920fe764cbce42cf844538e1
Submitter: "Zuul (22348)"
Branch: master

commit c223dbced7d5a8d1920fe764cbce42cf844538e1
Author: Mohammed Naser <email address hidden>
Date: Wed Dec 1 11:19:26 2021 +0400

    Bump max_buffer_size for Deserializer

    Since msgpack 0.6.0, some limits were introduced for the
    deserializer which were put in to avoid any denial of service
    attacks using msgpack. These limits were raised to 100MiB
    in the release of msgpack 1.0.0.

    The default buffer sizes that were implemented were quite low
    and when running certain `privsep` commands, especially for
    Neutron when using linux bridge, where there is a large amount
    of netdevs, privsep would crash since msgpack would fail to
    decode the message since it considers it too big:

      ValueError: 1174941 exceeds max_str_len(1048576)

    In this commit, the `max_buffer_size` is bumped to the value
    that ships with msgpack==1.0.0 to allow for users who don't
    have that to continue to function. Also, since `msgpack` is
    only being used by the internal API, we're not worried about
    a third party coming in and overwhelming the system by
    deserializing calls.

    This fix also addresses some weird behaviour where privsep
    will die and certain OpenStack agents would start to behave
    in a strange way once they hit a certain number of ports (since
    any privsep calls would start to fail).

    Closes-Bug: #1844822
    Closes-Bug: #1896734
    Related-Bug: #1928764
    Closes-Bug: #1952611
    Change-Id: I135917522daff95377d07566317ef0fc0d16e7cb

Changed in oslo.privsep:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.privsep 2.8.0

This issue was fixed in the openstack/oslo.privsep 2.8.0 release.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

The scenario discussed in this bug so far was in the context of neutron-openvswitch-agent, and a neutron patch resolved that aspect, but the problem has resurfaced in environments using neutron-l3-agent with very large numbers of ports. The cause/fix is as follows:

msgpack patch [1] landed in version v1.0.0 and made the following changes:
   * set Unpacker() default max_buffer_size to 1M (was 0)
   * set max_buffer_size to INT_MAX i.e. 2147483647 (~2G) if max_buffer_size is set to 0
   * set individual buffer sizes to max_buffer_size

Prior to this (e.g. in v0.6.2, which ships with Openstack Ussuri in Ubuntu Focal) the max_buffer_size default was 0 and individual buffers had varying and small defaults (see https://github.com/msgpack/msgpack-python/blob/997b524f06176aaa6bd255a046a8746e99b4f87d/msgpack/_unpacker.pyx#L368). They were also set to a default if max_buffer_size was 0, but *before* max_buffer_size was set to INT_MAX if it was also 0.

This meant that oslo.privsep in Openstack Ussuri, which does not set a max_buffer_size, was ending up with e.g. a max_str_len of 1024 * 1024 (1M), a limit it easily hits on hosts with large numbers of ports, with the result that oslo.privsep crashes with a log like:

$ journalctl -D /var/log/journal/ --unit neutron-l3-agent --grep ValueError
-- Boot 8fde93ff89f443f58f2a1d3307dad1d3 --
Jul 22 12:16:58 ubuntu neutron-l3-agent[32426]: ValueError: 1050710 exceeds max_str_len(1048576)

Note that this log only goes to the journal and not to neutron-l3-agent.log, since it is happening in privsep.

Patch [2], which sets max_buffer_size to 100M, landed in oslo.privsep 2.8.0. This version of privsep is available with Openstack Zed and beyond.

So in order to fix this for Openstack Ussuri upwards we need to look at backporting patch [2] down to python3-oslo.privsep 2.1.1 (in focal-updates). It is a very small patch, so I think it should be safe to do.

[1] https://github.com/msgpack/msgpack-python/commit/c356035a576c38db5ca232ede07b291087f1b8b2
[2] https://github.com/openstack/oslo.privsep/commit/c223dbced7d5a8d1920fe764cbce42cf844538e1
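
To make the interplay concrete, a small sketch of the shape of the fix (not the verbatim oslo.privsep diff): constructing the Unpacker with an explicit max_buffer_size, from which both msgpack 0.6.x and 1.x derive the per-type limits such as max_str_len:

  import msgpack

  # 100M, the value patch [2] ships. With max_buffer_size set explicitly,
  # msgpack derives max_str_len/max_bin_len from it on 0.6.x as well as 1.x.
  unpacker = msgpack.Unpacker(raw=False, max_buffer_size=100 * 1024 * 1024)

  # A string well past the old 1M max_str_len now round-trips fine:
  unpacker.feed(msgpack.packb("x" * 1483120, use_bin_type=True))
  assert len(next(unpacker)) == 1483120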

no longer affects: python-oslo.privsep (Ubuntu Groovy)
no longer affects: python-oslo.privsep (Ubuntu Hirsute)
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I'm going to propose the oslo.privsep sru in https://bugs.launchpad.net/ubuntu/+source/python-oslo.privsep/+bug/2029952 since this bug was used to sru neutron already

Revision history for this message
Nicolas Bock (nicolasbock) wrote :

I verified Focal / Ussuri

* Install python3-oslo.privsep-2.1.1-0ubuntu1
* Create several VMs
* Create hundreds of networks
* for (( i = 0; i < 20; i++ )); do for (( j = 1; j <= 6; j++ )); do id=$(uuidgen); openstack port create --network private X-${id}; openstack server add port jammy-133848-${j} X-${id}; done; done
* Observe error with journalctl --unit neutron-l3-agent.service --grep ValueError
* Install python3-oslo.privsep-2.1.1-0ubuntu2
* Attach more ports and verify that the error is no longer present.
