Neutron-server + uwsgi deadlocks when running RPC workers

Bug #2062009 reported by Sebastian Lohff
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

In certain situations we observe that neutron-server + uwsgi shares locks between its native threads and its eventlet threads. As eventlet relies on being informed when a lock is released, this can lead to a deadlock: the eventlet thread waits indefinitely for a lock that has already been released. In our infrastructure this results in API requests being executed on the Neutron side while the caller never receives a response. For actions like port creation from e.g. Nova or Manila this leads to orphaned ports, as the caller simply retries creating the port.
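
To illustrate the failure mode, here is a minimal, self-contained sketch of the hazard. This is an illustration of the mechanism only, not the actual Neutron code path, and the timeout exists purely so the demo terminates:

# A monkey-patched "green" lock shared between a greenthread and a
# native OS thread: when the native thread releases the lock it
# notifies *its own* per-thread eventlet hub, not the hub the waiter
# is blocked on, so the waiter is never woken.
import eventlet
eventlet.monkey_patch()

import threading  # green after monkey_patch()

# the unpatched threading module still creates real OS threads
real_threading = eventlet.patcher.original('threading')

lock = threading.Lock()  # actually an eventlet green lock now
lock.acquire()           # held by the main greenthread


def native_release():
    # runs in a native OS thread, outside the main hub's control
    lock.release()


def green_waiter():
    # bound the wait so the demo terminates; the real code hangs here
    with eventlet.Timeout(5, False):
        lock.acquire()
        print("acquired")
        return
    print("lock.locked() is %s, yet the waiter was never woken"
          % lock.locked())


gt = eventlet.spawn(green_waiter)
eventlet.sleep(0)  # let the waiter block on the lock
real_threading.Thread(target=native_release).start()
gt.wait()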

To debug this further we reintroduced guru meditation reports into neutron-server[0] and configured uwsgi to send a SIGWINCH on harakiri[1], so that a guru meditation report is triggered whenever a uwsgi worker deadlocks.
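
For reference, wiring oslo.reports up to that signal looks roughly as follows (a sketch based on the standard oslo_reports API; the actual patch is in [0], and SIGWINCH matches what our uwsgi config sends on harakiri [1]):

# re-enable guru meditation reports in neutron-server and bind them to
# SIGWINCH instead of the default signal
import signal

from oslo_reports import guru_meditation_report as gmr

from neutron import version

gmr.TextGuruMeditation.setup_autorun(version.version_info,
                                     signum=signal.SIGWINCH)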

The two most interesting candidates appear to be a shared lock inside oslo_messaging and Python's logging lock, which is also taken from oslo_messaging code. Both cases identified in the traceback point to oslo_messaging and its RPC server (see the attached guru meditation report).

As all RPC servers should run inside neutron-rpc-server anyway (due to the uwsgi/neutron-rpc-server split), we should move these instances over there. This will also fix bug #1864418. One easy way to find such instances is to check via the backdoor (or a manual manhole installation, if the backdoor is not available) and search for instances of oslo_messaging.server.MessageHandlingServer via fo(). In our setup (due to the service_plugins enabled) we see RPC servers running from trunk and logapi:

>>> [ep for mhs in fo(oslo_messaging.server.MessageHandlingServer) for ep in mhs.dispatcher.endpoints]
[<neutron.services.logapi.rpc.server.LoggingApiSkeleton object at 0x7fb0d465ec10>, <neutron.services.trunk.rpc.server.TrunkSkeleton object at 0x7f622ec11cd0>]
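
For anyone reproducing this without the backdoor: manhole (the PyPI package of the same name) provides a similar REPL, and the fo() helper can be approximated with the garbage collector directly, along these lines:

# rough stand-in for the backdoor's fo() helper, usable from a manhole
# REPL (or any other way into the process) when the backdoor is off
import gc

import oslo_messaging

servers = [obj for obj in gc.get_objects()
           if isinstance(obj, oslo_messaging.server.MessageHandlingServer)]
print([ep for mhs in servers for ep in mhs.dispatcher.endpoints])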

The RPC servers should instead be started via start_rpc_listeners(), as sketched below.
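
A minimal sketch of that pattern; the plugin, callback class and topic name are hypothetical, while Connection(), create_consumer() and consume_in_threads() are the neutron_lib.rpc API that existing plugins use:

# RPC consumers are created lazily in start_rpc_listeners(), which only
# neutron-rpc-server invokes, so no RPC server ever runs inside the
# uwsgi API workers
from neutron_lib import rpc as n_rpc

TOPIC = 'q-example-plugin'  # hypothetical topic


class ExamplePluginCallback(object):
    """Methods on this object would be exposed over RPC."""


class ExampleServicePlugin(object):

    def start_rpc_listeners(self):
        conn = n_rpc.Connection()
        conn.create_consumer(TOPIC, [ExamplePluginCallback()],
                             fanout=False)
        return conn.consume_in_threads()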

Nova has had similar problems with eventlet and logging in the past, see [2][3]. Tests were done with Neutron Yoga (on our own branch stable/yoga-m3), but the issue is present in current master.

[0] https://github.com/sapcc/neutron/commit/a7c44263b70089d8106bed6d8d5d0e3ddf44d5ad
[1] https://github.com/sapcc/helm-charts/blob/7a93e91c3af16ad2eb91e0a1d176d56a26faa393/openstack/neutron/templates/etc/_uwsgi.ini.tpl#L46-L50
[2] https://github.com/sapcc/nova/blob/f61bd589796f0cd7ae37683de3d676e5edd9cf80/nova/virt/libvirt/host.py#L197-L201
[3] https://github.com/sapcc/nova/blob/f61bd589796f0cd7ae37683de3d676e5edd9cf80/nova/virt/libvirt/migration.py#L406-L407

Tags: oslo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/916112

Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/916123

Revision history for this message
Sebastian Lohff (sebageek) wrote :

From a quick grep through the service plugins in current Neutron master, it looks like trunk and logapi are the only two plugins that currently create an RPC server/consumer outside of a start_rpc_listeners() method.

tags: added: oslo
Changed in neutron:
importance: Undecided → High
importance: High → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/916123
Committed: https://opendev.org/openstack/neutron/commit/6170309157d6922db2bc6014f17487b74f37e08b
Submitter: "Zuul (22348)"
Branch: master

commit 6170309157d6922db2bc6014f17487b74f37e08b
Author: Sebastian Lohff <email address hidden>
Date: Mon Apr 15 15:39:27 2024 +0200

    Start logging plugin RPC via service framework

    Instead of the LoggingServiceDriverManager starting the RPC if any
    driver needs it, we now only start it when this is requested by neutron
    via start_rpc_listeners(). This is required when running neutron-server
    and neutron-rpc-server separately to run RPC only in neutron-rpc-server.

    Change-Id: I8d185cdc807e94098c137314bcaa2317a2f85ebe
    Partial-Bug: #2062009

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/neutron/+/916112
Committed: https://opendev.org/openstack/neutron/commit/ffcaeda32adf32388c322cfc6f7a8933ef94d2a9
Submitter: "Zuul (22348)"
Branch: master

commit ffcaeda32adf32388c322cfc6f7a8933ef94d2a9
Author: Sebastian Lohff <email address hidden>
Date: Mon Apr 15 16:14:50 2024 +0200

    Start trunk plugin RPC via service framework

    Instead of each individual driver setting up the RPC server (and setting
    the _rpc_backend attribute on the TrunkPlugin) we now check in the
    TrunkPlugin if any driver requires the RPC backend to be started.
    Additionally, we only start it when this is requested by Neutron via
    start_rpc_listeners(). This is required when running neutron-server and
    neutron-rpc-server separately to run RPC only in neutron-rpc-server.

    As we still need the notifiers of ServerSideRpcBackend to be
    created/started, we separate TrunkSkeleton (which is the RPC server
    implementation) and ServerSideRpcBackend (which is essentially only a
    notifier). In case RPC is required by a driver, we always start the
    notifier, but the RPC server only when requested via
    start_rpc_listeners().

    Change-Id: I2c6362b3320e534a6e65bd7701b5ac2feca42a49
    Closes-Bug: #2015275
    Closes-Bug: #2062009
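
Schematically, the change described above amounts to the following (a hypothetical simplification with stub classes; the real code lives in neutron.services.trunk.rpc and the linked review):

# the notifier half is always safe to create, while the server half is
# only started from start_rpc_listeners(), i.e. only inside
# neutron-rpc-server

class ServerSideRpcBackend(object):
    """Notifier only: pushes trunk events to agents, consumes nothing."""


class TrunkSkeleton(object):
    """RPC server side: would register consumers when started."""

    def __init__(self):
        self.rpc_servers = []  # stands in for the real consumers


class TrunkPlugin(object):

    def __init__(self, drivers_require_rpc=True):
        self._rpc_backend = None
        if drivers_require_rpc:
            # start the notifier whenever any driver needs RPC
            self._rpc_backend = ServerSideRpcBackend()

    def start_rpc_listeners(self):
        # only neutron-rpc-server calls this, so the RPC server never
        # runs inside the uwsgi API workers
        return TrunkSkeleton().rpc_servers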

Changed in neutron:
status: In Progress → Fix Released