eventlet monkey-patching breaks AMQP heartbeat on uWSGI

Bug #1825584 reported by iain MacDonnell on 2019-04-19
This bug affects 3 people
Affects: OpenStack Compute (nova)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Stein nova-api running under uWSGI has an AMQP problem. The first API call that requires RPC creates an AMQP connection and completes successfully. Normally, regular heartbeats would be sent from this point on to maintain the connection, but this is not happening. After a few minutes, the AMQP server (RabbitMQ, in my case) notices that there have been no heartbeats and drops the connection. A later nova API call that requires RPC tries to use the old connection, throws a "connection reset by peer" exception, and fails. A mailing-list response suggests that this also affects mod_wsgi:

http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005310.html

I've discovered that this problem seems to be caused by eventlet monkey-patching, which was introduced in:

https://github.com/openstack/nova/commit/23ba1c690652832c655d57476630f02c268c87ae

It was later rearranged in:

https://github.com/openstack/nova/commit/3c5e2b0e9fac985294a949852bb8c83d4ed77e04

but this problem remains.

If I comment out the import of nova.monkey_patch in nova/api/openstack/__init__.py the problem goes away.

Seems that eventlet monkey-patching and uWSGI are not getting along for some reason...

description: updated
Lee Yarwood (lyarwood) wrote :

https://review.opendev.org/#/c/647310/ hasn't landed in stable/stein yet, have you tested with it applied?

Matthew Booth (mbooth-9) wrote :

I'm guessing so, because he mentions patching nova.monkey_patch, which is only introduced in that change.

The mod_wsgi sleuthing in the linked ML post is excellent, btw. However, note that the only reason we continue to do eventlet monkey patching for wsgi callers is that we unfortunately still require it: it has an explicit caller in multi-cell instance list, and possibly other places. Nobody likes this, and I think Mel Witt was working on patches to fix it, but until we get rid of it the focus needs to be on running wsgi such that it's not broken with eventlet: there's not a lot we can do if mod_wsgi (or uwsgi if it turns out to work the same way) isn't running us at all.

Can mod_wsgi be configured such that it *will* run timers?

iain MacDonnell (imacdonn) wrote :

Yes. Although that change is not in stable/stein yet, apparently it has been back-ported in RDO. I tried reverting it, but the problem persisted until I also reverted the change that originally added monkey-patching to the API.

Note that I am using uWSGI, not mod_wsgi. It's not yet completely clear to me if mod_wsgi has the same issue, or if the other case has a different root-cause.

iain MacDonnell (imacdonn) wrote :

Reproduced the problem (easily) with mod_wsgi, and confirmed that removing the monkey-patching "fixes" it.

iain MacDonnell (imacdonn) wrote :

Also, FWIW (probably not much), eventlet.monkey_patch(thread=False) seems to allow it to work OK.
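
For illustration, a minimal sketch of the workaround mentioned above, assuming eventlet is installed. The helper name patch_all_but_threads is hypothetical and is not nova's actual nova.monkey_patch code; it only shows the call with threading left unpatched and checks the result via eventlet.patcher.is_monkey_patched.

```python
# Sketch only: patch everything eventlet patches by default, except threading.
# In a real service this has to run before other modules import socket,
# threading, and friends.
try:
    import eventlet
    from eventlet import patcher
except ImportError:  # eventlet not available in this environment
    eventlet = None
    patcher = None


def patch_all_but_threads():
    """Hypothetical helper: returns whether 'thread' ended up patched,
    or None when eventlet is not installed."""
    if eventlet is None:
        return None
    # thread=False leaves threading as real OS threads, so a background
    # heartbeat thread keeps running without help from the eventlet hub.
    eventlet.monkey_patch(thread=False)
    return patcher.is_monkey_patched("thread")


if __name__ == "__main__":
    print("thread patched:", patch_all_but_threads())
```

Whether leaving threading unpatched is safe for the other places where nova relies on eventlet is not established in this report.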

Damien Ciabrini (dciabrin) wrote :

As a follow-up to the thread on the ML [1], it appears that the AMQP heartbeat issues that we were seeing in other OpenStack services running under Apache mod_wsgi were due to a change in the container healthcheck.

In Stein, TripleO configures the interval between two healthchecks to be 60s + random(45)s, which is too long to ensure that mod_wsgi can always schedule eventlet timers and answer AMQP heartbeat packets in a timely manner.

The work to fix that specific problem is tracked in https://bugs.launchpad.net/tripleo/+bug/1826281

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005310.html

Damien Ciabrini (dciabrin) wrote :

Actually I'm taking comment #6 back... the issue I'm experiencing when running nova under mod_wsgi is exactly the one reported here.

Running the AMQP heartbeat thread under mod_wsgi doesn't work when the threading library is monkey_patched, because the thread waits on a data structure [1] that has been monkey patched [2], which makes it yield its execution instead of sleeping for 15s.

Because mod_wsgi stops the execution of its embedded interpreter, the AMQP heartbeat thread can't be resumed until there's a message to be processed in the mod_wsgi queue, which would resume the python interpreter and make eventlet resume the thread.

Disabling monkey-patching in nova_api makes the scheduling issue go away.
Note: other services like heat-api do not use monkey patching and aren't affected, so this seems to confirm that monkey-patching shouldn't happen in nova_api running under mod_wsgi in the first place.

[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L904
[2] https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils.py#L182
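
The wait pattern described in this comment can be sketched with the standard library alone. This is not the oslo.messaging code; the names and the scaled-down timings are assumptions for illustration (a 60s timeout checked at rate 2 would give the 15s wait mentioned above, as 60 / 2 / 2). With real OS threads, Event.wait(timeout) sleeps and wakes on its own. Once threading is monkey-patched, the same wait becomes a cooperative yield, and the loop only advances when the interpreter is actually running.

```python
import threading
import time

# Assumed, scaled-down values so the demo finishes quickly.
HEARTBEAT_TIMEOUT = 0.2   # stand-in for the heartbeat timeout threshold
HEARTBEAT_RATE = 2        # how often the timeout is checked per interval
WAIT = HEARTBEAT_TIMEOUT / HEARTBEAT_RATE / 2.0


def heartbeat_loop(exit_event, sent):
    """Record a 'heartbeat', then wait on the event until the next one."""
    while not exit_event.is_set():
        sent.append(time.monotonic())  # stand-in for sending an AMQP heartbeat
        # With unpatched threading this blocks in the OS and returns after
        # WAIT seconds (or earlier if the event is set).
        exit_event.wait(timeout=WAIT)


def run_demo(duration=0.5):
    """Run the heartbeat thread for `duration` seconds; return the timestamps."""
    exit_event = threading.Event()
    sent = []
    worker = threading.Thread(
        target=heartbeat_loop, args=(exit_event, sent), daemon=True
    )
    worker.start()
    time.sleep(duration)   # the "main" code does nothing; heartbeats continue
    exit_event.set()
    worker.join(timeout=1.0)
    return sent


if __name__ == "__main__":
    print("heartbeats sent:", len(run_demo()))
```

Under an unpatched interpreter the heartbeats keep flowing while the main thread is idle, which is the behaviour the AMQP server expects.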

Changed in nova:
status: New → Confirmed