eventlet monkey-patching breaks AMQP heartbeat on uWSGI

Bug #1825584 reported by iain MacDonnell on 2019-04-19
48
This bug affects 6 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Low
Dan Smith
Stein
Low
sean mooney
Train
Low
sean mooney
Ussuri
Low
sean mooney

Bug Description

Stein nova-api running under uWSGI presents an AMQP issue. The first API call that requires RPC creates an AMQP connection and successfully completes. Normally regular heartbeats would be sent from this point on, to maintain the connection. This is not happening. After a few minutes, the AMQP server (rabbitmq, in my case) notices that there have been no heartbeats, and drops the connection. A later nova API call that requires RPC tries to use the old connection, and throws a "connection reset by peer" exception and the API call fails. A mailing-list response suggests that this is affecting mod_wsgi also:

http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005310.html

I've discovered that this problem seems to be caused by eventlet monkey-patching, which was introduced in:

https://github.com/openstack/nova/commit/23ba1c690652832c655d57476630f02c268c87ae

It was later rearranged in:

https://github.com/openstack/nova/commit/3c5e2b0e9fac985294a949852bb8c83d4ed77e04

but this problem remains.

If I comment out the import of nova.monkey_patch in nova/api/openstack/__init__.py the problem goes away.

Seems that eventlet monkey-patching and uWSGI are not getting along for some reason...

description: updated
Lee Yarwood (lyarwood) wrote :

https://review.opendev.org/#/c/647310/ hasn't landed in stable/stein yet, have you tested with it applied?

Matthew Booth (mbooth-9) wrote :

I'm guessing so, because he mentions patching nova.monkey_patch, which is only introduced in that change.

The mod_wsgi sleuthing in the linked ML post is excellent, btw. However, note that the only reason we continue to do eventlet monkey patching for wsgi callers is that we unfortunately still require it: it has an explicit caller in multi-cell instance list, and possibly other places. Nobody likes this, and I think Mel Witt was working on patches to fix it, but until we get rid of it the focus needs to be on running wsgi such that it's not broken with eventlet: there's not a lot we can do if mod_wsgi (or uwsgi if it turns out to work the same way) isn't running us at all.

Can mod_wsgi be configured such that it *will* run timers?

iain MacDonnell (imacdonn) wrote :

Yes, although that change is not in stable/stein yet, apparently it has been back-ported in RDO. I tried reverting it, but the problem persisted, until I also reverted the change that originally added monkey-patching to API.

Note that I am using uWSGI, not mod_wsgi. It's not yet completely clear to me if mod_wsgi has the same issue, or if the other case has a different root-cause.

iain MacDonnell (imacdonn) wrote :

Reproduced the problem (easily) with mod_wsgi, and confirmed that removing the monkey-patching "fixes" it.

iain MacDonnell (imacdonn) wrote :

Also, FWIW (probably not much), eventlet.monkey_patch(thread=False) seems to allow it to work OK.

Damien Ciabrini (dciabrin) wrote :

As a follow up of the the thread on the ML [1], it appears that the AMQP hearbeat issues that we were seeing in other OpenStack services running under Apache mod_wsgi were due to a change in container healthcheck.

In Stein, TripleO configures the interval between two healthchecks to be 60s + random(45)s, which is too long to ensure that mod_wsgi can always schedule eventlet timers and answers to AMQP heartbeat packets in a timely manner.

The work to fix that specific problem is tracked in https://bugs.launchpad.net/tripleo/+bug/1826281

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005310.html

Damien Ciabrini (dciabrin) wrote :

Actually I'm taking comment #6 back... the issue I'm experiencing when running nova under mod_wsgi is exactly the one reported here.

Running the AMQP heartbeat thread under mod_wsgi doesn't work when the threading library is monkey_patched, because the thread waits on a data structure [1] that has been monkey patched [2], which makes it yield its execution instead of sleeping for 15s.

Because mod_wsgi stops the execution of its embedded interpreter, the AMQP heartbeat thread can't be resumed until there's a message to be processed in the mod_wsgi queue, which would resume the python interpreter and make eventlet resume the thread.

Disabling monkey-patching in nova_api makes the scheduling issue go away.
Note: other services like heat-api do not use monkey patching and aren't affected, so this seem to confirm that monkey-patching shouldn't happen in nova_api running under mod_wsgi in the first place.

[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L904
[2] https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils.py#L182

Changed in nova:
status: New → Confirmed

Reviewed: https://review.opendev.org/662095
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a694952eacfe3a2dac34957cf95d5529eb89d4b2
Submitter: Zuul
Branch: stable/stein

commit a694952eacfe3a2dac34957cf95d5529eb89d4b2
Author: melanie witt <email address hidden>
Date: Wed May 29 21:32:11 2019 +0000

    Add reno about nova-api eventlet monkey-patching and rabbitmq

    This adds a known issue release note about eventlet monkey-patching in
    nova-api and workarounds.

    Related-Bug: #1825584
    Related-Bug: #1829062

    Change-Id: I22abd1f5377489dd809eb705c8e7aec2814ced0e

tags: added: in-stable-stein

Fix proposed to branch: master
Review: https://review.opendev.org/724188

Changed in nova:
assignee: nobody → sean mooney (sean-k-mooney)
status: Confirmed → In Progress
Changed in nova:
assignee: sean mooney (sean-k-mooney) → Dan Smith (danms)

Reviewed: https://review.opendev.org/724188
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d6f664524dfa4eecbe3e16fda0fec5359faf84c6
Submitter: Zuul
Branch: master

commit d6f664524dfa4eecbe3e16fda0fec5359faf84c6
Author: Sean Mooney <email address hidden>
Date: Tue Apr 28 23:20:16 2020 +0100

    Silence amqp heartbeat warning

    When the nova api is executing under uWSGI or MOD_WSGI
    the lifetime of the amqp heartbeat thread is controlled
    by the wsgi server. As a result when the nova api is run
    in this configuration we expect that the heartbeat thread
    will be suspended and heartbeats will be missed when the wsgi
    server suspends execution of the wsgi application.

    This change adds a python logging filter to suppress
    the reporting of heartbeat warnings as this behavior is
    expected. Since the operator cannot do anything to address
    the issue the warning is just noise and many operators
    and customers find it to be off-putting.

    Change-Id: I642b1e3ed6de2be4dcc19fe214f84095d2e1d31a
    Closes-Bug: #1825584

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/724725
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ca2fe01cc1c73e9ca7b2376ea566db0fe45a44ea
Submitter: Zuul
Branch: master

commit ca2fe01cc1c73e9ca7b2376ea566db0fe45a44ea
Author: Sean Mooney <email address hidden>
Date: Thu Apr 30 19:09:17 2020 +0100

    remove support of oslo.messaging 9.8.0 warning message

    This change removes support for supressing the
    heartbeat message from olso.messaging before 9.8.0

    Change-Id: I2035d5df31e43b730cd472cc438ec863bb538d62
    Related-Bug: #1825584

Reviewed: https://review.opendev.org/728054
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1caddc4250b4acbdb8a945c8f26a0792bee5e2ed
Submitter: Zuul
Branch: stable/ussuri

commit 1caddc4250b4acbdb8a945c8f26a0792bee5e2ed
Author: Sean Mooney <email address hidden>
Date: Tue Apr 28 23:20:16 2020 +0100

    Silence amqp heartbeat warning

    When the nova api is executing under uWSGI or MOD_WSGI
    the lifetime of the amqp heartbeat thread is controlled
    by the wsgi server. As a result when the nova api is run
    in this configuration we expect that the heartbeat thread
    will be suspended and heartbeats will be missed when the wsgi
    server suspends execution of the wsgi application.

    This change adds a python logging filter to suppress
    the reporting of heartbeat warnings as this behavior is
    expected. Since the operator cannot do anything to address
    the issue the warning is just noise and many operators
    and customers find it to be off-putting.

    Change-Id: I642b1e3ed6de2be4dcc19fe214f84095d2e1d31a
    Closes-Bug: #1825584
    (cherry picked from commit d6f664524dfa4eecbe3e16fda0fec5359faf84c6)

tags: added: in-stable-ussuri

Reviewed: https://review.opendev.org/728055
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ed0524606998847a7b1ea9103579d027db092e83
Submitter: Zuul
Branch: stable/ussuri

commit ed0524606998847a7b1ea9103579d027db092e83
Author: Sean Mooney <email address hidden>
Date: Thu Apr 30 19:09:17 2020 +0100

    remove support of oslo.messaging 9.8.0 warning message

    This change removes support for supressing the
    heartbeat message from olso.messaging before 9.8.0

    Change-Id: I2035d5df31e43b730cd472cc438ec863bb538d62
    Related-Bug: #1825584
    (cherry picked from commit ca2fe01cc1c73e9ca7b2376ea566db0fe45a44ea)

Change abandoned by sean mooney (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/728058
Reason: i based this on upper-constraitns not lower so this should not be backported to train

Changed in nova:
importance: Undecided → Low

Reviewed: https://review.opendev.org/728057
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=280007c870b671de9439efba71fd3916403d15da
Submitter: Zuul
Branch: stable/train

commit 280007c870b671de9439efba71fd3916403d15da
Author: Sean Mooney <email address hidden>
Date: Tue Apr 28 23:20:16 2020 +0100

    Silence amqp heartbeat warning

    When the nova api is executing under uWSGI or MOD_WSGI
    the lifetime of the amqp heartbeat thread is controlled
    by the wsgi server. As a result when the nova api is run
    in this configuration we expect that the heartbeat thread
    will be suspended and heartbeats will be missed when the wsgi
    server suspends execution of the wsgi application.

    This change adds a python logging filter to suppress
    the reporting of heartbeat warnings as this behavior is
    expected. Since the operator cannot do anything to address
    the issue the warning is just noise and many operators
    and customers find it to be off-putting.

    Changes:
      nova/config.py
    in python 3 log filters can simply be a function that
    returns true or false and accepts single argument which
    is the record to filter. To support python 2 the
    rabbit_heartbeat_filter has been converted into an instance
    of a new HeartbeatFilter class with its implementation moved
    to the filter method.

    Change-Id: I642b1e3ed6de2be4dcc19fe214f84095d2e1d31a
    Closes-Bug: #1825584
    (cherry picked from commit d6f664524dfa4eecbe3e16fda0fec5359faf84c6)
    (cherry picked from commit 1caddc4250b4acbdb8a945c8f26a0792bee5e2ed)

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers