nova placement api non-responsive due to eventlet error

Bug #1829062 reported by Gerry Kopec on 2019-05-14
This bug affects 1 person
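+--------------------------+------------+-------------+
| Affects                  | Importance | Assigned to |
+--------------------------+------------+-------------+
| OpenStack Compute (nova) | Undecided  | Unassigned  |
| StarlingX                | High       | Gerry Kopec |
| tripleo                  | High       | Unassigned  |
+--------------------------+------------+-------------+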

Bug Description

In starlingx setup, we're running a nova docker image based on nova stable/stein as of May 6.
We're seeing nova-compute processes stalling and not creating resource providers with placement.
openstack hypervisor list
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP         | State |
+----+---------------------+-----------------+-----------------+-------+
| 5  | worker-1            | QEMU            | 192.168.206.247 | down  |
| 8  | worker-2            | QEMU            | 192.168.206.211 | down  |
+----+---------------------+-----------------+-----------------+-------+

We observe this eventlet-related error in the nova-placement-api logs at the same time:
2019-05-14 00:44:03.636229 Traceback (most recent call last):
2019-05-14 00:44:03.636276   File "/var/lib/openstack/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 460, in fire_timers
2019-05-14 00:44:03.636536     timer()
2019-05-14 00:44:03.636560   File "/var/lib/openstack/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 59, in __call__
2019-05-14 00:44:03.636647     cb(*args, **kw)
2019-05-14 00:44:03.636661   File "/var/lib/openstack/lib/python2.7/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
2019-05-14 00:44:03.636774     waiter.switch()
2019-05-14 00:44:03.636792 error: cannot switch to a different thread

This is new behaviour for us in stable/stein, and we suspect it is due to the merge of an eventlet-related change on May 4:
https://github.com/openstack/nova/commit/6755034e109079fb5e8bbafcd611a919f0884d14

Ghada Khalil (gkhalil) on 2019-05-14
tags: added: stx.distro.openstack
Changed in starlingx:
importance: Undecided → Critical
tags: added: stx.2.0
Ghada Khalil (gkhalil) wrote :

This is a gating/blocking item for the starlingx team. We need a nova fix in the stein branch.

Changed in starlingx:
assignee: nobody → Bruce Jones (brucej)
Matt Riedemann (mriedem) wrote :

https://review.opendev.org/#/c/647310/ was merged after being held out for a while because both Red Hat and Canonical said they hit issues without it and they needed it. If it's causing other side effects then reverting that change is likely not a great option.

What happens if you run with OS_NOVA_DISABLE_EVENTLET_PATCHING=yes as a workaround?
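
As a rough illustration of how an environment-variable switch like this can gate monkey patching, here is a minimal sketch. The function names and the set of accepted values are assumptions made for illustration; only the variable name and the value "yes" come from this thread, and this is not a copy of nova's code.

```python
import os

# Values treated as "disable patching". Only "yes" appears in this thread;
# the other spellings are assumptions made for this sketch.
_TRUTHY = ("1", "true", "yes")


def patching_disabled(environ=None):
    """Return True if the environment asks to skip eventlet monkey patching."""
    environ = os.environ if environ is None else environ
    value = environ.get("OS_NOVA_DISABLE_EVENTLET_PATCHING", "")
    return value.strip().lower() in _TRUTHY


def maybe_patch(environ=None):
    """Monkey patch with eventlet unless the workaround variable is set.

    Returns True if patching was applied, False if it was skipped.
    """
    if patching_disabled(environ):
        return False
    import eventlet  # imported lazily so the disabled path needs no eventlet
    eventlet.monkey_patch()
    return True
```

Because a check like this runs when the application is loaded, the variable would need to be present in the environment of the WSGI server process itself, not just in an interactive shell.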

Gerry Kopec (gerry-kopec) wrote :

Thanks for the suggestion. I can give that a try. For a start, I'll set that for nova placement api process and leave the rest of nova alone.

tags: added: placement
Gerry Kopec (gerry-kopec) wrote :

Using the workaround on nova placement api fixed the problem. Didn't see any side effects. Do we lose anything without monkey patch on placement?

Ghada Khalil (gkhalil) on 2019-05-16
Changed in starlingx:
assignee: Bruce Jones (brucej) → Gerry Kopec (gerry-kopec)
status: New → In Progress
Chris Dent (cdent) wrote :

> Do we lose anything without monkey patch on placement?

No, placement doesn't need eventlet monkey patching and would prefer the package not be imported at all.

I suspect that the reason that this problem has not been widely noticed is because in Stein placement is extracted to its own repo and project and primarily tested using the extracted placement (the nova functional tests use the extracted placement).

nova-placement-api remains around in Stein to ease upgrades (that is you can upgrade everything to Stein, keeping placement in nova, and then once that is done, switch to the extracted placement). Docs for some of that at: https://docs.openstack.org/placement/latest/upgrade/to-stein.html

Placement doesn't intentionally use eventlet anywhere. The placement-in-nova inherits monkey patching as a result of package hierarchies. In the extracted placement that goes away.

Therefore the cleanest, but not necessarily quickest, solution to this problem would be to switch to the extracted placement, which is going to need to happen at some point anyway. Now might be the time to pay the cost?

If there are questions or issues with making that happen, please ask for help on the openstack-discuss list with a [placement] tag and several folk will leap to your aid.

Matt Riedemann (mriedem) wrote :

> Do we lose anything without monkey patch on placement?

As Chris said it won't affect the placement API, but removing the monkey patching in the nova-api will mean that cells get iterated sequentially rather than concurrently in the scatter_gather* routines like when listing servers and migrations. If you only have at most 2 cells (cell0 and cell1 at minimum) then maybe that's acceptable. For someone like CERN with 70+ cells it would not be.
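
To make the trade-off concrete, here is a toy sketch of sequential versus concurrent per-cell queries. It uses the standard library's thread pool as a stand-in for eventlet green threads; query_cell, the cell names, and the delay are hypothetical, and this is not nova's scatter_gather implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_cell(cell, delay=0.05):
    """Hypothetical per-cell database query; sleeps to simulate I/O latency."""
    time.sleep(delay)
    return cell, "servers-in-%s" % cell


def gather_sequential(cells, delay=0.05):
    """Query cells one after another: total time grows with len(cells)."""
    return dict(query_cell(cell, delay) for cell in cells)


def gather_concurrent(cells, delay=0.05):
    """Query all cells at once: total time is roughly one query's latency."""
    with ThreadPoolExecutor(max_workers=max(1, len(cells))) as pool:
        return dict(pool.map(lambda cell: query_cell(cell, delay), cells))


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.monotonic()
    result = func(*args)
    return result, time.monotonic() - start
```

Both functions return the same data; only the wall-clock time differs, and the gap widens with every additional cell.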

sean mooney (sean-k-mooney) wrote :

Note:
We have determined that the nova-api issue is related to using eventlet together with a WSGI server configured to use more than one thread per interpreter instance.

If you are using uWSGI, setting threads=1

https://uwsgi-docs.readthedocs.io/en/latest/Options.html#threads

or, for mod_wsgi,

WSGIDaemonProcess ... processes=<api workers> threads=1

https://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html

will allow nova-api to work correctly with eventlet monkey patching enabled.

When running nova-api under mod_wsgi or uWSGI, parallelism should be managed at the process level, with one thread per process, to avoid this issue.

Note that in this configuration the WSGI server will still suspend the Python interpreter when there are no active API requests, which stops the heartbeat. However, concurrent cell requests will work, and the heartbeat/AMQP connection will automatically be restored when a new request is received.
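
For anyone wanting to check an existing uWSGI deployment against this advice, a small sketch along these lines could read the thread count from an ini file. The function names are hypothetical, and the sketch assumes that an absent threads option means one thread per process.

```python
import configparser


def uwsgi_threads(ini_text):
    """Return the `threads` value from the [uwsgi] section of an ini file.

    Assumes that an absent `threads` option means one thread per process.
    """
    # strict=False tolerates repeated keys, which uWSGI ini files allow;
    # interpolation=None leaves uWSGI's own %(...) placeholders untouched;
    # allow_no_value=True accepts bare option names without "= value".
    parser = configparser.ConfigParser(
        strict=False, interpolation=None, allow_no_value=True)
    parser.read_string(ini_text)
    return parser.getint("uwsgi", "threads", fallback=1)


def single_threaded(ini_text):
    """True when each uWSGI process would run a single thread."""
    return uwsgi_threads(ini_text) == 1
```

A file with processes = 4 and threads = 1 passes the check, while the same file with threads = 8 does not.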

Bruce Jones (brucej) wrote :

Zhipeng is working on a container image for the new separate Placement and will consult with Gerry if needed. We think that moving to the separate Placement will resolve this issue.

melanie witt (melwitt) wrote :

Note that we are assuming from the error message:

  2019-05-14 00:44:03.636792 error: cannot switch to a different thread

that your wsgi app is configured with threads > 1.

Another workaround is to configure the wsgi app with threads=1, which will have each nova-api process use one thread. This workaround is useful if you do not want to serialize multi-cell queries in a large multiple cell deployment by setting OS_NOVA_DISABLE_EVENTLET_PATCHING=yes.

melanie witt (melwitt) wrote :

I've proposed a known issue reno explaining the issue in more detail here:

https://review.opendev.org/662095

Ghada Khalil (gkhalil) on 2019-06-03
Changed in starlingx:
importance: Critical → High
Gerry Kopec (gerry-kopec) wrote :

With the extraction of placement from nova in the starlingx openstack-helm charts via:
https://review.opendev.org/#/c/662371/
https://review.opendev.org/#/c/662614/
we would expect this issue to be resolved.

Gerry Kopec (gerry-kopec) wrote :

For starlingx, the next rebase of stx-nova to the latest stable/stein will contain the upstream commit that triggered the issue. We do not expect to see any problems now that we've removed placement from nova.

Changed in starlingx:
status: In Progress → Fix Released
Bogdan Dobrelya (bogdando) wrote :

So for TripleO, we're about to implement https://bugs.launchpad.net/tripleo/+bug/1829062/comments/7
based on the earlier MPM/event tuning example at https://review.opendev.org/#/c/72666/

Changed in tripleo:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Bogdan Dobrelya (bogdando)
milestone: none → train-2
Bogdan Dobrelya (bogdando) wrote :

Although I'm not sure whether changing the MPM from prefork to event, while keeping the same mapping of multiple processes to a single thread each, would buy us anything with regard to this issue.

Related fix proposed to branch: master
Review: https://review.opendev.org/671321

Reviewed: https://review.opendev.org/671321
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=6fb9d8e6cd48283b551f3580072669622da2006c
Submitter: Zuul
Branch: master

commit 6fb9d8e6cd48283b551f3580072669622da2006c
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Jul 17 17:33:44 2019 +0200

    Allow to configure Apache MPM module

    Defaults to 'prefork', which ensures there is no upgrade/update impact.

    Related-bug: #1829062

    Change-Id: I3deb3e944ed4911962d204357bb3134569f153f6
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in tripleo:
milestone: train-2 → train-3
Chris Dent (cdent) wrote :

A reminder for people still working this. This goes away if you do either of:

* you run mod_wsgi's DaemonProcess with processes=N threads=1
* use the non-nova placement that was extracted from nova in stein (there's no eventlet monkey patching in extracted placement because none of the code uses eventlet)
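
A quick way to audit the first option is to parse the WSGIDaemonProcess directive and look at its threads value. The sketch below is only an illustration: the function names and the nova-api group name are hypothetical, and the default of 15 threads when the option is omitted is taken from the mod_wsgi documentation as I understand it.

```python
import shlex

# Default thread count when `threads` is omitted, per the mod_wsgi docs
# (treated as an assumption in this sketch).
MOD_WSGI_DEFAULT_THREADS = 15


def daemon_process_options(line):
    """Parse one WSGIDaemonProcess line into (group name, options dict)."""
    tokens = shlex.split(line, comments=True)
    if len(tokens) < 2 or tokens[0].lower() != "wsgidaemonprocess":
        raise ValueError("not a WSGIDaemonProcess directive: %r" % line)
    options = {}
    for token in tokens[2:]:
        key, _, value = token.partition("=")
        options[key] = value
    return tokens[1], options


def threads_per_process(line):
    """Return the thread count each daemon process would run with."""
    _, options = daemon_process_options(line)
    return int(options.get("threads", MOD_WSGI_DEFAULT_THREADS))
```

A directive that sets processes but leaves threads unset would therefore still be multi-threaded, which is why threads=1 has to be spelled out.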

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: master
Review: https://review.opendev.org/668862

Reviewed: https://review.opendev.org/671335
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=09cfcc1464dce0eb7c05caf42375290bbaae4199
Submitter: Zuul
Branch: master

commit 09cfcc1464dce0eb7c05caf42375290bbaae4199
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Jul 17 18:38:06 2019 +0200

    Wire-in Apache MPM module parameters and switch it

    Allow to configure Apache MPM module for the containerized API/WSGI'ish
    services running Apache as a backend. Change the default from 'prefork'
    to 'event', which is a low level change and should provide no sensible
    upgrade impact. This alleviates the related heartbeats threading issue
    arising with the monkey-patched eventlet.

    Merge the missing ApacheServiceBase config settings for Octavia API,
    Horizon and Ironix PXE. This is needed to apply the base Apache
    service hiera settings, including MPM module switches, for those
    as well.

    Related-bug: #1829062

    Change-Id: Ia65af7a9d6ae106a61ec52912bebba72830d5f28
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in tripleo:
status: In Progress → Fix Released
tags: added: queens-backport-potential rocky-backport-potential stein-backport-potential
Changed in tripleo:
status: Fix Released → Triaged
assignee: Bogdan Dobrelya (bogdando) → nobody

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/673978

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/673974

Related fix proposed to branch: master
Review: https://review.opendev.org/674589

Changed in tripleo:
milestone: train-3 → ussuri-1

Reviewed: https://review.opendev.org/662095
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a694952eacfe3a2dac34957cf95d5529eb89d4b2
Submitter: Zuul
Branch: stable/stein

commit a694952eacfe3a2dac34957cf95d5529eb89d4b2
Author: melanie witt <email address hidden>
Date: Wed May 29 21:32:11 2019 +0000

    Add reno about nova-api eventlet monkey-patching and rabbitmq

    This adds a known issue release note about eventlet monkey-patching in
    nova-api and workarounds.

    Related-Bug: #1825584
    Related-Bug: #1829062

    Change-Id: I22abd1f5377489dd809eb705c8e7aec2814ced0e

tags: added: in-stable-stein
Changed in tripleo:
milestone: ussuri-1 → ussuri-2

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: master
Review: https://review.opendev.org/674588

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: master
Review: https://review.opendev.org/674589

wes hayutin (weshayutin) on 2020-02-10
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
wes hayutin (weshayutin) on 2020-04-13
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3

Based on Chris' comment above[1] I'm closing this issue on Nova. Since Stein it is not an issue, and Rocky is already in extended maintenance. If somebody wants to fix this in Rocky and older branches, then please set the bug back to New.

[1] https://bugs.launchpad.net/nova/+bug/1829062/comments/19

Changed in nova:
status: New → Won't Fix
wes hayutin (weshayutin) on 2020-05-26
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1