nova placement api non-responsive due to eventlet error

Bug #1829062 reported by Gerry Kopec on 2019-05-14
This bug affects 1 person
Affects: OpenStack Compute (nova) | Importance: Undecided | Assigned to: Unassigned
Affects: StarlingX | Importance: High | Assigned to: Gerry Kopec

Bug Description

In a StarlingX setup, we're running a nova Docker image based on nova stable/stein as of May 6.
We're seeing nova-compute processes stalling and not creating resource providers with placement.
openstack hypervisor list
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 5 | worker-1 | QEMU | 192.168.206.247 | down |
| 8 | worker-2 | QEMU | 192.168.206.211 | down |
+----+---------------------+-----------------+-----------------+-------+

We observe this eventlet-related error in the nova-placement-api logs at the same time:
2019-05-14 00:44:03.636229 Traceback (most recent call last):
2019-05-14 00:44:03.636276 File "/var/lib/openstack/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 460, in fire_timers
2019-05-14 00:44:03.636536 timer()
2019-05-14 00:44:03.636560 File "/var/lib/openstack/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 59, in __call__
2019-05-14 00:44:03.636647 cb(*args, **kw)
2019-05-14 00:44:03.636661 File "/var/lib/openstack/lib/python2.7/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
2019-05-14 00:44:03.636774 waiter.switch()
2019-05-14 00:44:03.636792 error: cannot switch to a different thread

This is new behaviour for us in stable/stein, and we suspect it is due to the merge of an eventlet-related change on May 4:
https://github.com/openstack/nova/commit/6755034e109079fb5e8bbafcd611a919f0884d14

Ghada Khalil (gkhalil) on 2019-05-14
tags: added: stx.distro.openstack
Changed in starlingx:
importance: Undecided → Critical
tags: added: stx.2.0
Ghada Khalil (gkhalil) wrote :

This is a gating/blocking item for the starlingx team. We need a nova fix in the stein branch.

Changed in starlingx:
assignee: nobody → Bruce Jones (brucej)
Matt Riedemann (mriedem) wrote :

https://review.opendev.org/#/c/647310/ was merged after being held out for a while because both Red Hat and Canonical said they hit issues without it and they needed it. If it's causing other side effects, then reverting that change is likely not a great option.

What happens if you run with OS_NOVA_DISABLE_EVENTLET_PATCHING=yes as a workaround?
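
As a rough illustration of how an environment-variable switch like this can gate monkey patching, here is a minimal sketch. The function name should_monkey_patch and the set of values treated as "disable" are assumptions for illustration, not nova's actual code. The real patching call (eventlet.monkey_patch()) appears only in a comment so the sketch runs without eventlet installed.

```python
import os

# Assumed set of values treated as "disable"; the values nova really
# accepts may differ.
_DISABLE_VALUES = ("1", "true", "yes")


def should_monkey_patch(environ=None):
    """Return True unless the disable variable is set to a truthy value."""
    if environ is None:
        environ = os.environ
    value = environ.get("OS_NOVA_DISABLE_EVENTLET_PATCHING", "")
    return value.strip().lower() not in _DISABLE_VALUES


if __name__ == "__main__":
    if should_monkey_patch():
        # A real service would call eventlet.monkey_patch() at this point.
        print("eventlet monkey patching would be applied")
    else:
        print("eventlet monkey patching skipped")
```

The variable has to be set in the environment of the WSGI process itself (for example in the container or service definition), since the check happens at import time, before any request is served.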

Gerry Kopec (gerry-kopec) wrote :

Thanks for the suggestion. I can give that a try. For a start, I'll set that for the nova placement API process and leave the rest of nova alone.

tags: added: placement
Gerry Kopec (gerry-kopec) wrote :

Using the workaround on nova placement api fixed the problem. Didn't see any side effects. Do we lose anything without monkey patch on placement?

Ghada Khalil (gkhalil) on 2019-05-16
Changed in starlingx:
assignee: Bruce Jones (brucej) → Gerry Kopec (gerry-kopec)
status: New → In Progress
Chris Dent (cdent) wrote :

> Do we lose anything without monkey patch on placement?

No, placement doesn't need eventlet monkey patching and would prefer the package not be imported at all.

I suspect that the reason that this problem has not been widely noticed is because in Stein placement is extracted to its own repo and project and primarily tested using the extracted placement (the nova functional tests use the extracted placement).

nova-placement-api remains around in Stein to ease upgrades (that is you can upgrade everything to Stein, keeping placement in nova, and then once that is done, switch to the extracted placement). Docs for some of that at: https://docs.openstack.org/placement/latest/upgrade/to-stein.html

Placement doesn't intentionally use eventlet anywhere. The placement-in-nova inherits monkey patching as a result of package hierarchies. In the extracted placement that goes away.

Therefore the cleanest, but not necessarily quickest, solution to this problem would be to switch to the extracted placement, which is going to need to happen at some point anyway. Now might be the time to pay the cost?

If there are questions or issues with making that happen, please ask for help on the openstack-discuss list with a [placement] tag and several folk will leap to your aid.

Matt Riedemann (mriedem) wrote :

> Do we lose anything without monkey patch on placement?

As Chris said it won't affect the placement API, but removing the monkey patching in the nova-api will mean that cells get iterated sequentially rather than concurrently in the scatter_gather* routines like when listing servers and migrations. If you only have at most 2 cells (cell0 and cell1 at minimum) then maybe that's acceptable. For someone like CERN with 70+ cells it would not be.
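
To make the sequential versus concurrent difference concrete, here is a small sketch. It is not nova's scatter_gather code: query_cell is a hypothetical stand-in for a per-cell database query, and stdlib threads are used in place of eventlet greenthreads purely so the example is self-contained.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_cell(cell_name, delay=0.05):
    """Hypothetical stand-in for a per-cell database query."""
    time.sleep(delay)  # simulate I/O latency against the cell database
    return cell_name, ["server-in-" + cell_name]


def gather_sequential(cells):
    """Query each cell one after another (total time grows with cell count)."""
    return dict(query_cell(cell) for cell in cells)


def gather_concurrent(cells):
    """Query all cells at once and collect the results keyed by cell."""
    with ThreadPoolExecutor(max_workers=len(cells)) as pool:
        return dict(pool.map(query_cell, cells))


cells = ["cell0", "cell1", "cell2", "cell3"]

start = time.monotonic()
sequential_result = gather_sequential(cells)
sequential_elapsed = time.monotonic() - start

start = time.monotonic()
concurrent_result = gather_concurrent(cells)
concurrent_elapsed = time.monotonic() - start

print("sequential: %.2fs, concurrent: %.2fs"
      % (sequential_elapsed, concurrent_elapsed))
```

Both functions return the same data; only the wall-clock time differs, and the gap widens as the number of cells grows, which is why the sequential fallback matters little with two cells and a great deal with 70+.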

sean mooney (sean-k-mooney) wrote :

Note:
We have determined that the nova-api issue is related to using eventlet together with a WSGI server configured to use more than one thread per interpreter instance.

If you are using uwsgi, setting threads=1

https://uwsgi-docs.readthedocs.io/en/latest/Options.html#threads

or for mod_wsgi

WSGIDaemonProcess ... processes=<api workers> threads=1

https://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html

will allow the nova-api to work correctly with eventlet monkey patching enabled.

When running the nova-api under mod_wsgi or uwsgi, parallelism should be managed at the process level, with one thread per process, to avoid this issue.

Note that in this configuration the WSGI server will still suspend the Python interpreter, stopping the heartbeat, when there are no active API requests. It will, however, allow concurrent cell requests, and the heartbeat/AMQP connection will automatically be restored when a new request is received.
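
A quick way to audit an existing deployment for this is to check the threads setting in the uwsgi ini file. The sketch below is an illustration only: the helper names and the sample file contents are hypothetical, and it assumes that a uwsgi config with no threads option runs one thread per process.

```python
import configparser

# Hypothetical uwsgi ini contents, used only for this example.
SAMPLE_INI = """
[uwsgi]
processes = 4
threads = 2
"""


def threads_per_process(ini_text, section="uwsgi"):
    """Return the configured threads value, or 1 if the option is absent."""
    # strict=False tolerates repeated options, which uwsgi ini files may use.
    parser = configparser.ConfigParser(strict=False)
    parser.read_string(ini_text)
    # Assumption: with no "threads" option, each process is single-threaded.
    return parser.getint(section, "threads", fallback=1)


def is_safe_for_eventlet(ini_text):
    """True when each process runs a single thread, as recommended above."""
    return threads_per_process(ini_text) <= 1


print("threads per process:", threads_per_process(SAMPLE_INI))
print("safe with eventlet monkey patching:", is_safe_for_eventlet(SAMPLE_INI))
```

For the sample above the check reports an unsafe configuration; changing threads to 1 and scaling with processes instead would pass it.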

Bruce Jones (brucej) wrote :

Zhipeng is working on a container image for the new separate Placement and will consult with Gerry if needed. We think that moving to the separate Placement will resolve this issue.

melanie witt (melwitt) wrote :

Note that we are assuming from the error message:

  2019-05-14 00:44:03.636792 error: cannot switch to a different thread

that your wsgi app is configured with threads > 1.

Another workaround is to configure the wsgi app with threads=1, which will have each nova-api process use one thread. This workaround is useful if you do not want to set OS_NOVA_DISABLE_EVENTLET_PATCHING=yes, which would serialize multi-cell queries in a large multi-cell deployment.

melanie witt (melwitt) wrote :

I've proposed a known issue reno explaining the issue in more detail here:

https://review.opendev.org/662095

Ghada Khalil (gkhalil) on 2019-06-03
Changed in starlingx:
importance: Critical → High