eventlet errors and cannot switch to a different thread

Bug #1694509 reported by Clark Boylan
This bug affects 4 people

Affects  Status        Importance  Assigned to  Milestone
Cinder   Fix Released  High        Unassigned
tooz     Invalid       High        Unassigned

Bug Description

During the deletion of a volume, eventlet can raise an error saying it is unable to switch threads. We also see that the volume fails to delete properly (according to tempest), so these two things are possibly related.

From: http://logs.openstack.org/28/461628/10/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/cf6c9e9/logs/screen-c-vol.txt.gz#_May_26_20_44_32_284561

May 26 20:44:32.276580 ubuntu-xenial-rax-dfw-9013276 cinder-volume[27348]: INFO cinder.volume.manager [req-34715675-437c-4668-8f27-ee0f5f10cabb tempest-VolumesDeleteCascade-157659443 None] Delete snapshot completed successfully.
May 26 20:44:32.283596 ubuntu-xenial-rax-dfw-9013276 cinder-volume[27348]: Traceback (most recent call last):
May 26 20:44:32.283764 ubuntu-xenial-rax-dfw-9013276 cinder-volume[27348]: File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
May 26 20:44:32.283899 ubuntu-xenial-rax-dfw-9013276 cinder-volume[27348]: timer()
May 26 20:44:32.284031 ubuntu-xenial-rax-dfw-9013276 cinder-volume[27348]: File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
May 26 20:44:32.284161 ubuntu-xenial-rax-dfw-9013276 cinder-volume[27348]: cb(*args, **kw)
May 26 20:44:32.284293 ubuntu-xenial-rax-dfw-9013276 cinder-volume[27348]: File "/usr/local/lib/python2.7/dist-packages/eventlet/semaphore.py", line 147, in _do_acquire
May 26 20:44:32.284429 ubuntu-xenial-rax-dfw-9013276 cinder-volume[27348]: waiter.switch()
May 26 20:44:32.284561 ubuntu-xenial-rax-dfw-9013276 cinder-volume[27348]: error: cannot switch to a different thread

Filing this bug so that we can track the situation with e-r, which should help us debug further (understanding that right now this is fairly sparse on details).
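The "cannot switch to a different thread" message comes from the greenlet library that eventlet is built on: a greenlet is bound to the OS thread that created it and may only be switched to from that thread. A minimal sketch of how the error arises (assuming the greenlet package, a dependency of eventlet, is installed; this is illustrative, not cinder code):

```python
import threading

import greenlet

# A greenlet is bound to the OS thread that created it;
# this one belongs to the main thread.
gr = greenlet.greenlet(lambda: None)

errors = []

def switch_from_other_thread():
    # Switching to a greenlet from a different OS thread raises
    # greenlet.error, which is what eventlet's hub surfaces in the
    # cinder-volume traceback above.
    try:
        gr.switch()
    except greenlet.error as exc:
        errors.append(str(exc))

t = threading.Thread(target=switch_from_other_thread)
t.start()
t.join()
```

In the traceback above, the switch attempt happens inside eventlet's semaphore `_do_acquire` (`waiter.switch()`), so the error surfaces from a timer fired by the hub rather than from user code directly.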

Ihar Hrachyshka (ihar-hrachyshka)
Changed in cinder:
status: New → Confirmed
Matt Riedemann (mriedem) wrote :

Yeah I've seen this twice this week:

http://logs.openstack.org/49/483149/2/check/gate-novaclient-dsvm-functional-neutron-ubuntu-xenial/d37b26f/logs/screen-c-vol.txt.gz#_Jul_13_20_07_49_514341

Apparently it happens a lot:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22error%3A%20cannot%20switch%20to%20a%20different%20thread%5C%22%20AND%20tags%3A%5C%22screen-c-vol.txt%5C%22&from=7d

but it's not always correlated with a test failure. I have seen cases where a volume fails to be created, or at least fails to report that it's created, so tests just time out and fail.

Matt Riedemann (mriedem) wrote :

logstash shows this happening since 7/13 - did something merge recently or did we get a new library version in upper-constraints that might cause this?

Matt Riedemann (mriedem) wrote :

I take that back, looks like it's spiking more around 7/11.

Changed in cinder:
importance: Undecided → High
Matt Riedemann (mriedem) wrote :

Not sure if this would be related but in other jobs where volumes are being created from an image and then just randomly don't complete, I got a GMR and see stuff like this:

http://logs.openstack.org/00/469800/31/check/gate-tempest-dsvm-cells-ubuntu-xenial/7156eed/logs/screen-c-vol.txt.gz#_Jul_14_13_59_21_489470

Jul 14 13:59:21.486855 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: /usr/local/lib/python2.7/dist-packages/taskflow/engines/action_engine/scheduler.py:70 in schedule
Jul 14 13:59:21.486975 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: `return self._task_action.schedule_execution(task)`
Jul 14 13:59:21.487088 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: /usr/local/lib/python2.7/dist-packages/taskflow/engines/action_engine/actions/task.py:113 in schedule_execution
Jul 14 13:59:21.487199 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: `progress_callback=progress_callback)`
Jul 14 13:59:21.487311 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: /usr/local/lib/python2.7/dist-packages/taskflow/engines/action_engine/executor.py:150 in execute_task
Jul 14 13:59:21.487422 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: `progress_callback=progress_callback)`
Jul 14 13:59:21.487593 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: /usr/local/lib/python2.7/dist-packages/futurist/_futures.py:298 in submit
Jul 14 13:59:21.487718 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: `return self._gatherer.submit(fn, *args, **kwargs)`
Jul 14 13:59:21.487840 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: /usr/local/lib/python2.7/dist-packages/futurist/_futures.py:95 in submit
Jul 14 13:59:21.487957 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: `fut = self._submit_func(fn, *args, **kwargs)`
Jul 14 13:59:21.488134 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: /usr/local/lib/python2.7/dist-packages/futurist/_futures.py:302 in _submit
Jul 14 13:59:21.488306 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: `self._run_work_func(_utils.WorkItem(fut, fn, args, kwargs))`
Jul 14 13:59:21.488421 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: /usr/local/lib/python2.7/dist-packages/futurist/_futures.py:243 in <lambda>
Jul 14 13:59:21.488532 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: `def __init__(self, green=False, run_work_func=lambda work: work.run()):`
Jul 14 13:59:21.488663 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: /usr/local/lib/python2.7/dist-packages/futurist/_utils.py:47 in run
Jul 14 13:59:21.488794 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: `result = self.fn(*self.args, **self.kwargs)`
Jul 14 13:59:21.488913 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: /usr/local/lib/python2.7/dist-packages/taskflow/engines/action_engine/executor.py:53 in _execute_task
Jul 14 13:59:21.489110 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: `result = task.execute(**arguments)`
Jul 14 13:59:21.489240 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: /opt/stack/new/cinder/cinder/volume/flows/manager/create_volume.py:916 in execute
Jul 14 13:59:21.489353 ubuntu-xenial-rax-ord-9846339 cinder-volume[20607]: `**volume_spec)`
Jul 14 13:59:21.4...


Matt Riedemann (mriedem) wrote :

This cinder.coordination module could also be suspect; it's in that GMR traceback and it uses eventlet and tooz:

https://github.com/openstack/cinder/blob/master/cinder/coordination.py

And tooz was also updated in upper-constraints on 7/11:

https://review.openstack.org/#/c/482272/

To use version 1.57.4.

Matt Riedemann (mriedem) wrote :

This cinder.coordination module is using etcd3 as the tooz backend too:

http://logs.openstack.org/49/483149/2/check/gate-novaclient-dsvm-functional-neutron-ubuntu-xenial/d37b26f/logs/screen-c-vol.txt.gz#_Jul_13_19_55_02_134315

Jul 13 19:55:02.134315 ubuntu-xenial-rax-ord-9829148 cinder-volume[29593]: DEBUG oslo_service.service [None req-ae11374e-2a1d-4b0d-be11-3d777d39cc6d None None] coordination.backend_url = etcd3+http://104.130.222.125:2379 {{(pid=29593) log_opt_values /usr/local/lib/python2.7/dist-packages/oslo_config/cfg.py:2883}}

There is also this suspect change:

https://github.com/openstack/tooz/commit/7987f4455a7ff4e75d64c46a42d50017f53bfc9e

Matt Riedemann (mriedem)
Changed in python-tooz:
status: New → Confirmed
importance: Undecided → High
Changed in cinder:
status: Confirmed → Invalid
Julien Danjou (jdanjou) wrote :

Are you using etcd3:// in tooz?

Eventlet monkeypatching does not work with gRPC; this has already been discussed:

http://lists.openstack.org/pipermail/openstack-dev/2017-March/114427.html

ChangBo Guo(gcb) (glongwave) wrote :

From Matt's comment #10, it's using etcd3. What should we do from the tooz side, log a warning when etcd3 is used?

Julien Danjou (jdanjou) wrote :

Since https://review.openstack.org/#/c/478202/2/tooz/drivers/etcd3gw.py is using a lock, I imagine there's a deadlock going on because eventlet might switch out of e.g. refresh() (which does I/O in the heartbeat thread) and into some other thread?

I don't understand what "error: cannot switch to a different thread" means in Eventlet context. If somebody can enlighten me, that might help!

ChangBo Guo(gcb) (glongwave) wrote :

We will cut tooz's final release for Pike this week. Having talked with dims, we will revert the potentially risky patches until we figure out the root cause:
https://review.openstack.org/485035
https://review.openstack.org/485036

Julien Danjou (jdanjou) wrote :

Another random idea: it'd be good to know if it's a problem only triggered by etcd3+http or if it's with other drivers too (e.g. redis).

ChangBo Guo(gcb) (glongwave) wrote :

We need to discuss with the Cinder team before reverting patches.

Mehdi Abaakouk (sileht) wrote :

Cinder runs the tooz heartbeat in a real thread (with tpool.execute()) instead of a greenthread:

https://github.com/openstack/cinder/blob/master/cinder/coordination.py#L90

So anything that may call the eventlet hub.switch() (monkeypatched I/O, lock stuff, spawn, ...) during the tooz heartbeat() can trigger the bug, whatever driver is used.

The lock recently added to the etcd3gw driver in tooz just makes it happen a lot more often.

https://review.openstack.org/#/c/447079/ should fix the issue.
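A sketch of the distinction Mehdi describes, using the greenlet package directly rather than cinder/tooz code (the names here are illustrative, not the actual implementation): driving a heartbeat greenlet from its own OS thread works, while driving it from a real thread (the tpool-style pattern) raises the error from this bug.

```python
import threading

import greenlet

hub = greenlet.getcurrent()  # stand-in for eventlet's hub greenlet

def heartbeat():
    # Stand-in for tooz's heartbeat(): monkeypatched I/O or a lock would
    # yield cooperatively back to the hub, i.e. call hub.switch().
    hub.switch()

hb = greenlet.greenlet(heartbeat)

# Fine: driving the heartbeat greenlet from the thread that created it.
hb.switch()

# Broken (tpool-style): driving it from a real OS thread fails, because a
# greenlet can only be switched to from its owning thread.
failures = []

def drive_from_real_thread():
    try:
        hb.switch()
    except greenlet.error as exc:
        failures.append(str(exc))

t = threading.Thread(target=drive_from_real_thread)
t.start()
t.join()
```

This is why the bug is driver-independent: any heartbeat body that ends up calling hub.switch() from the real thread hits the same greenlet.error.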

Mehdi Abaakouk (sileht)
Changed in python-tooz:
status: Confirmed → Invalid
Changed in cinder:
status: Invalid → Confirmed
Clark Boylan (cboylan) wrote :

Before marking this invalid in tooz, you might consider updating https://docs.openstack.org/tooz/latest/user/tutorial/coordinator.html to discourage running your own heartbeat() calls, as there are known issues with doing so when mixing eventlet greenthreads and real threads.

Mehdi Abaakouk (sileht) wrote :

Tooz has no issue.

The limitations of mixing threads and greenthreads are well documented on the eventlet side:

http://eventlet.net/doc/threading.html

Clark Boylan (cboylan) wrote :

The issue is in the tooz docs (which I linked), which do not document this problem. The docs should be updated.

Matt Riedemann (mriedem)
Changed in cinder:
status: Confirmed → Fix Released