Race condition in signal handling on Python 3
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
oslo.service | Fix Released | High | Zane Bitter |
Bug Description
Signal handling in oslo_service suffers from a race condition on Python 3.5 when the signal is received while eventlet sleeps in its hub, for example while it waits for events in the C function epoll_wait().
Python 3.5 changed how interrupted system calls are handled when a UNIX signal arrives; see PEP 475:
https:/
Let's say that eventlet is waiting for events: it sleeps in the epoll hub, blocked in the C epoll_wait() function.
Suddenly, a SIGTERM signal is received.
On Python 2:
* epoll_wait() is interrupted immediately and fails with EINTR errno
* eventlet catches IOError(
* Python detects that it got a signal and calls the first oslo_service Python handler which schedules a greenthread using eventlet.spawn()
* Back in eventlet, eventlet now has ready greenthreads and so runs them
* The greenthread runs: it is the second and final oslo_service signal handler
On Python 3:
* epoll_wait() is interrupted immediately and fails with EINTR errno
* epoll_wait() calls PyErr_CheckSign
* PyErr_CheckSign
* The Python signal handler schedules a greenthread using eventlet.spawn()
* PyErr_CheckSign
* ... pyepoll_poll() restarts the interrupted epoll_wait()
* epoll_wait() polls for events ... which are not going to happen... the application is stuck!
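The Python 3 behaviour can be observed without eventlet. The following is a minimal sketch using only the standard library, assuming a UNIX system and Python 3.5 or later; it uses select() and SIGALRM as stand-ins for epoll_wait() and SIGTERM, and the function name wait_with_signal is made up for this illustration. The handler only records the signal, like a handler that schedules work without waking up the event loop.

```python
import os
import select
import signal
import time


def wait_with_signal(timeout=0.5, signal_delay=0.1):
    """Block in select() while SIGALRM arrives.

    Returns (number of handler calls, elapsed seconds, ready fds).
    """
    calls = []

    def handler(signum, frame):
        # Stand-in for the first signal handler: it records the signal
        # but does nothing that would wake up the blocked wait.
        calls.append(signum)

    read_fd, write_fd = os.pipe()  # nothing is ever written to this pipe
    previous = signal.signal(signal.SIGALRM, handler)
    try:
        # Deliver SIGALRM while select() is blocked.
        signal.setitimer(signal.ITIMER_REAL, signal_delay)
        start = time.monotonic()
        # With PEP 475, the interrupted call is retried with the remaining
        # timeout because the handler did not raise an exception.
        ready, _, _ = select.select([read_fd], [], [], timeout)
        elapsed = time.monotonic() - start
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(
            signal.SIGALRM,
            previous if previous is not None else signal.SIG_DFL,
        )
        os.close(read_fd)
        os.close(write_fd)
    return len(calls), elapsed, ready
```

The handler is called once, yet select() still waits for the whole timeout and reports nothing ready. With an infinite timeout, as in an idle event loop, the wait would never return.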
The fix is to use an internal pipe in oslo_service to make sure that the first Python signal handler of oslo_service wakes up the event loop. I'm writing a fix for oslo_service.
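The idea of the internal pipe can be sketched as follows. This is not the oslo_service patch, only an illustration of the technique with the standard library, assuming a UNIX system and Python 3.5 or later; WakeupPipe and wait_with_wakeup_pipe are hypothetical names, and select() and SIGALRM again stand in for epoll_wait() and SIGTERM.

```python
import os
import select
import signal
import time


class WakeupPipe:
    """Hypothetical helper: a pipe written to by the signal handler."""

    def __init__(self):
        self.read_fd, self.write_fd = os.pipe()
        # Non-blocking on both ends so the handler can never block.
        os.set_blocking(self.read_fd, False)
        os.set_blocking(self.write_fd, False)

    def on_signal(self, signum, frame):
        try:
            # Writing one byte makes the read end readable, so a retried
            # wait returns immediately instead of sleeping again.
            os.write(self.write_fd, b"\0")
        except BlockingIOError:
            pass  # pipe is full: a wakeup is already pending

    def wait(self, timeout):
        """Wait until a signal wrote to the pipe; return True if woken."""
        ready, _, _ = select.select([self.read_fd], [], [], timeout)
        if ready:
            try:
                os.read(self.read_fd, 4096)  # drain pending wakeup bytes
            except BlockingIOError:
                pass
        return bool(ready)

    def close(self):
        os.close(self.read_fd)
        os.close(self.write_fd)


def wait_with_wakeup_pipe(timeout=5.0, signal_delay=0.1):
    """Return (woken, elapsed seconds) for a wait interrupted by SIGALRM."""
    pipe = WakeupPipe()
    previous = signal.signal(signal.SIGALRM, pipe.on_signal)
    try:
        signal.setitimer(signal.ITIMER_REAL, signal_delay)
        start = time.monotonic()
        woken = pipe.wait(timeout)
        elapsed = time.monotonic() - start
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(
            signal.SIGALRM,
            previous if previous is not None else signal.SIG_DFL,
        )
        pipe.close()
    return woken, elapsed
```

Here the wait returns shortly after the signal instead of sleeping for the full timeout: the system call is still retried, but the retried call finds the pipe readable. In a real event loop the read end of the pipe would be registered with the hub so that the loop wakes up and runs the scheduled greenthread.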
Attached bug.py is a simple Python script using eventlet to reproduce the bug.
This bug impacts Neutron functional tests. My script to reproduce the bug:
export envdir=/opt/stack/neutron/.tox/dsvm-functional-python35
OS_SUDO_TESTING=1 OS_ROOTWRAP_CMD="sudo ${envdir}/bin/neutron-rootwrap ${envdir}/etc/neutron/rootwrap.conf" OS_ROOTWRAP_DAEMON_CMD="sudo ${envdir}/bin/neutron-rootwrap-daemon ${envdir}/etc/neutron/rootwrap.conf" OS_TEST_PATH=./neutron/tests/functional python3.5 -m testtools.run neutron.tests.functional.test_server.TestPluginWorker.test_start
On Python 2.7, the functional test passes. On Python 3.5, it fails with a timeout.