Race condition in signal handling on Python 3

Bug #1705047 reported by Victor Stinner on 2017-07-18
Affects: oslo.service
Importance: High
Assigned to: Zane Bitter

Bug Description

Signal handling in oslo_service suffers from a race condition on Python 3.5 if the signal is received while eventlet sleeps in its hub, for example while it waits for events in the C function epoll_wait().

Python 3.5 changed how Python handles UNIX signals; see PEP 475:
https://www.python.org/dev/peps/pep-0475/

Let's say that eventlet is waiting for events: it sleeps in the epoll hub, blocked in the C epoll_wait() function.

Suddenly, a SIGTERM signal is received.

On Python 2:

* epoll_wait() is interrupted immediately and fails with EINTR errno
* eventlet catches IOError(errno=EINTR) and ignores it
* Python detects that it got a signal and calls the first oslo_service Python handler which schedules a greenthread using eventlet.spawn()
* Back in eventlet, the hub now has ready greenthreads and so runs them
* The greenthread runs: it is the second and final oslo_service signal handler

On Python 3:

* epoll_wait() is interrupted immediately and fails with EINTR errno
* epoll_wait() calls PyErr_CheckSignals()
* PyErr_CheckSignals() calls the first oslo_service Python handler
* The Python signal handler schedules a greenthread using eventlet.spawn()
* PyErr_CheckSignals() completes with no exception...
* ... pyepoll_poll() restarts the interrupted epoll_wait()
* epoll_wait() polls for events ... which are not going to happen... the application is stuck!
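
The Python 3 sequence above can be observed without eventlet. The following is a minimal sketch, not the attached bug.py: it assumes a UNIX system, uses select() on an idle pipe in place of the epoll hub, and uses SIGALRM in place of SIGTERM so the script can signal itself. The names first_handler, scheduled and poll_with_signal are made up for the illustration.

```python
import os
import select
import signal
import time

scheduled = []  # stands in for the greenthreads queued by eventlet.spawn()


def first_handler(signum, frame):
    # Records the work and returns normally, like the first oslo_service
    # handler; nothing here makes the blocked poll return.
    scheduled.append(signum)


def poll_with_signal(timeout=0.5, signal_after=0.1):
    """Block in select() while a signal arrives; return the time spent."""
    read_fd, write_fd = os.pipe()  # never written to: no I/O event occurs
    previous = signal.signal(signal.SIGALRM, first_handler)
    signal.setitimer(signal.ITIMER_REAL, signal_after)
    start = time.monotonic()
    try:
        # On Python 3.5+ this call is restarted after the handler returns.
        select.select([read_fd], [], [], timeout)
        return time.monotonic() - start
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
        os.close(read_fd)
        os.close(write_fd)


elapsed = poll_with_signal()
print("handler ran: %s, blocked for %.2fs" % (bool(scheduled), elapsed))
```

On Python 3.5 and later the handler runs shortly after the signal arrives, yet select() only returns once the full timeout has expired. With no timeout, as in an idle hub, it would never return.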

The fix is to use an internal pipe in oslo_service to make sure that the first Python signal handler of oslo_service wakes up the event loop. I'm writing a fix for oslo_service.
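
A minimal sketch of the internal pipe idea, assuming a UNIX system and plain select() rather than the eventlet hub. This is not the patch under review; wake_r, wake_w, first_handler and wait_for_events are hypothetical names, and SIGALRM stands in for SIGTERM so the script can signal itself.

```python
import os
import select
import signal
import time

# Internal pipe: the read end is polled together with the other descriptors.
wake_r, wake_w = os.pipe()
os.set_blocking(wake_r, False)
os.set_blocking(wake_w, False)

scheduled = []  # stands in for the greenthreads queued by the handler


def first_handler(signum, frame):
    scheduled.append(signum)
    try:
        os.write(wake_w, b"\0")  # makes wake_r readable
    except BlockingIOError:
        pass  # pipe is full, so a wakeup is already pending


def wait_for_events(timeout):
    """Poll the wakeup pipe; return how long the call blocked."""
    start = time.monotonic()
    readable, _, _ = select.select([wake_r], [], [], timeout)
    if wake_r in readable:
        try:
            os.read(wake_r, 4096)  # drain the wakeup bytes
        except BlockingIOError:
            pass
    return time.monotonic() - start


previous = signal.signal(signal.SIGALRM, first_handler)
signal.setitimer(signal.ITIMER_REAL, 0.1)
try:
    elapsed = wait_for_events(timeout=2.0)
finally:
    signal.setitimer(signal.ITIMER_REAL, 0)
    signal.signal(signal.SIGALRM, previous)

print("woke up after %.2fs" % elapsed)
```

Even though Python restarts the interrupted select(), the restarted call returns at once because the handler has made the pipe readable, so the loop gets a chance to run whatever the handler scheduled.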

Attached bug.py is a simple Python script using eventlet to reproduce the bug.

Victor Stinner (victor-stinner) wrote :

This bug impacts Neutron functional tests. My script to reproduce the bug:

export envdir=/opt/stack/neutron/.tox/dsvm-functional-python35
OS_SUDO_TESTING=1 OS_ROOTWRAP_CMD="sudo ${envdir}/bin/neutron-rootwrap ${envdir}/etc/neutron/rootwrap.conf" OS_ROOTWRAP_DAEMON_CMD="sudo ${envdir}/bin/neutron-rootwrap-daemon ${envdir}/etc/neutron/rootwrap.conf" OS_TEST_PATH=./neutron/tests/functional python3.5 -m testtools.run neutron.tests.functional.test_server.TestPluginWorker.test_start

On Python 2.7, the functional test passes. On Python 3.5, it fails with a timeout.

This bug is part of the Python 3 community effort, so I am raising the priority to High.

Changed in oslo.service:
importance: Undecided → High
status: New → Confirmed

Fix proposed to branch: master
Review: https://review.openstack.org/488421

Changed in oslo.service:
assignee: nobody → Victor Stinner (victor-stinner)
status: Confirmed → In Progress

Change abandoned by Victor Stinner (<email address hidden>) on branch: master
Review: https://review.openstack.org/488421

Fix proposed to branch: master
Review: https://review.openstack.org/566714

Changed in oslo.service:
assignee: Victor Stinner (victor-stinner) → Zane Bitter (zaneb)

Reviewed: https://review.openstack.org/566714
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=cad75e4e139f734a5138d37ceafa6be169ff4e47
Submitter: Zuul
Branch: master

commit cad75e4e139f734a5138d37ceafa6be169ff4e47
Author: Zane Bitter <email address hidden>
Date: Fri May 4 15:04:52 2018 -0400

    Python 3: Fix eventlet wakeup after signal

    With the implementation of PEP 475 in Python 3.5, system calls that fail
    with EINTR are automatically restarted. This means that even when creating
    greenthreads in a signal handler, eventlet will not wake up until it
    sees activity on the file descriptors it is polling or until the next
    timer goes off. This can cause the process to appear to hang.

    To work around this, raise an exception when exiting the interrupt
    handler if the signal occurred while eventlet was polling. This
    exception will then be raised by the library call instead of retrying,
    and eventlet will catch it using the same logic it uses in Python 2 for
    handling EINTR.

    Do the same when the signal occurred while eventlet was sleeping. (It
    calls sleep() instead of poll() when there are no file descriptors to
    poll, only timers.) To emulate Python 2 behaviour, which is to interrupt
    sleep but *not* raise an exception, wrap the sleep call to catch and
    ignore EINTR exceptions. This is necessary because eventlet doesn't
    attempt to catch any exceptions from sleep() like it does from poll().

    Change-Id: Ic7ac8244b804784dd60f87ba411ce3236ea1bd90
    Closes-Bug: #1705047
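
The approach described in the commit message can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the oslo.service code: the handler here raises unconditionally, whereas the commit raises only when the signal arrived while eventlet was polling or sleeping. The names interrupting_handler and interruptible_sleep are hypothetical, and SIGALRM stands in for the real signal.

```python
import errno
import signal
import time


def interrupting_handler(signum, frame):
    # An exception raised by the handler propagates out of the interrupted
    # call, so Python does not retry it (PEP 475 only retries when the
    # handler returns normally).
    raise IOError(errno.EINTR, "Interrupted system call")


def interruptible_sleep(seconds):
    """Sleep, but return early and silently if a signal interrupts it."""
    try:
        time.sleep(seconds)
    except IOError as exc:
        if exc.errno != errno.EINTR:
            raise  # unrelated error: do not swallow it


previous = signal.signal(signal.SIGALRM, interrupting_handler)
signal.setitimer(signal.ITIMER_REAL, 0.1)
start = time.monotonic()
try:
    interruptible_sleep(2.0)
    elapsed = time.monotonic() - start
finally:
    signal.setitimer(signal.ITIMER_REAL, 0)
    signal.signal(signal.SIGALRM, previous)

print("sleep returned after %.2fs" % elapsed)
```

The wrapper mirrors the Python 2 behaviour mentioned in the commit message: the sleep is cut short by the signal, but the caller sees no exception.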

Changed in oslo.service:
status: In Progress → Fix Released

This issue was fixed in the openstack/oslo.service 1.31.2 release.

Reviewed: https://review.openstack.org/624006
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=159ef2e1d26f25a5d7a0514d5155f3c74c4a8a86
Submitter: Zuul
Branch: master

commit 159ef2e1d26f25a5d7a0514d5155f3c74c4a8a86
Author: Zane Bitter <email address hidden>
Date: Mon Dec 10 19:42:30 2018 +1300

    Restore correct signal handling in Python3

    The patch 2ee3894f49f315e35abff968f54ae72e5480e892 broke the original
    fix cad75e4e139f734a5138d37ceafa6be169ff4e47 that ensured eventlet could
    be interrupted while sleeping after PEP475 was implemented in Python
    3.5. Eventlet monkey-patches the signal module with its own version, so
    we have to look up the original module to determine whether the
    underlying OS actually supports the poll() function.

    Change-Id: Ia712c9a83d8081bf0b5e6fe36f169f9028aae3dc
    Closes-Bug: #1803731
    Related-Bug: #1788022
    Related-Bug: #1705047

Reviewed: https://review.openstack.org/626398
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=d1295d45eedbae7f524026af207c8196a8925fa9
Submitter: Zuul
Branch: stable/rocky

commit d1295d45eedbae7f524026af207c8196a8925fa9
Author: Zane Bitter <email address hidden>
Date: Mon Dec 10 19:42:30 2018 +1300

    Restore correct signal handling in Python3

    The patch c8c8946d3281ef65de70b03dd198b6be6257fa7f broke the original
    fix cad75e4e139f734a5138d37ceafa6be169ff4e47 that ensured eventlet could
    be interrupted while sleeping after PEP475 was implemented in Python
    3.5. Eventlet monkey-patches the signal module with its own version, so
    we have to look up the original module to determine whether the
    underlying OS actually supports the poll() function.

    Change-Id: Ia712c9a83d8081bf0b5e6fe36f169f9028aae3dc
    Closes-Bug: #1803731
    Related-Bug: #1788022
    Related-Bug: #1705047
    (cherry picked from commit 159ef2e1d26f25a5d7a0514d5155f3c74c4a8a86)

tags: added: in-stable-rocky