eventlet not waking after signal on py3 (and test_child_signal_sighup takes >60s/times out)

Bug #1803731 reported by Corey Bryant on 2018-11-16
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
oslo.service
Medium
Zane Bitter

Bug Description

It takes oslo_service.tests.test_service.ServiceLauncherTest.test_child_signal_sighup approximately 500 times longer to run on Python 3 than Python2.

Here are some more details on test runs: https://paste.ubuntu.com/p/JtMSTsmHfW/

Notice the runtime for py35:

Test id Runtime (s)
---------------------------------------------------------------------------- -----------
oslo_service.tests.test_service.ServiceLauncherTest.test_child_signal_sighup 60.047

And py27:

Test id Runtime (s)
---------------------------------------------------------------------------- -----------
oslo_service.tests.test_service.ServiceLauncherTest.test_child_signal_sighup 0.127

summary: - py3: it takes 60 seconds to run
- oslo_service.tests.test_service.ServiceLauncherTest.test_child_signal_sighup
+ py3: test_child_signal_sighup takes 500 times longer to run on py3
summary: - py3: test_child_signal_sighup takes 500 times longer to run on py3
+ py3: test_child_signal_sighup takes ~500 times longer to run on py3
Ben Nemec (bnemec) on 2018-11-27
Changed in oslo.service:
status: New → Confirmed
importance: Undecided → Medium

Fun fact: it also uses 100% of a CPU core during those 60s.

Python 3.4 is not affected, which suggests an issue with the fix for bug 1705047, but the problem appears to predate that patch (although strangely I can no longer reproduce the bug that patch fixed, even on py35 with eventlet 0.20.0).

The issue also occurs on stable/rocky, with eventlet 0.20.0.

Prior to https://review.openstack.org/599032 there were similar issues with the oslo_service.tests.test_service.EventletServerServiceLauncherTest.test_graceful_stop_with_exceeded_graceful_shutdown_timeout and oslo_service.tests.test_service.EventletServerServiceLauncherTest.test_shuts_down_on_sigint_when_client_connected tests, which suggests that maybe that fix was incomplete?

Zane Bitter (zaneb) on 2018-12-10
Changed in oslo.service:
assignee: nobody → Zane Bitter (zaneb)
Zane Bitter (zaneb) wrote :

The fix for bug 1788022 broke the fix for bug 1705047 - we check for the presence of the select.poll() function to avoid adding EINTR exceptions on Windows, but neglect to take into account the fact that eventlet monkey-patches the select module with its own version that also does not include the poll() function. So the extra handling to interrupt system calls that was supposed to be added in Python 3.5 and later was disabled - at least in the unit tests, and probably in real use in most cases, though possibly depending on the order of imports and such.

This, combined with a programming error in the unit test which causes it not to wait for anything to happen after sending SIGHUP to a child worker, caused the unit test to take 60s. This is because the cleanup code for the test sends SIGTERM to the service process immediately, and the child also receives the SIGTERM before it has had the chance to process the SIGHUP (since it is still sleeping and the sleep cannot be interrupted to schedule the handler greenthread due to PEP475).

The _reload_service() handler clears all signal handler callbacks, with the result that when the SIGTERM handler is scheduled (next time the timer task runs, since that is the next time that eventlet awakes from sleep), it doesn't do anything. The parent process waits for the child to gracefully shutdown and finally exits with SIGALRM after graceful_shutdown_timeout, which is <drum roll> 60s.

Fix proposed to branch: master
Review: https://review.openstack.org/624006

Changed in oslo.service:
status: Confirmed → In Progress
Zane Bitter (zaneb) on 2018-12-11
summary: - py3: test_child_signal_sighup takes ~500 times longer to run on py3
+ eventlet not waking after signal on py3 (and test_child_signal_sighup
+ takes >60s/times out)

Reviewed: https://review.openstack.org/624006
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=159ef2e1d26f25a5d7a0514d5155f3c74c4a8a86
Submitter: Zuul
Branch: master

commit 159ef2e1d26f25a5d7a0514d5155f3c74c4a8a86
Author: Zane Bitter <email address hidden>
Date: Mon Dec 10 19:42:30 2018 +1300

    Restore correct signal handling in Python3

    The patch 2ee3894f49f315e35abff968f54ae72e5480e892 broke the original
    fix cad75e4e139f734a5138d37ceafa6be169ff4e47 that ensured eventlet could
    be interrupted while sleeping after PEP475 was implemented in Python
    3.5. Eventlet monkey-patches the signal module with its own version, so
    we have to look up the original module to determine whether the
    underlying OS actually supports the poll() function.

    Change-Id: Ia712c9a83d8081bf0b5e6fe36f169f9028aae3dc
    Closes-Bug: #1803731
    Related-Bug: #1788022
    Related-Bug: #1705047

Changed in oslo.service:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/624007
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=130e49feac02e1cb3c13a82efe5bb2c120717c92
Submitter: Zuul
Branch: master

commit 130e49feac02e1cb3c13a82efe5bb2c120717c92
Author: Zane Bitter <email address hidden>
Date: Mon Dec 10 20:38:25 2018 +1300

    Actually test child SIGHUP signal

    The intention of this test was to wait 5s after sending SIGHUP to a
    child process to make sure that it doesn't exit. However, due to a logic
    error, it just stopped checking and declared success immediately. Fix
    the logic so that we have a better chance of seeing if SIGHUP
    incorrectly kills the process.

    Change-Id: I1f320a8dfdd7a922b461d070491ad53e6cd2b20d
    Related-Bug: #1803731

This issue was fixed in the openstack/oslo.service 1.34.0 release.

Reviewed: https://review.openstack.org/626398
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=d1295d45eedbae7f524026af207c8196a8925fa9
Submitter: Zuul
Branch: stable/rocky

commit d1295d45eedbae7f524026af207c8196a8925fa9
Author: Zane Bitter <email address hidden>
Date: Mon Dec 10 19:42:30 2018 +1300

    Restore correct signal handling in Python3

    The patch c8c8946d3281ef65de70b03dd198b6be6257fa7f broke the original
    fix cad75e4e139f734a5138d37ceafa6be169ff4e47 that ensured eventlet could
    be interrupted while sleeping after PEP475 was implemented in Python
    3.5. Eventlet monkey-patches the signal module with its own version, so
    we have to look up the original module to determine whether the
    underlying OS actually supports the poll() function.

    Change-Id: Ia712c9a83d8081bf0b5e6fe36f169f9028aae3dc
    Closes-Bug: #1803731
    Related-Bug: #1788022
    Related-Bug: #1705047
    (cherry picked from commit 159ef2e1d26f25a5d7a0514d5155f3c74c4a8a86)

tags: added: in-stable-rocky

This issue was fixed in the openstack/oslo.service 1.31.8 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers