AsyncProcess.stop() can lead to deadlock

Bug #1506021 reported by John Schwarz
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: John Schwarz

Bug Description

The bug occurs when calling stop() on an AsyncProcess instance that is running a process which generates substantial amounts of output to stdout/stderr and which has a signal handler for some signal (SIGTERM, for example) that causes the program to exit gracefully.

Linux Pipes 101: when calling write() on a one-way pipe that is already full of data [1], write() will block until the other end read()s from the pipe.
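The finite pipe capacity is easy to observe directly. A minimal sketch (not neutron code): set a pipe's write end to non-blocking and count how many bytes fit before the kernel would block the writer.

```python
import fcntl
import os

# Create a plain one-way pipe and make the write end non-blocking, so a
# full pipe raises BlockingIOError instead of blocking the writer.
r, w = os.pipe()
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

written = 0
try:
    while True:
        # Keep stuffing data in until the kernel reports the pipe is full.
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    pass

# On Linux the default capacity is 64 KiB (see pipe(7), "Pipe capacity").
print("pipe filled after %d bytes" % written)
os.close(r)
os.close(w)
```

With a blocking write end (the default, and what subprocess.PIPE gives you), that last write() would simply hang instead of raising, which is exactly the half of the deadlock the subprocess ends up in.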

AsyncProcess uses eventlet.green.subprocess to create an eventlet-safe subprocess with stdout=subprocess.PIPE and stderr=subprocess.PIPE. In other words, stdout and stderr are redirected through one-way Linux pipes back to the process running AsyncProcess. When stopping the subprocess, the current code [2] first kills the readers used to empty stdout/stderr and only then sends the signal.

It is clear that if SIGTERM is sent to the subprocess, and the subprocess generates a lot of output to stdout/stderr AFTER the readers were killed, a deadlock results: the parent process blocks on wait() and the subprocess blocks on write(), waiting for someone to read from the pipe and empty it.

This can be avoided by sending SIGKILL to the AsyncProcesses (the code's current default), but other signals such as SIGTERM, which userspace code can handle to exit gracefully, might trigger this deadlock. For example, I ran into this while modifying existing fullstack tests to send SIGTERM to processes instead of SIGKILL, and the ovs agent deadlocked frequently.
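The safe ordering can be sketched in plain Python. This is a hypothetical standalone demo, not neutron's actual code: the child simulates a process whose SIGTERM handler floods stdout on exit, and the parent keeps draining the pipe until EOF before wait()ing, instead of tearing the reader down first.

```python
import signal
import subprocess
import sys

# Child: installs a SIGTERM handler that writes far more than a pipe can
# hold (1 MiB vs. the 64 KiB default capacity) before exiting gracefully.
child_src = r"""
import os, signal, sys

def on_term(signum, frame):
    os.write(1, b"x" * (1 << 20))  # would deadlock if nobody reads
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)
os.write(1, b"ready\n")
signal.pause()
"""

proc = subprocess.Popen([sys.executable, "-c", child_src],
                        stdout=subprocess.PIPE)
proc.stdout.readline()            # wait until the handler is installed
proc.send_signal(signal.SIGTERM)

# Drain stdout until EOF *before* calling wait(); stopping the reader
# first and then wait()ing is exactly what produced the deadlock.
drained = 0
while True:
    chunk = proc.stdout.read(65536)
    if not chunk:
        break
    drained += len(chunk)
proc.wait()
print("child exited %d, drained %d bytes" % (proc.returncode, drained))
```

If the parent skipped the drain loop and went straight to wait(), the child's handler would block on write() once the pipe filled, and both sides would hang, which is the scenario described above.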

[1]: http://linux.die.net/man/7/pipe (Section called "Pipe capacity")
[2]: https://github.com/openstack/neutron/blob/stable/liberty/neutron/agent/linux/async_process.py#L163

John Schwarz (jschwarz)
Changed in neutron:
assignee: nobody → John Schwarz (jschwarz)
Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
tags: added: liberty-rc-potential
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Nice write up, I think that matches what you were seeing, and makes perfect sense.

Revision history for this message
John Schwarz (jschwarz) wrote : Re: [Bug 1506021] Re: AsyncProcess.stop() can lead to deadlock

Thanks a lot, Miguel :)

On 10/14/2015 03:21 PM, Miguel Angel Ajo wrote:
> Nice write up, I think that matches what you were seeing, and makes
> perfect sense.
>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/234770

Changed in neutron:
status: Confirmed → In Progress
Akihiro Motoki (amotoki)
tags: added: liberty-backport-potential
removed: liberty-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/234770
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ddaee9f060c8fd1585d3e06e04eb301518ff90e0
Submitter: Jenkins
Branch: master

commit ddaee9f060c8fd1585d3e06e04eb301518ff90e0
Author: John Schwarz <email address hidden>
Date: Wed Oct 14 15:39:33 2015 +0300

    Keep reading stdout/stderr until after kill

    Currently, when calling AsyncProcess.stop(), the code stops the stdout
    and stderr readers and kills the process. There exists an end case (as
    described in the bug report) in which after the readers have been
    stopped the sub-process will generate a substantial amount of outputs to
    either fd. Since the 'subprocess' module is launched with
    subprocess.PIPE as stdout/stderr, and since Linux's pipes can be filled
    to the point where writing new data to them will block, this may cause a
    deadlock if the sub-process has a signal handler for the signal (for
    example, the process is handling SIGTERM to produce a graceful exit of
    the program).

    Therefore, this patch proposes to only kill the readers until AFTER
    wait() returned and the process truly died. Also, relying on _kill_event
    had to cease since invoking its send() method caused a logical loop back
    to _kill, causing eventlet errors.

    A different possible solution is closing the stdout/stderr pipes. Alas,
    this may raise an exception in the sub-process ("what? No stdout?!
    Crash!") and defeats the 'graceful' part of the process.

    Closes-Bug: #1506021
    Change-Id: I506c41c634a8d656d81a8ad7963412b834bdfa5b

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b1

This issue was fixed in the openstack/neutron 8.0.0.0b1 development milestone.

Changed in neutron:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/272682

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/272682
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a133de310035ba60f67d45a20007f24638274c92
Submitter: Jenkins
Branch: stable/liberty

commit a133de310035ba60f67d45a20007f24638274c92
Author: John Schwarz <email address hidden>
Date: Wed Oct 14 15:39:33 2015 +0300

    Keep reading stdout/stderr until after kill

    Currently, when calling AsyncProcess.stop(), the code stops the stdout
    and stderr readers and kills the process. There exists an end case (as
    described in the bug report) in which after the readers have been
    stopped the sub-process will generate a substantial amount of outputs to
    either fd. Since the 'subprocess' module is launched with
    subprocess.PIPE as stdout/stderr, and since Linux's pipes can be filled
    to the point where writing new data to them will block, this may cause a
    deadlock if the sub-process has a signal handler for the signal (for
    example, the process is handling SIGTERM to produce a graceful exit of
    the program).

    Therefore, this patch proposes to only kill the readers until AFTER
    wait() returned and the process truly died. Also, relying on _kill_event
    had to cease since invoking its send() method caused a logical loop back
    to _kill, causing eventlet errors.

    A different possible solution is closing the stdout/stderr pipes. Alas,
    this may raise an exception in the sub-process ("what? No stdout?!
    Crash!") and defeats the 'graceful' part of the process.

    Closes-Bug: #1506021
    Change-Id: I506c41c634a8d656d81a8ad7963412b834bdfa5b
    (cherry picked from commit ddaee9f060c8fd1585d3e06e04eb301518ff90e0)

tags: added: in-stable-liberty
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.0.3

This issue was fixed in the openstack/neutron 7.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/277347
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=306636e64dc884484aad772dd6994299d30c2a06
Submitter: Jenkins
Branch: stable/liberty

commit 306636e64dc884484aad772dd6994299d30c2a06
Author: Hong Hui Xiao <email address hidden>
Date: Tue Nov 24 09:01:48 2015 -0500

    Wait for the watch process in test case

    Because the _watch_process and the failing_process are asynchronous,
    there might be a chance that failing_process exit and _watch_process
    is not executed.

    If the _watch_process is blocked, the method that will be asserted
    will not be called. This will fail the UT, but it is intermittent.

    Change-Id: Ic951c1b91c5a10462f548544a5e8d482c52ad665
    Closes-Bug: #1519160
    Related-Bug: #1543040
    Related Bug: #1506021
    (cherry picked from commit dcd0498c17ae860c55a92f9f31b3e3fa0460b78e)

tags: removed: liberty-backport-potential