AsyncProcess.stop() can lead to deadlock

Bug #1506021 reported by John Schwarz
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: John Schwarz

Bug Description

The bug occurs when calling stop() on an AsyncProcess instance that is running a process which generates substantial amounts of output to stdout/stderr and which has a signal handler for some signal (SIGTERM, for example) that causes the program to exit gracefully.

Linux Pipes 101: when calling write() on a one-way pipe that is already full of data [1], write() will block until the other end read()s from the pipe.
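The finite pipe capacity is easy to observe directly. A minimal sketch (not neutron code): set a pipe's write end to non-blocking and count how many bytes fit before the kernel would block the writer.

```python
import fcntl
import os

# Create a plain one-way pipe and make the write end non-blocking, so a
# full pipe raises BlockingIOError instead of blocking the writer.
r, w = os.pipe()
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

written = 0
try:
    while True:
        # Keep stuffing data in until the kernel reports the pipe is full.
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    pass

# On Linux the default capacity is 64 KiB (see pipe(7), "Pipe capacity").
print("pipe filled after %d bytes" % written)
os.close(r)
os.close(w)
```

With a blocking write end (the default, and what subprocess.PIPE gives you), that last write() would simply hang instead of raising, which is exactly the half of the deadlock the subprocess ends up in.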

AsyncProcess uses eventlet.green.subprocess to create an eventlet-safe subprocess with stdout=subprocess.PIPE and stderr=subprocess.PIPE. In other words, stdout and stderr are redirected through one-way Linux pipes back to the process running AsyncProcess. When stopping the subprocess, the current code [2] first kills the readers used to empty stdout/stderr and only then sends the signal.

It is clear that if SIGTERM is sent to the subprocess, and the subprocess generates a lot of output to stdout/stderr AFTER the readers were killed, a deadlock results: the parent process blocks on wait() and the subprocess blocks on write(), waiting for someone to read from the pipe and empty it.

This can be avoided by sending SIGKILL to the AsyncProcesses (the code's current default), but other signals such as SIGTERM, which userspace code can handle to exit gracefully, might trigger this deadlock. For example, I ran into this while modifying existing fullstack tests to send SIGTERM to processes instead of SIGKILL, and the ovs agent deadlocked frequently.
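The safe ordering can be sketched in plain Python. This is a hypothetical standalone demo, not neutron's actual code: the child simulates a process whose SIGTERM handler floods stdout on exit, and the parent keeps draining the pipe until EOF before wait()ing, instead of tearing the reader down first.

```python
import signal
import subprocess
import sys

# Child: installs a SIGTERM handler that writes far more than a pipe can
# hold (1 MiB vs. the 64 KiB default capacity) before exiting gracefully.
child_src = r"""
import os, signal, sys

def on_term(signum, frame):
    os.write(1, b"x" * (1 << 20))  # would deadlock if nobody reads
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)
os.write(1, b"ready\n")
signal.pause()
"""

proc = subprocess.Popen([sys.executable, "-c", child_src],
                        stdout=subprocess.PIPE)
proc.stdout.readline()            # wait until the handler is installed
proc.send_signal(signal.SIGTERM)

# Drain stdout until EOF *before* calling wait(); stopping the reader
# first and then wait()ing is exactly what produced the deadlock.
drained = 0
while True:
    chunk = proc.stdout.read(65536)
    if not chunk:
        break
    drained += len(chunk)
proc.wait()
print("child exited %d, drained %d bytes" % (proc.returncode, drained))
```

If the parent skipped the drain loop and went straight to wait(), the child's handler would block on write() once the pipe filled, and both sides would hang, which is the scenario described above.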

[1]: http://linux.die.net/man/7/pipe (Section called "Pipe capacity")
[2]: https://github.com/openstack/neutron/blob/stable/liberty/neutron/agent/linux/async_process.py#L163

John Schwarz (jschwarz)
Changed in neutron:
assignee: nobody → John Schwarz (jschwarz)
Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
tags: added: liberty-rc-potential
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Nice write up, I think that matches what you were seeing, and makes perfect sense.

Revision history for this message
John Schwarz (jschwarz) wrote : Re: [Bug 1506021] Re: AsyncProcess.stop() can lead to deadlock

Thanks a lot, Miguel :)

On 10/14/2015 03:21 PM, Miguel Angel Ajo wrote:
> Nice write up, I think that matches what you were seeing, and makes
> perfect sense.
>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/234770

Changed in neutron:
status: Confirmed → In Progress
Akihiro Motoki (amotoki)
tags: added: liberty-backport-potential
removed: liberty-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/234770
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ddaee9f060c8fd1585d3e06e04eb301518ff90e0
Submitter: Jenkins
Branch: master

commit ddaee9f060c8fd1585d3e06e04eb301518ff90e0
Author: John Schwarz <email address hidden>
Date: Wed Oct 14 15:39:33 2015 +0300

    Keep reading stdout/stderr until after kill

    Currently, when calling AsyncProcess.stop(), the code stops the stdout
    and stderr readers and kills the process. There exists an end case (as
    described in the bug report) in which after the readers have been
    stopped the sub-process will generate a substantial amount of outputs to
    either fd. Since the 'subprocess' module is launched with
    subprocess.PIPE as stdout/stderr, and since Linux's pipes can be filled
    to the point where writing new data to them will block, this may cause a
    deadlock if the sub-process has a signal handler for the signal (for
    example, the process is handling SIGTERM to produce a graceful exit of
    the program).

    Therefore, this patch proposes to only kill the readers until AFTER
    wait() returned and the process truly died. Also, relying on _kill_event
    had to cease since invoking its send() method caused a logical loop back
    to _kill, causing eventlet errors.

    A different possible solution is closing the stdout/stderr pipes. Alas,
    this may raise an exception in the sub-process ("what? No stdout?!
    Crash!") and defeats the 'graceful' part of the process.

    Closes-Bug: #1506021
    Change-Id: I506c41c634a8d656d81a8ad7963412b834bdfa5b

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b1

This issue was fixed in the openstack/neutron 8.0.0.0b1 development milestone.

Changed in neutron:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/272682

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/272682
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a133de310035ba60f67d45a20007f24638274c92
Submitter: Jenkins
Branch: stable/liberty

commit a133de310035ba60f67d45a20007f24638274c92
Author: John Schwarz <email address hidden>
Date: Wed Oct 14 15:39:33 2015 +0300

    Keep reading stdout/stderr until after kill

    Currently, when calling AsyncProcess.stop(), the code stops the stdout
    and stderr readers and kills the process. There exists an end case (as
    described in the bug report) in which after the readers have been
    stopped the sub-process will generate a substantial amount of outputs to
    either fd. Since the 'subprocess' module is launched with
    subprocess.PIPE as stdout/stderr, and since Linux's pipes can be filled
    to the point where writing new data to them will block, this may cause a
    deadlock if the sub-process has a signal handler for the signal (for
    example, the process is handling SIGTERM to produce a graceful exit of
    the program).

    Therefore, this patch proposes to only kill the readers until AFTER
    wait() returned and the process truly died. Also, relying on _kill_event
    had to cease since invoking its send() method caused a logical loop back
    to _kill, causing eventlet errors.

    A different possible solution is closing the stdout/stderr pipes. Alas,
    this may raise an exception in the sub-process ("what? No stdout?!
    Crash!") and defeats the 'graceful' part of the process.

    Closes-Bug: #1506021
    Change-Id: I506c41c634a8d656d81a8ad7963412b834bdfa5b
    (cherry picked from commit ddaee9f060c8fd1585d3e06e04eb301518ff90e0)

tags: added: in-stable-liberty
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.0.3

This issue was fixed in the openstack/neutron 7.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/277347
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=306636e64dc884484aad772dd6994299d30c2a06
Submitter: Jenkins
Branch: stable/liberty

commit 306636e64dc884484aad772dd6994299d30c2a06
Author: Hong Hui Xiao <email address hidden>
Date: Tue Nov 24 09:01:48 2015 -0500

    Wait for the watch process in test case

    Because the _watch_process and the failing_process are asynchronous,
    there might be a chance that failing_process exit and _watch_process
    is not executed.

    If the _watch_process is blocked, the method that will be asserted
    will not be called. This will fail the UT, but it is intermittent.

    Change-Id: Ic951c1b91c5a10462f548544a5e8d482c52ad665
    Closes-Bug: #1519160
    Related-Bug: #1543040
    Related Bug: #1506021
    (cherry picked from commit dcd0498c17ae860c55a92f9f31b3e3fa0460b78e)

tags: removed: liberty-backport-potential