plainbox extcmd can deadlock in certain cases

Bug #1382433 reported by Zygmunt Krynicki on 2014-10-17
This bug affects 4 people
Affects Status Importance Assigned to Milestone
PlainBox (Toolkit)

Bug Description

We've found that the extcmd module that we use to run external commands can deadlock in certain cases.

An "easy" reproduction case is to run a process that backgrounds (ssh -M -f -N). Extcmd logging seems to indicate that IO threads don't quit as they still linger, sucking pipes that aren't closed.

Manually forcing the pipes closed seems to block (?). I'll write a reproduction case and update this bug

Zygmunt Krynicki (zyga) on 2014-10-17
Changed in plainbox-provider-checkbox:
status: New → Confirmed
importance: Undecided → Critical
affects: plainbox-provider-checkbox → plainbox
Changed in plainbox:
milestone: none → 0.15
assignee: nobody → Zygmunt Krynicki (zkrynicki)
Zygmunt Krynicki (zyga) on 2014-10-17
description: updated
Revision history for this message
Zygmunt Krynicki (zyga) wrote :

I'm still working on this. The EOD prototype that watches another process reliably:

Revision history for this message
Zygmunt Krynicki (zyga) wrote :

The reproduction test case with the new code (it also hangs but that's deliberate^Hnot addressed)

Clearly, the problem is that ssh inherits stdout/stderr pipes when started in -M -N -f mode. This is (partially) a ssh bug partially a desired feature as it wants to pop up interactive dialogs / console prompts in certain cases *after* it starts up.

The solution is to make sure we do something useful after we see SIGCHLD (which the new code *does* see)

Changed in plainbox:
status: Confirmed → In Progress
Revision history for this message
Zygmunt Krynicki (zyga) wrote :

The current approach, that so far works good on all the pathological cases is to do this.

The general idea is that we're in a I/O loop, using non-blocking I/O and a select-like mechanism to wait for wait for:
 - timeout (optional, new feature)
 - read side of the stdout pipe data
 - read side of the stdout pipe being closed
 - read side of the stderr pipe data
 - read side of the stderr pipe being closed
 - SIGCHLD being delivered with the intent to say that the process is dead

In general we keep looping and terminate onlu when the set of waited things (stdout depleted, stderr depleted, process terminated) is empty. This is not always true so see below. The action that we do on each is event is obviously different:

If the timeout has elapsed we proceed to send SIGTERM, reset the timer for shutdown period, followed by SIGQUIT and another timer reset. After that we send SIGKILL. This can fail as the process may have elevated itself beyond our capabilities. This is still undecided but perhaps, at this time, we should use an elevated process manager (see below). If we fail to terminate the process special provisions apply (see below).

If we have data to read we just do and process that (send to log files, process, send to .record.gz). This is a point where we can optimize the process and improve reliability in event of sudden system crash. Using more modern facilities we can implement tee in kernel space which lowers processing burden on python and, in general, makes it more likely that the log files will see actual output the process made just prior to its death.

We can also use pipes in O_DIRECT (aka packet mode) here to ensure that all writes() end up as individual records, which is the indented design of the I/O log record concept. This won't address the inherent buffering that is enabled in all programs that detect when they are redirected and no longer attached to a tty.

Whenever one of the pipes is depleted (which may *never* happen, lesson learned) we just close our side.

When the child dies, and this is the most important part and the actual bugfix, we do the following sequence of events:
 - if we still have stdout pipe open, read at most one PIPE_BUF. We cannot read more as the pipe may live on forever and we can just hang as we currently do. Reading one PIPE_BUF ensures that we catch the last moments of what the originally started process intended to tell us. Then we close the pipe. This will likely result in SIGPIPE in any processes that are still attached to it though we have no guarantee that it will rally kill them as that singal can be blocked.
 - if we still have stderr pipe open we follow the same logic as for stdout above.
 - we restore some signal handling that was blocked during the execution of the loop and terminate.

There's one more trick up our sleeve and that is PR_SET_CHILD_SUBREAPER but I'll describe that in a separate bug report that deals with runaway processes. Think dbus-launch or anything that double-forks and deamonizes

Revision history for this message
Jerry Kao (jerry.kao) wrote :

suspend/cycle_resolutions_after_suspend can not be executed properly

This issue can be seen on HP Z802-2 and 400 G2 test with stella-datong 140729 image.
kernel 3.13.0-34

NVIDIA Corporation GK106GL [Quadro K4000] [10de:11fa]
NVIDIA Corporation GF119 [NVS 310] [10de:107d]

tags: added: ce-qa-concern
Changed in plainbox:
milestone: 0.15 → 0.16
Changed in plainbox:
milestone: 0.16 → 0.17
Revision history for this message
Daniel Lin (danielyhlin) wrote :

Manual test /Testing resolution cycling for cedar[Radeon HD 7350] can not be executed properly

This issue can be seen on bDT Murdrad2 test with stella-datong140729-trusty-amd64-20140925-3.
bios:L02 v00.22
kernel 3.13.0-34

Advanced Micro Devices, Inc. [AMD/ATI] Cedar [Radeon HD 7350/8350]

Revision history for this message
Daniel Lin (danielyhlin) wrote :

Plainbox version as follows:

ii plainbox 0.16~ppa~ubuntu14.04.1 all toolkit for software and hardware integration testing
ii plainbox-glmark2-es2-meta 0.11~ppa~ubuntu14.04.1 amd64 metapackage to selectively install glmark2-es2
ii plainbox-insecure-policy 0.16~ppa~ubuntu14.04.1 all policykit policy required to use plainbox (insecure version)
ii plainbox-provider-certification-client 0.11~ppa~ubuntu14.04.1 all Client Certification
ii plainbox-provider-checkbox 0.15~ppa2~ubuntu14.04.1 amd64 CheckBox provider for PlainBox
ii plainbox-provider-oem 0.1ubuntu31 all plainbox oem provider
ii plainbox-provider-oem-stella 0.1ubuntu31 all plainbox oem provider for stella
ii plainbox-provider-resource-generic 0.13~ppa~ubuntu14.04.1 amd64 CheckBox generic resource jobs provider
ii python3-plainbox 0.16~ppa~ubuntu14.04.1 all toolkit for software and hardware tes

Changed in plainbox:
milestone: 0.17 → none
Zygmunt Krynicki (zyga) on 2014-12-12
Changed in plainbox:
milestone: none → 0.18
status: In Progress → Fix Committed
Revision history for this message
Zygmunt Krynicki (zyga) wrote :

I've reverted the status as this bug wasn't really fixed (as the fix cannot be used on checkbox-gui/checkbox-touch)

Changed in plainbox:
status: Fix Committed → In Progress
milestone: 0.18 → none
Zygmunt Krynicki (zyga) on 2015-04-09
Changed in plainbox:
assignee: Zygmunt Krynicki (zyga) → nobody
Zygmunt Krynicki (zyga) on 2015-04-16
Changed in plainbox:
status: In Progress → Confirmed
milestone: none → future
Revision history for this message
Zygmunt Krynicki (zyga) wrote :

This bug is still important but not as much as before. We need to return to the extcmd layer one day and discard all but one implementation while fixing all the issues we've identified since.

Changed in plainbox:
importance: Critical → High
tags: added: extcmd
Changed in plainbox:
status: Confirmed → Won't Fix
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers