race condition causing lxc to not detect container init process exit
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | Confirmed | Undecided | Unassigned | |
lxc (Ubuntu) | Fix Released | Medium | Unassigned | |
Bug Description
For the purpose of the repro, my lxc init process is node.js v0.11.0 (built from source) with a single line:
process.exit(0);
When running it in lxc, sometimes lxc doesn't exit. lxc-start remains a parent of a defunct node process without reaping it or exiting.
I've made a custom build of lxc 0.9.0 to extract more information about this, adding only an INFO line, as follows:
start.c:
if (ret != sizeof(siginfo)) {
}
+ INFO("got signal %d from pid %d while expecting SIGCHLD(17) from pid %d | uid = %d, status = %d", siginfo.ssi_signo, siginfo.ssi_pid, *pid, siginfo.ssi_uid, siginfo.
if (siginfo.ssi_signo != SIGCHLD) {
}
I've tried this with 3 official kernels. There is one difference in output.
Kernels 3.7.9, 3.8.6:
Successful case:
lxc-start 1365724008.446 NOTICE lxc_start - '/usr/local/
lxc-start 1365724008.446 INFO lxc_console - no console will be used
lxc-start 1365724008.446 INFO lxc_start - got signal 17 from pid 18165 while expecting SIGCHLD(17) from pid 19458 | uid = 0, status = 1
lxc-start 1365724008.446 WARN lxc_start - invalid pid for SIGCHLD
lxc-start 1365724038.306 INFO lxc_start - got signal 17 from pid 19458 while expecting SIGCHLD(17) from pid 19458 | uid = 0, status = 0
lxc-start 1365724038.306 DEBUG lxc_start - container init process exited
Hanging case:
lxc-start 1365795195.358 NOTICE lxc_start - '/usr/local/
lxc-start 1365795195.358 INFO lxc_console - no console will be used
lxc-start 1365795195.358 INFO lxc_start - got signal 17 from pid 8626 while expecting SIGCHLD(17) from pid 8650 | uid = 0, status = 1
lxc-start 1365795195.358 WARN lxc_start - invalid pid for SIGCHLD
lxc-start 1365795333.347 INFO lxc_start - got signal 2 from pid 0 while expecting SIGCHLD(17) from pid 8650 | uid = 0, status = 0
lxc-start 1365795333.347 INFO lxc_start - forwarded signal 2 to pid 8650
Kernel 3.9.0-rc6:
Successful case is the same, but the hanging case changes to just:
lxc-start 1365794343.870 NOTICE lxc_start - '/usr/local/
lxc-start 1365794343.870 INFO lxc_console - no console will be used
lxc-start 1365794343.870 INFO lxc_start - got signal 17 from pid 2851 while expecting SIGCHLD(17) from pid 3432 | uid = 0, status = 1
lxc-start 1365794343.870 WARN lxc_start - invalid pid for SIGCHLD
... without forwarding signal 2 (SIGINT).
Notes:
- I'm on Mint 14 Nadia with raring packages, if that helps.
- In all cases, there is signal 17 (SIGCHLD) coming in to lxc-start, but it comes from a different pid and is ignored by lxc. Any idea what this could be? This process seems to have been cleaned up and no longer appears in ps aux.
- The lxc-start process should be getting notified with a SIGCHLD from the child's pid when the child (init process) exits.
- This could be a kernel bug, but it's probably something unique that lxc is doing to trigger it.
- I've tried other init processes (node.js without the process.exit, and a custom C++ app that writes to stdout and exits with 0), which greatly reduce the frequency of this happening.
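To make the notes above concrete, here is a minimal sketch of one way a parent can wait for a specific child through a signalfd without depending on which pid a given SIGCHLD reports. It is not the lxc code and not the fix that was released. It relies on one documented property: SIGCHLD is a standard signal, so several pending instances are merged and a single read may stand for more than one exited child. Whether that merging is what happens inside lxc-start here is an assumption. The names wait_for_child and demo_fast_exit are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: block on sigfd until `expected` has exited.
 * Returns the child's exit code, or -1 on error or abnormal termination. */
static int wait_for_child(int sigfd, pid_t expected)
{
    struct signalfd_siginfo si;
    int status;

    assert(expected > 0);
    for (;;) {
        /* Ask the kernel directly about the child we care about. This
         * also covers the case where its SIGCHLD was merged with one
         * from another child, or was consumed on an earlier pass. */
        pid_t r = waitpid(expected, &status, WNOHANG);
        if (r == expected)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r < 0)
            return -1;

        /* Child still running: sleep until the next signal arrives.
         * si.ssi_pid is deliberately not compared with `expected`. */
        ssize_t n = read(sigfd, &si, sizeof(si));
        if (n < 0 && errno == EINTR)
            continue;
        if (n != (ssize_t)sizeof(si))
            return -1;
    }
}

/* Hypothetical demo: a short-lived helper child (exit code 1) and a
 * fast-exiting "init" child (exit code 0), as in the logs above. */
int demo_fast_exit(void)
{
    sigset_t mask, old;

    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    /* Block SIGCHLD before forking so it can only arrive via the fd. */
    if (sigprocmask(SIG_BLOCK, &mask, &old) < 0)
        return -1;

    int sigfd = signalfd(-1, &mask, SFD_CLOEXEC);
    if (sigfd < 0)
        return -1;

    pid_t helper = fork();
    if (helper == 0)
        _exit(1);
    if (helper < 0)
        return -1;

    pid_t init = fork();
    if (init == 0)
        _exit(0);
    if (init < 0)
        return -1;

    int rc = wait_for_child(sigfd, init);

    waitpid(helper, NULL, 0);   /* reap the helper as well */
    close(sigfd);
    sigprocmask(SIG_SETMASK, &old, NULL);
    return rc;
}
```

Calling demo_fast_exit() from a main() should return the init child's exit code. The design choice is to treat every SIGCHLD as a hint to call waitpid() on the expected pid, rather than as proof of which child exited.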
Changed in lxc (Ubuntu):
status: Incomplete → Confirmed
Quoting Pavel Bennett (<email address hidden>):
> Public bug reported:
>
> For the purpose of the repro, my lxc init process is node.js v0.11.0
> (built from source) with a single line:
>
> process.exit(0);
Thanks. This is a race condition we've seen before in lxc-execute.
Precisely which version of lxc were you using? The patch
0001-fix-race-with-fast-init
in Ubuntu raring's recent lxc packages is meant to work around this,
but we don't have a clean fix yet.
> Notes:
> - I'm on Mint 14 Nadia with raring packages, if that helps.
> - In all cases, there is signal 17 (SIGCHLD) coming in to lxc-start, but it comes from a different pid and is ignored by lxc. Any idea what this could be? This process seems to have been cleaned up and no longer appears in ps aux.
That SIGCHLD comes from the test for kernel support for reboot in a
pidns. It's not related to this.