lxc-stop and even lxc-stop -k can hang

Bug #1261338 reported by Vincent Ladeuil
This bug affects 3 people
Affects: lxc (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

lxc-stop, and even 'lxc-stop -k -t <timeout>', can hang :-/

The context is an automated test run with otto using an LXC container, triggering a kernel crash on the host.

The host is still alive and so is the container, but a catch-all that stops a container left running cannot be implemented, because in this particular case there is no way to stop the container.

The only alternative so far is to reboot the host :-/

http://q-jenkins.ubuntu-ci:8080/job/autopilot-trusty-daily_release/label=qa-intel-4000/951/console is one occurrence where the kernel crashed and 'lxc-stop -k -t 120 -n trusty-i386-20131216-0008' hung, requiring the job to be aborted.
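
One way to keep a cleanup step from blocking a whole job is to put an outer deadline on the stop command itself. The following is only a sketch, assuming GNU coreutils `timeout` is available on the host; `run_with_deadline` is a hypothetical helper name and the 150 and 10 second values are arbitrary. It is not a fix: if lxc-stop itself ends up in uninterruptible sleep, no signal will end it until the kernel releases it.

```shell
# Hypothetical helper: run a command, but give up after a deadline.
# Assumes GNU coreutils `timeout`. Sends SIGTERM after $1 seconds,
# then SIGKILL 10 seconds later if the command is still running.
run_with_deadline() {
    deadline=$1
    shift
    timeout -k 10 "$deadline" "$@"
}

# Possible use in a cleanup step (container name taken from the report):
#   run_with_deadline 150 lxc-stop -k -t 120 -n trusty-i386-20131216-0008
#   status=$?
#   # 124: the deadline passed and SIGTERM was sent
#   # 137: SIGKILL was needed as well
```

Checking the exit status lets the calling job record that the container was left running instead of waiting on it indefinitely.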

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

What exactly is the kernel crash being triggered on the host?

What does 'sudo strace -f lxc-stop -k -n <container>' show?

What does the file outout show after running 'sudo lxc-stop -l info -o outout -n <container>' ?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

(I should've explicitly asked: exactly what steps are you taking to reproduce this)

Changed in lxc (Ubuntu):
status: New → Incomplete
importance: Undecided → High
Revision history for this message
Stéphane Graber (stgraber) wrote :

I vaguely remember discussing this on IRC.

IIRC the case was something in the container getting stuck in D state because of a kernel bug. In that case, the container couldn't be killed, as anything attempting to do so would get stuck too.

Anyway, this is pretty unfortunate but it's a kernel bug and there's nothing LXC can do to kill an unkillable process so I think our current behaviour is fine and the kernel is what should get fixed.
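
To check whether this is what is happening, one can look for processes in D state (uninterruptible sleep) on the host. A minimal sketch, assuming the procps `ps` found on Ubuntu; `filter_d_state` is a hypothetical helper name, and it only reports such processes, it cannot unblock them.

```shell
# Hypothetical helper: keep only processes in uninterruptible sleep.
# Reads `ps -eo pid,stat,wchan:32,comm` style output on stdin, where
# the second column is the process state; a leading "D" marks
# uninterruptible sleep.
filter_d_state() {
    awk '$2 ~ /^D/'
}

# Possible use on the host:
#   ps -eo pid,stat,wchan:32,comm | filter_d_state
```

The wchan column shows the kernel function each process is sleeping in, which is usually the most useful detail to include in a kernel bug report.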

Changed in lxc (Ubuntu):
status: Incomplete → Invalid
Revision history for this message
Vincent Ladeuil (vila) wrote :

Sorry for not providing timely feedback there (nor a reproducing env).

Truth is this hasn't recurred, so marking invalid seems fair.

Thanks for the help nevertheless. If it happens again, I'll make sure to have such a reproducing env before re-opening this bug.

Revision history for this message
ivan (anubis00786) wrote :

I ran into this problem as well.
Commands executed:

sudo apt-get install lxc
sudo lxc-create -t ubuntu -n p1
sudo lxc-start -n p1
sudo lxc-stop -n p1    <== execution hangs

Output of 'sudo strace -f lxc-stop -k -n p1':

execve("/usr/bin/lxc-stop", ["lxc-stop", "-k", "-n", "p1"], [/* 26 vars */]) = 0
brk(0) = 0xb79de000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7768000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=106845, ...}) = 0
mmap2(NULL, 106845, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb774d000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib/i386-linux-gnu/liblxc.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200\205\0\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=429132, ...}) = 0
mmap2(NULL, 427812, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb76e4000
mmap2(0xb774b000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x67000) = 0xb774b000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\340\233\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1758972, ...}) = 0
mmap2(NULL, 1768060, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7534000
mprotect(0xb76dd000, 4096, PROT_NONE) = 0
mmap2(0xb76de000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a9000) = 0xb76de000
mmap2(0xb76e1000, 10876, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb76e1000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/i386-linux-gnu/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320\16\0\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=18048, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7533000
mmap2(NULL, 20824, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb752d000
mmap2(0xb7531000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0xb7531000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib/i386-linux-gnu/libapparmor.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\300\20\0\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=42820, ...}) = 0
mmap2(NULL, 45564, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7521000
mmap2(0xb752b000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x9000) = 0xb752b000
close(3) = 0
access(...

Revision history for this message
ivan (anubis00786) wrote :

ivan@ivan-Satellite-A40:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.1 LTS
Release: 14.04
Codename: trusty

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks for the information. Indeed, lxc-stop -k still works by talking to the lxc monitor. What appears to have happened here is that the monitor itself is frozen.

Could you please show full output of:

ps -ef

Then do 'ps -ef | grep lxc-start' to find the lxc-start process for the hung container. In one terminal do

sudo strace -f -odebug.out -p $pid

where $pid is the process id of the lxc-start. Then in another terminal again do

lxc-stop -n $container -k

and attach the file debug.out to this report.
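
The steps above could be scripted roughly as follows. This is a sketch, not part of LXC: `find_lxc_start_pid` is a hypothetical helper that parses `ps -ef` output, and it assumes the container was started with a plain 'lxc-start -n <name>' command line, so that the command is the eighth column and the name follows '-n'.

```shell
# Hypothetical helper: print the PID of the lxc-start process for a
# container. $1 is the container name; stdin is `ps -ef` output
# (columns: UID PID PPID C STIME TTY TIME CMD...).
find_lxc_start_pid() {
    awk -v name="$1" '
        $8 == "lxc-start" || $8 ~ /\/lxc-start$/ {
            # Look for "-n <name>" among the command arguments.
            for (i = 9; i < NF; i++)
                if ($i == "-n" && $(i + 1) == name) { print $2; break }
        }'
}

# Possible use, with the container p1 from the earlier comment:
#   pid=$(ps -ef | find_lxc_start_pid p1)
#   sudo strace -f -o debug.out -p "$pid"    # terminal 1
#   sudo lxc-stop -n p1 -k                   # terminal 2
```

Matching on the command column rather than the whole line keeps the 'grep lxc-start' process and any 'sudo' wrapper out of the result.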

Changed in lxc (Ubuntu):
status: Invalid → Incomplete
Revision history for this message
ivan (anubis00786) wrote :

All output is in the attached file.

Revision history for this message
ivan (anubis00786) wrote :

debug file

Revision history for this message
ivan (anubis00786) wrote :

Second variant, without using -d.
All output is in the attached file.

Revision history for this message
ivan (anubis00786) wrote :

debug file

Changed in lxc (Ubuntu):
status: Incomplete → New
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in lxc (Ubuntu):
status: New → Confirmed
Revision history for this message
Stéphane Graber (stgraber) wrote :

Not seeing anything obviously wrong here...

To rule out a few more things, when "lxc-stop -n container -k" hangs, can you run the following and paste the output:
 - ps faux
 - dmesg

Thanks!

Changed in lxc (Ubuntu):
status: Confirmed → Incomplete
importance: High → Undecided
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for lxc (Ubuntu) because there has been no activity for 60 days.]

Changed in lxc (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Vincent Ladeuil (vila) wrote :

It never happened again (it may have been misuse involving immortal sudo subprocesses, but I'm blurry on the details). Marking invalid, though fix released may be correct as well.

Changed in lxc (Ubuntu):
status: Expired → Invalid