lucid container start failure after calling lxc-stop (fails across reboots)

Bug #819621 reported by Robert Collins on 2011-08-02
This bug affects 2 people
Affects: lxc (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Unassigned
Nominated for Natty by Serge Hallyn

Bug Description

The following recipe appears to reliably nuke an lxc i386 lucid container running on an oneiric amd64 host:
lxc-start -n <name>
(login)
sudo poweroff -n
- hangs

in another terminal
sudo lxc-stop -n <name>

then
sudo lxc-start -n <name>
shows

sudo lxc-start -n lucid-test-lp
init: plymouth-splash main process (170) terminated with status 2
init: plymouth main process (7) killed by ABRT signal
init: ssh main process (37) terminated with status 255

the output from poweroff is this:

Broadcast message from robertc@lucid-test-lp
        (/dev/console) at 4:23 ...

The system is going down for power off NOW!
robertc@lucid-test-lp:~$ init: tty4 main process (225) killed by TERM signal
init: tty2 main process (226) killed by TERM signal
init: tty3 main process (228) killed by TERM signal
init: tty1 main process (241) killed by TERM signal
init: console main process (2534) killed by TERM signal
Checking for running unattended-upgrades: * Asking all remaining processes to terminate...
   ...done.
 * All processes ended within 1 seconds....
   ...done.
 * Deconfiguring network interfaces...
   ...done.
 * Deactivating swap...
   ...fail!
 * Unmounting weak filesystems...
   ...done.
mount: / is busy
 * Will now halt

Robert Collins (lifeless) wrote :

Further investigation shows that just:
logging into the new container
installing some packages (see https://dev.launchpad.net/Running/LXC)
and calling lxc-stop from another terminal

will cause this.

Am progressing with shorter and shorter tests

summary: - container start failure after calling poweroff -n
+ lucid container start failure after calling lxc-stop (fails across reboots)
Robert Collins (lifeless) wrote :

start, login, log out, lxc-stop from another terminal
-> the broken situation.

Robert Collins (lifeless) wrote :

And confirmed it's still nuked after a reboot.

Robert Collins (lifeless) wrote :

So now I made a clean container, copied it (sudo cp -a) to a temp dir, ran the original, and now have a diff of the changes. I've not yet tested to see if reverting them fixes it.

the mtab being nonempty in the nonrunning container is a little suspect

as are the .udev files *on disk* (mainly symlinks it appears)
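
A rough sketch of how such a before/after comparison could be scripted. The helper name changed_files and the snapshot location in the usage comment are made up for illustration; only the /var/lib/lxc/<container-name>/rootfs layout comes from this report.

```shell
# Hypothetical helper: list what differs between a pristine snapshot of a
# container rootfs and the same rootfs after the container has been run.
# Usage (paths are illustrative; run as root so every file is readable):
#   cp -a /var/lib/lxc/<container-name>/rootfs /tmp/pristine-rootfs
#   ... start, use and stop the container ...
#   changed_files /tmp/pristine-rootfs /var/lib/lxc/<container-name>/rootfs
changed_files() {
    pristine=${1:?usage: changed_files PRISTINE_DIR USED_DIR}
    used=${2:?usage: changed_files PRISTINE_DIR USED_DIR}
    # -r walks both trees; -q prints one line per file that differs or that
    # exists on one side only. diff exits with status 1 when it finds
    # differences, which is the expected outcome here.
    diff -rq "$pristine" "$used"
}
```

Leftovers such as a non-empty mtab or on-disk .udev entries would show up as "Only in" or "differ" lines in this output.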

Robert Collins (lifeless) wrote :

ps fax for the broken container starting up:

 4134 pts/0 S+ 0:00 | \_ sudo lxc-start -n lucid-test-lp
 4135 pts/0 S+ 0:00 | \_ lxc-start -n lucid-test-lp
 4150 ? Ss 0:00 | \_ /sbin/init
 4192 ? S 0:00 | \_ upstart-udev-bridge --daemon
 4197 ? S<s 0:00 | \_ udevd --daemon
 4273 ? S< 0:00 | | \_ udevd --daemon
 4274 ? S< 0:00 | | \_ udevd --daemon
 4200 ? Ss 0:00 | \_ /usr/sbin/sshd

Robert Collins (lifeless) wrote :

Pre breakage:

 4534 pts/0 S+ 0:00 | \_ sudo lxc-start -n lucid-test-lp
 4535 pts/0 S+ 0:00 | \_ lxc-start -n lucid-test-lp
 4546 ? Ss 0:00 | \_ /sbin/init
 4593 ? S 0:00 | \_ upstart-udev-bridge --daemon
 4597 ? S<s 0:00 | \_ udevd --daemon
 4728 ? S< 0:00 | | \_ udevd --daemon
 4729 ? S< 0:00 | | \_ udevd --daemon
 4936 pts/4 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty4
 4937 pts/2 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty2
 4939 pts/3 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty3
 4945 ? S 0:00 | \_ /bin/sh /etc/init.d/ondemand background
 4957 ? S 0:00 | | \_ sleep 60
 4952 pts/1 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty1
 4953 pts/5 Ss+ 0:00 | \_ /sbin/getty -8 38400 /dev/console
 4982 ? Ss 0:00 | \_ dhclient3 -e IF_METRIC=100 -pf /var/run/dhclient.eth0.pid -lf /var/lib/dhcp3/dhclient.eth0.leases eth0
 4997 ? Ss 0:00 | \_ /usr/sbin/sshd

Robert Collins (lifeless) wrote :

 4936 pts/4 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty4
 4937 pts/2 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty2
 4939 pts/3 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty3
 4945 ? S 0:00 | \_ /bin/sh /etc/init.d/ondemand background
 4957 ? S 0:00 | | \_ sleep 60
 4952 pts/1 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty1
 4953 pts/5 Ss+ 0:00 | \_ /sbin/getty -8 38400 /dev/console
 4982 ? Ss 0:00 | \_ dhclient3 -e IF_METRIC=100 -pf /var/run/dhclient.eth0.pid -lf /var/lib/dhcp3/dhclient.eth0.leases eth0

look to be the unique processes
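
For what it's worth, a comparison like this can be done mechanically. The sketch below assumes the two `ps fax` listings were saved to files (the file names in the usage comment are made up) and that every line of interest contains the `\_ ` tree marker, as in the listings above.

```shell
# Hypothetical helper: print the commands that appear in the healthy
# listing but not in the broken one.
# Usage: unique_commands healthy-ps.txt broken-ps.txt
unique_commands() {
    healthy=${1:?usage: unique_commands HEALTHY_LISTING BROKEN_LISTING}
    broken=${2:?usage: unique_commands HEALTHY_LISTING BROKEN_LISTING}
    tmp_h=$(mktemp)
    tmp_b=$(mktemp)
    # Drop everything up to and including the last "\_ " so that only the
    # command line remains (PIDs differ between runs and would defeat the
    # comparison), then sort and de-duplicate for comm.
    sed 's/.*\\_ //' "$healthy" | sort -u > "$tmp_h"
    sed 's/.*\\_ //' "$broken" | sort -u > "$tmp_b"
    # comm -23 keeps the lines that are only in the first file.
    comm -23 "$tmp_h" "$tmp_b"
    rm -f "$tmp_h" "$tmp_b"
}
```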

Robert Collins (lifeless) wrote :

(Meta: does ondemand make any sense for containers anyhow?)

Changed in lxc (Ubuntu):
status: New → Confirmed
importance: Undecided → High
Serge Hallyn (serge-hallyn) wrote :

Thanks, Robert.

It looks like ssh is not being killed cleanly. When lxc-stop forces it to die, it does not remove its /var/run/ files, and it then fails to start because those files already exist.
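
One way to check for such leftovers from the host while the container is stopped is sketched below. The helper name stale_run_files is made up; the rootfs path in the usage comment follows the /var/lib/lxc/<container-name>/rootfs layout.

```shell
# Hypothetical helper: list regular files and sockets left behind under
# var/run in a stopped container's rootfs. Only meaningful while the
# container is not running; otherwise the files may be legitimately in use.
# Usage: stale_run_files /var/lib/lxc/<container-name>/rootfs
stale_run_files() {
    rootfs=${1:?usage: stale_run_files ROOTFS}
    # -type f matches pid and state files, -type s matches unix sockets.
    find "$rootfs/var/run" \( -type f -o -type s \) 2>/dev/null
}
```

If this prints anything after a forced stop (for example an sshd pid file), those entries are candidates for the failure described here.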

Until this is fixed, you can work around this by doing

  rm -rf /var/lib/lxc/<container-name>/rootfs/var/run
  mkdir !!:2

After this I'm able to restart the lucid container. You can automate this by changing your container's /etc/init/lxcmount.conf to the one I'm about to attach. With that, I still need to kill the container by hand, but then lxc-start works.

I'm tempted to wait to see if we can implement proper container reboot/poweroff support at the lxc sprint next week, because the lxc-monitor watching /var/run/utmp is an ugly hack anyway, and continually finding ways to fix its breaks does not seem productive.

Serge Hallyn (serge-hallyn) wrote :

Replacing a lucid container's /etc/init/lxcmount.conf with this attachment should allow the container to start after a forced stop.

Serge Hallyn (serge-hallyn) wrote :

Note also that when we get proper poweroff/reboot support, we can have /lib/init/fstab mount a tmpfs onto /var/run, which will prevent these issues with /var/run/* files persisting.

Robert Collins (lifeless) wrote :

indeed, brilliant - thanks. I guess this needs a new lxcguest package?

Robert Collins (lifeless) wrote :

This workaround is a little coarse - it breaks things that depend on utmp (like postgresql :P)

so
rm -rf /var/run
mkdir /var/run
touch /var/run/utmp
chown root:utmp /var/run/utmp

That seems to be more robust.
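
A host-side sketch of the same idea, run against a stopped container's rootfs. The helper name reset_var_run, the 664 mode and the best-effort ownership step are assumptions for illustration. Note that ownership is set with chown, and that the root and utmp names are resolved on the host, which may not match the ids inside the container.

```shell
# Hypothetical helper: clear var/run in a stopped container's rootfs and
# recreate an empty utmp so that utmp-dependent software keeps working.
# Usage: reset_var_run /var/lib/lxc/<container-name>/rootfs
reset_var_run() {
    rootfs=${1:?usage: reset_var_run ROOTFS}
    run_dir="$rootfs/var/run"
    rm -rf "$run_dir"
    mkdir -p "$run_dir"
    touch "$run_dir/utmp"
    # Assumed mode: utmp is commonly group-writable for the utmp group.
    chmod 664 "$run_dir/utmp"
    # Best effort: needs root and a utmp group on the host.
    chown root:utmp "$run_dir/utmp" 2>/dev/null ||
        echo "warning: could not chown $run_dir/utmp to root:utmp" >&2
}
```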

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc - 0.7.4.2-3ubuntu6

---------------
lxc (0.7.4.2-3ubuntu6) oneiric; urgency=low

  * Add lxc-start-ephemeral by Robert Collins (LP: #807351)
  * Add a --quit-on-stop arg to lxc-monitor for use by lxc-start-ephemeral.
  * Modify lxcguest.conf to clear out /var/run (LP: #819621)
  * Fix a bug in lxc-ps when cgroup-bin is not mounted.
  * Modify lxc-ps to accept '-n name' and support '--' to separate options
    for ps. (LP: #820720)
 -- Serge Hallyn <email address hidden> Wed, 03 Aug 2011 19:48:11 -0500

Changed in lxc (Ubuntu):
status: Confirmed → Fix Released
Aaz (aaz2009) wrote :

The same recipe also reproduces the failure with a natty amd64 container running on a natty amd64 host.

Serge Hallyn (serge-hallyn) wrote :

@Aaz,

I can't actually reproduce this on a fresh natty install. Can you file a new bug showing precise reproduction instructions?
