lucid container start failure after calling lxc-stop (fails across reboots)

Bug #819621 reported by Robert Collins
This bug affects 2 people
Affects: lxc (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Unassigned
Nominated for Natty by Serge Hallyn

Bug Description

The following recipe appears to reliably nuke an lxc i386 lucid container running on an oneiric amd64 host:
lxc-start -n <name>
(login)
sudo poweroff -n
- hangs

in another terminal
sudo lxc-stop -n <name>

then
sudo lxc-start -n <name>
shows

sudo lxc-start -n lucid-test-lp
init: plymouth-splash main process (170) terminated with status 2
init: plymouth main process (7) killed by ABRT signal
init: ssh main process (37) terminated with status 255

the output from poweroff is this:

Broadcast message from robertc@lucid-test-lp
        (/dev/console) at 4:23 ...

The system is going down for power off NOW!
robertc@lucid-test-lp:~$ init: tty4 main process (225) killed by TERM signal
init: tty2 main process (226) killed by TERM signal
init: tty3 main process (228) killed by TERM signal
init: tty1 main process (241) killed by TERM signal
init: console main process (2534) killed by TERM signal
Checking for running unattended-upgrades: * Asking all remaining processes to terminate...
   ...done.
 * All processes ended within 1 seconds....
   ...done.
 * Deconfiguring network interfaces...
   ...done.
 * Deactivating swap...
   ...fail!
 * Unmounting weak filesystems...
   ...done.
mount: / is busy
 * Will now halt

Robert Collins (lifeless) wrote :

Further investigation shows that just:
logging into the new container
installing some packages (see https://dev.launchpad.net/Running/LXC)
and calling lxc-stop from another terminal

will cause this.

Am progressing with shorter and shorter tests

summary: - container start failure after calling poweroff -n
+ lucid container start failure after calling lxc-stop (fails across reboots)
Robert Collins (lifeless) wrote :

start, login, log out, lxc-stop from another terminal
-> the broken situation.

Robert Collins (lifeless) wrote :

And confirmed it's still nuked after a reboot.

Robert Collins (lifeless) wrote :

So now I made a clean container, copied it (sudo cp -a) to a temp dir, ran the original, and now have a diff of the changes. I've not yet tested to see if reverting them fixes it.

the mtab being nonempty in the nonrunning container is a little suspect

as are the .udev files *on disk* (mainly symlinks it appears)
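The copy-then-diff approach described above could be sketched roughly as follows. This is a minimal sketch, not the commands actually used in this report: the function names and any paths passed to them are hypothetical, and on a real container tree both steps would need to run as root. Note that diff -r follows symlinks, so dangling links in the rootfs will show up as errors rather than differences.

```shell
# Sketch only: function names are illustrative, not part of lxc.

# snapshot_rootfs ROOTFS SNAPSHOT
# Copy ROOTFS to SNAPSHOT, preserving ownership, modes, timestamps
# and symlinks (cp -a). SNAPSHOT must not exist yet.
snapshot_rootfs() {
    cp -a "$1" "$2"
}

# changed_paths SNAPSHOT ROOTFS
# List files that differ or exist on one side only.
# diff exits with status 1 when it finds differences, so that status
# is not treated as a failure here; status 2 (real trouble) still is.
changed_paths() {
    diff -rq "$1" "$2" || [ $? -eq 1 ]
}
```

Usage would be along the lines of: snapshot the clean rootfs, start and stop the container, then run changed_paths on the snapshot and the live rootfs to see what the run left behind (for example etc/mtab).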

Robert Collins (lifeless) wrote :

ps fax for the broken container starting up:

 4134 pts/0 S+ 0:00 | \_ sudo lxc-start -n lucid-test-lp
 4135 pts/0 S+ 0:00 | \_ lxc-start -n lucid-test-lp
 4150 ? Ss 0:00 | \_ /sbin/init
 4192 ? S 0:00 | \_ upstart-udev-bridge --daemon
 4197 ? S<s 0:00 | \_ udevd --daemon
 4273 ? S< 0:00 | | \_ udevd --daemon
 4274 ? S< 0:00 | | \_ udevd --daemon
 4200 ? Ss 0:00 | \_ /usr/sbin/sshd

Robert Collins (lifeless) wrote :

Pre breakage:

 4534 pts/0 S+ 0:00 | \_ sudo lxc-start -n lucid-test-lp
 4535 pts/0 S+ 0:00 | \_ lxc-start -n lucid-test-lp
 4546 ? Ss 0:00 | \_ /sbin/init
 4593 ? S 0:00 | \_ upstart-udev-bridge --daemon
 4597 ? S<s 0:00 | \_ udevd --daemon
 4728 ? S< 0:00 | | \_ udevd --daemon
 4729 ? S< 0:00 | | \_ udevd --daemon
 4936 pts/4 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty4
 4937 pts/2 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty2
 4939 pts/3 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty3
 4945 ? S 0:00 | \_ /bin/sh /etc/init.d/ondemand background
 4957 ? S 0:00 | | \_ sleep 60
 4952 pts/1 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty1
 4953 pts/5 Ss+ 0:00 | \_ /sbin/getty -8 38400 /dev/console
 4982 ? Ss 0:00 | \_ dhclient3 -e IF_METRIC=100 -pf /var/run/dhclient.eth0.pid -lf /var/lib/dhcp3/dhclient.eth0.leases eth0
 4997 ? Ss 0:00 | \_ /usr/sbin/sshd

Robert Collins (lifeless) wrote :

 4936 pts/4 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty4
 4937 pts/2 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty2
 4939 pts/3 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty3
 4945 ? S 0:00 | \_ /bin/sh /etc/init.d/ondemand background
 4957 ? S 0:00 | | \_ sleep 60
 4952 pts/1 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty1
 4953 pts/5 Ss+ 0:00 | \_ /sbin/getty -8 38400 /dev/console
 4982 ? Ss 0:00 | \_ dhclient3 -e IF_METRIC=100 -pf /var/run/dhclient.eth0.pid -lf /var/lib/dhcp3/dhclient.eth0.leases eth0

These look to be the processes unique to the pre-breakage listing, i.e. the ones that never start in the broken container.
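Since the PIDs differ between the two runs, the comparison has to be made on the command column alone. A rough sketch of doing that mechanically, assuming each ps fax listing has been saved to a file with the column layout shown above (the function names and file names are hypothetical):

```shell
# ps_commands FILE
# Drop the PID, TTY, STAT and TIME columns and the tree glyphs
# ("|" and "\_"), leaving one sorted, de-duplicated command per line.
ps_commands() {
    awk '{ $1 = $2 = $3 = $4 = ""; sub(/^[ |\\_]+/, ""); print }' "$1" | sort -u
}

# only_in_first WORKING BROKEN
# Print commands present in the WORKING listing but absent from BROKEN.
only_in_first() {
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    ps_commands "$1" > "$tmp_a"
    ps_commands "$2" > "$tmp_b"
    comm -23 "$tmp_a" "$tmp_b"   # lines unique to the first file
    rm -f "$tmp_a" "$tmp_b"
}
```

For example, `only_in_first working.txt broken.txt` would print the getty, ondemand and dhclient3 lines if the two files held the listings from this report.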

Robert Collins (lifeless) wrote :

(Meta: does ondemand make any sense for containers anyhow?)

Changed in lxc (Ubuntu):
status: New → Confirmed
importance: Undecided → High
Serge Hallyn (serge-hallyn) wrote :

Thanks, Robert.

It looks like ssh is not being properly killed. When we force it to die with lxc-stop, it does not remove its /var/run/ files. Then it fails to start because those files already exist.
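One way to check for leftovers of this kind is to look inside the stopped container's rootfs from the host. The following is only a sketch under that assumption: the function name is hypothetical, and it simply lists every non-directory entry under var/run, on the reasoning that nothing there can belong to a live process while the container is stopped.

```shell
# leftover_run_files ROOTFS
# With the container stopped, list every file, socket or symlink still
# present under ROOTFS/var/run. Prints nothing if the directory is clean
# or does not exist.
leftover_run_files() {
    find "$1/var/run" ! -type d -print 2>/dev/null | sort
}
```

It would be called with the container's rootfs path, e.g. /var/lib/lxc/<container-name>/rootfs.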

Until this is fixed, you can work around this by doing

  rm -rf /var/lib/lxc/<container-name>/rootfs/var/run
  mkdir /var/lib/lxc/<container-name>/rootfs/var/run

After this I'm able to restart the lucid container. You can automate this by changing your container's /etc/init/lxcmount.conf to the one I'm about to attach. With that, I still need to kill the container by hand, but then lxc-start works.

I'm tempted to wait to see if we can implement proper container reboot/poweroff support at the lxc sprint next week, because the lxc-monitor watching /var/run/utmp is an ugly hack anyway, and continually finding ways to fix its breakage does not seem productive.

Serge Hallyn (serge-hallyn) wrote :

Replacing a lucid container's /etc/init/lxcmount.conf with this attachment should allow the container to start after a forced stop.

Serge Hallyn (serge-hallyn) wrote :

Note also that when we get proper poweroff/reboot support, we can have /lib/init/fstab mount a tmpfs onto /var/run, which will prevent these issues with /var/run/* files persisting.

Robert Collins (lifeless) wrote :

indeed, brilliant - thanks. I guess this needs a new lxcguest package?

Robert Collins (lifeless) wrote :

This workaround is a little coarse - it breaks things that depend on utmp (like postgresql :P)

so
rm -rf /var/run
mkdir /var/run
touch /var/run/utmp
chown root:utmp /var/run/utmp

That seems to be more robust.
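As a rough sketch, the clear-and-recreate-utmp idea could be wrapped in a small function so that it can be pointed at any run directory. The function name and its guards are assumptions for illustration, not what the lxc package ships; the ownership change is only attempted when running as root on a system that has a utmp group.

```shell
# reset_run_dir RUN_DIR
# Empty RUN_DIR of stale state, then recreate an empty utmp inside it.
reset_run_dir() {
    run_dir=$1
    # Refuse obviously dangerous arguments before calling rm -rf.
    [ -n "$run_dir" ] || return 1
    [ "$run_dir" != "/" ] || return 1

    rm -rf "$run_dir"
    mkdir -p "$run_dir"
    : > "$run_dir/utmp"          # create an empty utmp file

    # Changing ownership needs root and an existing utmp group;
    # it is skipped otherwise.
    if [ "$(id -u)" -eq 0 ] && getent group utmp >/dev/null 2>&1; then
        chown root:utmp "$run_dir/utmp"
    fi
}
```

Inside the guest this would be called as `reset_run_dir /var/run` early in boot; from the host, the container's rootfs/var/run path would be passed instead.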

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc - 0.7.4.2-3ubuntu6

---------------
lxc (0.7.4.2-3ubuntu6) oneiric; urgency=low

  * Add lxc-start-ephemeral by Robert Collins (LP: #807351)
  * Add a --quit-on-stop arg to lxc-monitor for use by lxc-start-ephemeral.
  * Modify lxcguest.conf to clear out /var/run (LP: #819621)
  * Fix a bug in lxc-ps when cgroup-bin is not mounted.
  * Modify lxc-ps to accept '-n name' and support '--' to separate options
    for ps. (LP: #820720)
 -- Serge Hallyn <email address hidden> Wed, 03 Aug 2011 19:48:11 -0500

Changed in lxc (Ubuntu):
status: Confirmed → Fix Released
Aaz (aaz2009) wrote :

The same recipe reproduces the failure with a natty amd64 container running on a natty amd64 host.

Serge Hallyn (serge-hallyn) wrote :

@Aaz,

I can't actually reproduce this on a fresh natty install. Can you file a new bug showing precise reproduction instructions?
