Ubuntu
lxc package

lucid container start failure after calling lxc-stop (fails across reboots)

Bug #819621 reported by Robert Collins on 2011-08-02

This bug affects 2 people

Affects: lxc (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Unassigned
Nominated for Natty by Serge Hallyn

Bug Description

The following recipe appears to reliably nuke an i386 lucid lxc container running on an amd64 oneiric host:

lxc-start -n <name>
(log in)
sudo poweroff -n
(this hangs)

In another terminal:
sudo lxc-stop -n <name>

Then:
sudo lxc-start -n <name>
shows:

sudo lxc-start -n lucid-test-lp
init: plymouth-splash main process (170) terminated with status 2
init: plymouth main process (7) killed by ABRT signal
init: ssh main process (37) terminated with status 255

the output from poweroff is this:

Broadcast message from robertc@lucid-test-lp
(/dev/console) at 4:23 ...

The system is going down for power off NOW!
robertc@lucid-test-lp:~$ init: tty4 main process (225) killed by TERM signal
init: tty2 main process (226) killed by TERM signal
init: tty3 main process (228) killed by TERM signal
init: tty1 main process (241) killed by TERM signal
init: console main process (2534) killed by TERM signal
Checking for running unattended-upgrades: * Asking all remaining processes to terminate...
   ...done.
* All processes ended within 1 seconds....
   ...done.
* Deconfiguring network interfaces...
   ...done.
* Deactivating swap...
   ...fail!
* Unmounting weak filesystems...
   ...done.
mount: / is busy
* Will now halt

Related branches

lp:ubuntu/oneiric/lxc

Robert Collins (lifeless) wrote on 2011-08-02:

Further investigation shows that just:
logging into the new container
installing some packages (see https://dev.launchpad.net/Running/LXC)
and calling lxc-stop from another terminal

will cause this.

Am progressing with shorter and shorter tests

summary:

- container start failure after calling poweroff -n
+ lucid container start failure after calling lxc-stop (fails across
+ reboots)

Robert Collins (lifeless) wrote on 2011-08-02:

start, login, log out, lxc-stop from another terminal
-> the broken situation.

Robert Collins (lifeless) wrote on 2011-08-02:

And confirmed it's still nuked after a reboot.

Robert Collins (lifeless) wrote on 2011-08-02:

Attachment: diff of broken, working containers (137.9 KiB, text/plain)

So now I made a clean container, copied it (sudo cp -a) to a temp dir, ran the original, and now have a diff of the changes. I've not yet tested whether reverting them fixes it.

the mtab being nonempty in the nonrunning container is a little suspect

as are the .udev files *on disk* (mainly symlinks it appears)
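
The snapshot-and-diff approach described above can be sketched roughly as follows. The function names and directory layout are hypothetical and not taken from the attachment. The sketch assumes the container is stopped while the copy is taken and that the destination does not exist yet. Note that diff follows symlinks, so dangling links inside a rootfs may be reported as errors.

```shell
#!/bin/sh
# Sketch only: snapshot a container rootfs, then list what changed after a run.
# snapshot_rootfs and diff_rootfs are illustrative names, not part of lxc.

# Copy the rootfs, preserving ownership, modes, timestamps and symlinks (cp -a).
# The destination must not already exist, or cp nests the copy inside it.
snapshot_rootfs() {
    cp -a "$1" "$2"
}

# Print files that differ or exist on only one side.
# diff exits 1 when differences are found, so only status 2 (trouble)
# is treated as a failure here.
diff_rootfs() {
    status=0
    diff -rq "$1" "$2" || status=$?
    [ "$status" -le 1 ]
}
```

For example, run `snapshot_rootfs /var/lib/lxc/<name>/rootfs /tmp/<name>-clean` before starting the container, then `diff_rootfs /tmp/<name>-clean /var/lib/lxc/<name>/rootfs` after the forced stop. Both would typically need sudo.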

Robert Collins (lifeless) wrote on 2011-08-02:

ps fax for the broken container starting up;

Robert Collins (lifeless) wrote on 2011-08-02:

Pre breakage:

4534 pts/0 S+ 0:00 | \_ sudo lxc-start -n lucid-test-lp
4535 pts/0 S+ 0:00 | \_ lxc-start -n lucid-test-lp
4546 ? Ss 0:00 | \_ /sbin/init
4593 ? S 0:00 | \_ upstart-udev-bridge --daemon
4597 ? S<s 0:00 | \_ udevd --daemon
4728 ? S< 0:00 | | \_ udevd --daemon
4729 ? S< 0:00 | | \_ udevd --daemon
4936 pts/4 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty4
4937 pts/2 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty2
4939 pts/3 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty3
4945 ? S 0:00 | \_ /bin/sh /etc/init.d/ondemand background
4957 ? S 0:00 | | \_ sleep 60
4952 pts/1 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty1
4953 pts/5 Ss+ 0:00 | \_ /sbin/getty -8 38400 /dev/console
4982 ? Ss 0:00 | \_ dhclient3 -e IF_METRIC=100 -pf /var/run/dhclient.eth0.pid -lf /var/lib/dhcp3/dhclient.eth0.leases eth0
4997 ? Ss 0:00 | \_ /usr/sbin/sshd

Robert Collins (lifeless) wrote on 2011-08-02:

4936 pts/4 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty4
4937 pts/2 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty2
4939 pts/3 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty3
4945 ? S 0:00 | \_ /bin/sh /etc/init.d/ondemand background
4957 ? S 0:00 | | \_ sleep 60
4952 pts/1 Ss+ 0:00 | \_ /sbin/getty -8 38400 tty1
4953 pts/5 Ss+ 0:00 | \_ /sbin/getty -8 38400 /dev/console
4982 ? Ss 0:00 | \_ dhclient3 -e IF_METRIC=100 -pf /var/run/dhclient.eth0.pid -lf /var/lib/dhcp3/dhclient.eth0.leases eth0

look to be the unique processes

Robert Collins (lifeless) wrote on 2011-08-02:

(Meta: does ondemand make any sense for containers anyhow?)

Serge Hallyn (serge-hallyn) on 2011-08-02

Changed in lxc (Ubuntu):
status:	New → Confirmed
importance:	Undecided → High

Serge Hallyn (serge-hallyn) wrote on 2011-08-02:

Thanks, Robert.

It looks like ssh is not being properly killed. When we force it to die with lxc-stop, it does not remove its /var/run/ files. Then it fails to start because those files already exist.

Until this is fixed, you can work around this on the host by doing

rm -rf /var/lib/lxc/<container-name>/rootfs/var/run
mkdir /var/lib/lxc/<container-name>/rootfs/var/run

After this I'm able to restart the lucid container. You can automate this by changing your container's /etc/init/lxcmount.conf to the one I'm about to attach. With that, I still need to kill the container by hand, but then lxc-start works.

I'm tempted to wait to see if we can implement proper container reboot/poweroff support at the lxc sprint next week, because the lxc-monitor watching /var/run/utmp is an ugly hack anyway, and continually finding ways to fix its breaks does not seem productive.
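
For repeated use, the manual workaround above could be wrapped in a small host-side helper along these lines. This is a sketch under assumptions: reset_var_run is a made-up name, the rootfs path follows the /var/lib/lxc/<container-name>/rootfs layout used above, and the container is assumed to be stopped when it runs.

```shell
#!/bin/sh
# Sketch only: clear stale runtime files from a stopped container's rootfs.
# reset_var_run is an illustrative name, not an lxc command.
reset_var_run() {
    rootfs="$1"
    # Refuse an empty, missing or "/" argument so the host's own
    # /var/run is never touched by mistake.
    if [ -z "$rootfs" ] || [ "$rootfs" = "/" ] || [ ! -d "$rootfs" ]; then
        echo "usage: reset_var_run /var/lib/lxc/<container-name>/rootfs" >&2
        return 1
    fi
    # Remove pid files, sockets and anything else left by the forced stop,
    # then recreate the empty directory.
    rm -rf "$rootfs/var/run"
    mkdir -p "$rootfs/var/run"
}
```

It would be called as root between lxc-stop and the next lxc-start, for example `reset_var_run /var/lib/lxc/<container-name>/rootfs`.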

Serge Hallyn (serge-hallyn) wrote on 2011-08-02:

#10

Attachment: lxcmount.conf (858 bytes, text/plain)

Replacing a lucid container's /etc/init/lxcmount.conf with this attachment should allow the container to start after a forced stop.

Serge Hallyn (serge-hallyn) wrote on 2011-08-02:

#11

Note also that when we get proper poweroff/reboot support, we can have /lib/init/fstab mount a tmpfs onto /var/run, which will prevent these issues with /var/run/* files persisting.

Robert Collins (lifeless) wrote on 2011-08-02:

#12

indeed, brilliant - thanks. I guess this needs a new lxcguest package?

Robert Collins (lifeless) wrote on 2011-08-03:

#13

This workaround is a little coarse - it breaks things that depend on utmp (like postgresql :P)

so
rm -rf /var/run
mkdir /var/run
touch /var/run/utmp
chown root:utmp /var/run/utmp

That seems to be more robust.
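
As a rough sketch, the utmp-preserving cleanup might look like this as a function. It is not the attached file: clear_run_keep_utmp is a hypothetical name, the directory is passed explicitly (inside the container it would be /var/run), ownership is set with chown, and the 664 mode is an assumption based on common utmp permissions.

```shell
#!/bin/sh
# Sketch only: empty a runtime directory but leave a fresh, empty utmp behind.
# clear_run_keep_utmp is an illustrative name.
clear_run_keep_utmp() {
    run="$1"
    if [ -z "$run" ]; then
        echo "usage: clear_run_keep_utmp /var/run" >&2
        return 1
    fi
    # Drop all stale pid files and sockets, then recreate the directory
    # so that touch has somewhere to put the new file.
    rm -rf "$run"
    mkdir -p "$run"
    touch "$run/utmp"
    chmod 664 "$run/utmp"   # assumed mode, not taken from the bug report
    # Changing ownership needs root and an existing utmp group;
    # skip it quietly otherwise.
    if [ "$(id -u)" -eq 0 ] && getent group utmp >/dev/null 2>&1; then
        chown root:utmp "$run/utmp"
    fi
}
```

Inside the container this would be run early in boot as `clear_run_keep_utmp /var/run`, before services that write pid files or read utmp are started.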

Robert Collins (lifeless) wrote on 2011-08-03:

#14

Attachment: utmp safe version (910 bytes, text/plain)

Launchpad Janitor (janitor) wrote on 2011-08-05:

#15

This bug was fixed in the package lxc - 0.7.4.2-3ubuntu6

---------------
lxc (0.7.4.2-3ubuntu6) oneiric; urgency=low

  * Add lxc-start-ephemeral by Robert Collins (LP: #807351)
  * Add a --quit-on-stop arg to lxc-monitor for use by lxc-start-ephemeral.
  * Modify lxcguest.conf to clear out /var/run (LP: #819621)
  * Fix a bug in lxc-ps when cgroup-bin is not mounted.
  * Modify lxc-ps to accept '-n name' and support '--' to separate options
    for ps. (LP: #820720)
-- Serge Hallyn <email address hidden> Wed, 03 Aug 2011 19:48:11 -0500

Changed in lxc (Ubuntu):
status:	Confirmed → Fix Released

Aaz (aaz2009) wrote on 2011-09-10:

#16

The same recipe reproduces the failure with a natty amd64 container running on a natty amd64 host.

Serge Hallyn (serge-hallyn) wrote on 2011-09-19:

#17

@Aaz,

I can't actually reproduce this on a fresh natty install. Can you file a new bug showing precise reproduction instructions?
