devel-proposed - android lxc container fails to start

Bug #1551150 reported by Jean-Baptiste Lallement on 2016-02-29
This bug affects 9 people
Affects                       Importance  Assigned to
Canonical System Image        Critical    John McAleely
lxc (Ubuntu)                  Critical    Ondrej Kubik
lxc-android-config (Ubuntu)   Critical    Ondrej Kubik

Bug Description

Last known good image: ubuntu-touch/devel-proposed/ubuntu/mako 446

ubuntu-touch/devel-proposed/ubuntu/mako 447 (and higher) and equivalent images on other devices do not boot and hang on the boot logo.

From the list of changes, it could be lxc:
446: lxc 2.0.0~rc2-0ubuntu2
447: lxc 2.0.0~rc3-0ubuntu1

Latest image with 2.0.0~rc4-0ubuntu1 doesn't boot either.

lxc (2.0.0~rc3-0ubuntu1) xenial; urgency=medium

  * New upstream release (2.0.0~rc3)
    - Make the cgfs backend and cgns work without cgmanager
    - Manpage updates
    - Mark lxc-clone and lxc-start-ephemeral deprecated (still included)
  * Set --enable-deprecated so we still ship lxc-clone and lxc-start-ephemeral

lxc (2.0.0~rc2-0ubuntu3) xenial; urgency=medium

  * Use versioned dependencies against the various binary packages.
  * Update lxc-templates to depend on lxc1 not lxc. (LP: #1549136)
  * Move the lxcfs recommends from lxc-templates to liblxc1.
  * Drop cgmanager, use the cgfs backend instead.
  * Have liblxc1 depend on systemd | cgroup-lite for cgfs backend.

Device tarballs between 446 and 447 are the same.

Changed in canonical-devices-system-image:
importance: Undecided → Critical
Jean-Baptiste Lallement (jibel) wrote :

changes list between 446 and 447.

description: updated
Jean-Baptiste Lallement (jibel) wrote :

dmesg from a system that doesn't boot.

description: updated
Jean-Baptiste Lallement (jibel) wrote :

~ # cat /data/system-data/var/log/lxc/android.log
      lxc-start 20160229125348.090 ERROR lxc_cgfs - cgfs.c:cgfs_init:2300 - cgroupfs failed to detect cgroup metadata
      lxc-start 20160229125348.091 ERROR lxc_start - start.c:lxc_spawn:1026 - failed initializing cgroup support
      lxc-start 20160229125348.091 ERROR lxc_start - start.c:__lxc_start:1276 - failed to spawn 'android'
      lxc-start 20160229125348.091 ERROR lxc_start_ui - lxc_start.c:main:344 - The container failed to start.
      lxc-start 20160229125348.092 ERROR lxc_start_ui - lxc_start.c:main:348 - Additional information can be obtained by setting the --logfile and --logpriority options.

summary: - devel-proposed doesn't boot
+ devel-proposed - android lxc container fails to start
Changed in canonical-devices-system-image:
status: New → Confirmed
milestone: none → ww08-2016
Stéphane Graber (stgraber) wrote :

Can you paste the content of /proc/mounts on such a system and confirm that at least one of the following is true:
 - systemd is installed and used as init system
 - cgroup-lite is installed

Jean-Baptiste Lallement (jibel) wrote :

cgroup-lite is not installed:
un cgroup-lite <none>
ii systemd:armhf 229-1ubuntu4

I cannot paste the content of /proc/mounts because I only have adb in recovery (the device fails before adb starts otherwise).

Jean-Baptiste Lallement (jibel) wrote :

I installed cgroup-lite 1.10 but it still doesn't boot.

Jean-Baptiste Lallement (jibel) wrote :

With cgroup-lite installed the error message is different:
      lxc-start 20160229153538.890 ERROR lxc_cgfs - cgfs.c:cgfs_init:2300 - cgroupfs failed to detect cgroup metadata
      lxc-start 20160229153538.890 ERROR lxc_start - start.c:lxc_spawn:1026 - failed initializing cgroup support
      lxc-start 20160229153538.891 ERROR lxc_start - start.c:__lxc_start:1276 - failed to spawn 'android'
      lxc-start 20160229153538.891 ERROR lxc_start_ui - lxc_start.c:main:344 - The container failed to start.
      lxc-start 20160229153538.891 ERROR lxc_start_ui - lxc_start.c:main:348 - Additional information can be obtained by setting the --logfile and --logpriority options.
      lxc-start 20160229160049.670 ERROR lxc_cgfs - cgfs.c:lxc_cgroupfs_create:874 - Could not find writable mount point for cgroup hierarchy 7 while trying to create cgroup.
      lxc-start 20160229160049.671 ERROR lxc_cgfs - cgfs.c:cgroup_rmdir:208 - Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/freezer/
      lxc-start 20160229160049.672 ERROR lxc_cgfs - cgfs.c:cgroup_rmdir:208 - Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/cpuacct/
      lxc-start 20160229160049.672 ERROR lxc_cgfs - cgfs.c:cgroup_rmdir:208 - Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/cpu/
      lxc-start 20160229160049.672 ERROR lxc_cgfs - cgfs.c:cgroup_rmdir:208 - Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/debug/
      lxc-start 20160229160049.672 ERROR lxc_start - start.c:lxc_spawn:1033 - failed creating cgroups
      lxc-start 20160229160049.673 ERROR lxc_start - start.c:__lxc_start:1276 - failed to spawn 'android'
      lxc-start 20160229160049.673 ERROR lxc_start_ui - lxc_start.c:main:344 - The container failed to start.
      lxc-start 20160229160049.673 ERROR lxc_start_ui - lxc_start.c:main:348 - Additional information can be obtained by setting the --logfile and --logpriority options.

Stéphane Graber (stgraber) wrote :

We're gonna need a /proc/self/mountinfo output if we want to figure out what the controller #7 is on your system...

Stéphane Graber (stgraber) wrote :

Do you know if the phone actually uses systemd as its init system nowadays?

LXC requires either systemd or cgroup-lite to mount all the cgroup bits properly. If systemd is installed but not used, that could explain what you are seeing. Installing cgroup-lite should have fixed it though, unless the container is started before cgroup-lite is done. That's not possible with the regular lxc init script, but it could happen if the phone uses something custom.
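
The two conditions discussed above (which init actually runs as PID 1, and whether any cgroup filesystems got mounted) can be checked directly on a device. A minimal sketch, assuming a Linux /proc layout; the function names are hypothetical, and the optional file arguments exist only so the helpers can be pointed at saved copies of the /proc files:

```shell
# Print the command name of PID 1. Under upstart this is typically "init",
# under systemd it is "systemd". The optional argument is a saved copy of
# /proc/1/comm.
pid1_name() {
    cat "${1:-/proc/1/comm}"
}

# Count mounted cgroup filesystems. In /proc/mounts the third field is the
# filesystem type. The optional argument is a saved copy of /proc/mounts.
count_cgroup_mounts() {
    awk '$3 == "cgroup" { n++ } END { print n + 0 }' "${1:-/proc/mounts}"
}
```

If systemd is installed but upstart is booting the device and nothing else mounts the controllers, one would expect `pid1_name` to print `init` and `count_cgroup_mounts` to print `0`.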

Jean-Baptiste Lallement (jibel) wrote :

The phone uses upstart. I'll try to get the content of mountinfo

Jean-Baptiste Lallement (jibel) wrote :

android container log with cgroup-lite installed and content of mountinfo from the same boot.

Jean-Baptiste Lallement (jibel) wrote :

And syslog

Serge Hallyn (serge-hallyn) wrote :

Hi,

Sorry, mountinfo does not show the hierarchies. Can you post the output of /proc/cgroups and /proc/self/cgroup?

Stéphane Graber (stgraber) wrote :

That mountinfo shows no mounted cgroup controller, hinting that cgroup-lite either didn't start or failed while starting.

Could you look for a /var/log/upstart/cgroup-lite.log file?

Stéphane Graber (stgraber) wrote :

So far that does seem to confirm the hypothesis: the phone has systemd installed (but unused), which satisfies lxc's dependency on systemd | cgroup-lite. But upstart is what actually boots the phone, and upstart itself doesn't mount the cgroup controllers, so the system ends up without cgroups mounted.

Having whatever package sets up the android container depend directly on cgroup-lite would fix that dependency issue, though from the logs provided so far, it looks like cgroup-lite is either not starting or is failing.

I'm keeping the LXC task open for now, but so far there's been no indication of an actual bug in LXC or in its packaging. Instead, the phone seed is doing something that's not seen anywhere else (depending on systemd but not using it), which we can't do anything about from a packaging point of view. The rest of the issue appears to be some bad interaction between cgroup-lite and the init configuration on the Ubuntu phone.

Stéphane Graber (stgraber) wrote :

Some more information on cgroup-lite and where it may fail.

 - cgroup-lite is triggered on "mounted MOUNTPOINT=/sys/fs/cgroup" => the path is shown as mounted above, so not it
 - The job then gets skipped if /bin/cgroups-mount doesn't exist => part of the cgroup-lite package, so not it
 - The job then gets skipped if /sys/fs/cgroup doesn't exist => was shown as mounted above, so not it
 - cgroup-lite exits if a cgroup entry is found in /etc/fstab
 - cgroup-lite exits if /proc/cgroups doesn't exist => all Ubuntu kernels have it, so not it

So I believe the most likely reason is a "cgroup" entry of some sort in /etc/fstab. The second most likely is that upstart somehow failed to emit "mounted MOUNTPOINT=/sys/fs/cgroup".
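
The file-based conditions in the list above can be walked in the same order on a device. This is only a sketch of the checks as described in this comment, not the actual cgroup-lite job or helper script; the function names are hypothetical, and the upstart "mounted" event itself cannot be checked this way:

```shell
# Succeeds if any non-comment line of the given fstab mentions "cgroup".
fstab_mentions_cgroup() {
    grep -v '^[[:space:]]*#' "$1" 2>/dev/null | grep -q cgroup
}

# Print the first condition that would stop cgroup-lite from mounting
# anything, or "ok" if none applies.
# Arguments (defaults are the paths named in the comment above):
#   $1 mount helper, $2 cgroup directory, $3 fstab, $4 /proc/cgroups
cgroup_lite_blocker() {
    clb_helper=${1:-/bin/cgroups-mount}
    clb_dir=${2:-/sys/fs/cgroup}
    clb_fstab=${3:-/etc/fstab}
    clb_proc=${4:-/proc/cgroups}
    if [ ! -e "$clb_helper" ]; then
        echo "job skipped: $clb_helper does not exist"; return 1
    fi
    if [ ! -d "$clb_dir" ]; then
        echo "job skipped: $clb_dir does not exist"; return 1
    fi
    if fstab_mentions_cgroup "$clb_fstab"; then
        echo "helper exits: cgroup entry in $clb_fstab"; return 1
    fi
    if [ ! -e "$clb_proc" ]; then
        echo "helper exits: $clb_proc does not exist"; return 1
    fi
    echo "ok"
}
```

Run without arguments, `cgroup_lite_blocker` uses the real paths. A result of `ok` would point away from the fstab theory and toward the event not being emitted.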

Changed in canonical-devices-system-image:
assignee: nobody → Ondrej Kubik (w-ondra)
Jean-Baptiste Lallement (jibel) wrote :

 /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
debug 0 1 1
cpu 0 1 1
cpuacct 0 1 1
freezer 0 1 1

/proc/self/cgroup is empty

There is no /var/log/upstart/cgroup-lite.log file and I verified that the upstart job cgroup-lite starts.
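
For reading the /proc/cgroups dump above: in cgroup v1 the second column is the hierarchy ID, and 0 means the controller is not attached to any mounted hierarchy. A small sketch that lists such controllers (the function name is hypothetical; the optional argument is a saved copy of the file):

```shell
# List enabled controllers whose hierarchy ID is 0, i.e. controllers that
# are not attached to any mounted cgroup v1 hierarchy.
# Columns of /proc/cgroups: subsys_name hierarchy num_cgroups enabled
unattached_controllers() {
    awk '!/^#/ && $2 == 0 && $4 == 1 { print $1 }' "${1:-/proc/cgroups}"
}
```

On the dump above this would list all four controllers (debug, cpu, cpuacct, freezer), which matches the absence of cgroup mounts.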

Jean-Baptiste Lallement (jibel) wrote :

And the content of fstab
/system/etc # cat fstab
# override the forced fsck from /lib/init/fstab, we use a bindmount which confuses mountall
/dev/root / rootfs defaults 0 0

# swap file
/SWAP.swap none swap sw 0 0

Stéphane Graber (stgraber) wrote :

That is very weird. In your case the cgroup-lite upstart job should result in 4 cgroup mounts, so I'm not sure why that's not happening here...

Could you run "bash -x /bin/cgroups-mount" as root and post its output, along with a dump of /proc/self/mountinfo from before and after running it? That should help us figure out exactly why it didn't mount those cgroups...
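
One way to collect what is asked for here in a single step is a small wrapper that snapshots mountinfo before and after running a command. A sketch only; the function name, the output file names and the MOUNTINFO override variable are all hypothetical:

```shell
# Usage: snapshot_around <outdir> <command> [args...]
# Saves mountinfo before and after the command, the command's combined
# stdout/stderr, and a diff of the two snapshots. MOUNTINFO can be set to
# read a different file than /proc/self/mountinfo.
snapshot_around() {
    sa_out=$1
    shift
    sa_src=${MOUNTINFO:-/proc/self/mountinfo}
    mkdir -p "$sa_out" || return 1
    cat "$sa_src" > "$sa_out/mountinfo.before"
    "$@" > "$sa_out/trace.log" 2>&1
    sa_rc=$?
    cat "$sa_src" > "$sa_out/mountinfo.after"
    # diff exits non-zero when the snapshots differ, which is the
    # interesting case here, so its status is deliberately ignored.
    diff "$sa_out/mountinfo.before" "$sa_out/mountinfo.after" \
        > "$sa_out/mountinfo.diff" || true
    return "$sa_rc"
}
```

As root, `snapshot_around /tmp/cgroups-debug bash -x /bin/cgroups-mount` would leave the trace and both dumps in /tmp/cgroups-debug, ready to attach to the bug.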

Ondrej Kubik (ondrak) wrote :

update from running system:
$ cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
debug 3 1 1
cpu 1 1 1
cpuacct 2 1 1
freezer 4 1 1

$ cat /proc/self/cgroup
5:name=systemd:/
4:freezer:/
3:debug:/
2:cpuacct:/
1:cpu:/

Ondrej Kubik (ondrak) wrote :

The lxc container is now running. The system still does not boot, but the container is no longer the issue.
Changes needed:
install
https://launchpad.net/ubuntu/+source/cgroup-lite/1.10/+build/6001591/+files/cgroup-lite_1.10_all.deb
and modify /etc/init/lxc-android-config.conf to this:
http://paste.ubuntu.com/15261714/

Ondrej Kubik (ondrak) wrote :

More debugging: the android signal is still not emitted, though the android init process is running.
Will keep debugging to get closer to the issue.
We are progressing....

Stéphane Graber (stgraber) wrote :

I have a debdiff of the changes so far ready to upload, but I'll wait until we figure out the rest of this issue.

Serge Hallyn is also working on cgroup-lite changes that should let us get rid of most of that diff.

Ondrej Kubik (ondrak) wrote :

The missing android signal was caused by double mounting the cpu cgroup, which was already mounted. Fixing that took the boot further: lightdm now comes up, but the boot still fails and kills adb in the process...
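
A guard against the kind of double mount described above could look like the following. This is a sketch, not the change that went into the init script; the function name is hypothetical:

```shell
# Succeeds if the given directory appears as a mount point (second field)
# in the mounts table. The optional second argument is a saved copy of
# /proc/mounts.
is_mount_point() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}
```

A script would then mount only when needed, for example `is_mount_point /sys/fs/cgroup/cpu || mount -t cgroup -o cpu cgroup /sys/fs/cgroup/cpu` (the mount command is given for illustration and is not taken from the actual job).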

Stéphane Graber (stgraber) wrote :

That part of the init script looks just plain wrong to me... I'm guessing the intent was for /dev/cpuctl to be a bind-mount of /sys/fs/cgroup but that's not at all what the code does.

I'll update my local copy here to replace that with a simple symlink from /dev/cpuctl to /sys/fs/cgroup/cpu.
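
The symlink approach mentioned above might be sketched as follows. The function name is hypothetical, and the handling of a leftover empty directory at the link path is an assumption rather than something taken from the updated script:

```shell
# Point a link path (default /dev/cpuctl) at the cpu cgroup directory
# (default /sys/fs/cgroup/cpu) instead of mounting the controller twice.
link_cpuctl() {
    lc_target=${1:-/sys/fs/cgroup/cpu}
    lc_link=${2:-/dev/cpuctl}
    # A real (non-symlink) directory would make ln create the link inside
    # it, so remove it first. rmdir only succeeds if it is empty.
    if [ -d "$lc_link" ] && [ ! -L "$lc_link" ]; then
        rmdir "$lc_link" || return 1
    fi
    # -n treats an existing symlink to a directory as a plain file, so
    # running this twice replaces the link instead of nesting a new one.
    ln -sfn "$lc_target" "$lc_link"
}
```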

Stéphane Graber (stgraber) wrote :

Updated init script: http://paste.ubuntu.com/15264484/

kevin gunn (kgunn72) on 2016-03-17
Changed in lxc (Ubuntu):
assignee: nobody → Ondrej Kubik (w-ondra)
Changed in canonical-devices-system-image:
assignee: Ondrej Kubik (w-ondra) → John McAleely (john.mcaleely)
Changed in lxc (Ubuntu):
importance: Undecided → Critical
Changed in canonical-devices-system-image:
milestone: ww08-2016 → backlog
Ondrej Kubik (ondrak) wrote :

Finally had some time to look into this further.
Using the new /etc/init/lxc-android-config.conf from Stéphane: http://paste.ubuntu.com/15264484/

When running it manually with $ lxc-start -n android -F -- /init
I can see the init process being created, but then it exits. There are also no logs in syslog or /proc/kmsg.
So it looks like init is very unhappy inside the container and does not do much.

But because it actually tries to run, the following passes fine:
    lxc-wait -n android -s RUNNING -t 30
    containerpid="$(lxc-info -n android -p -H)"
    if [ -n "$containerpid" ]; then

and the boot tries to continue even though the container actually fails to boot.
Is there a way to get more logs from the running container? I tried this:
lxc-start -l 0 -n android -F -- /init
rm: cannot remove '/var/lib/lxc/android/rootfs/sbin/adbd': No such file or directory
run-parts: /var/lib/lxc/android/pre-start.d/10-no-adbd exited with return code 1
sed: can't read /var/lib/lxc/android/rootfs/init.manta.usb.rc: No such file or directory
cp: cannot stat '/var/lib/lxc/android/overrides/*': No such file or directory
mkfifo: cannot create fifo '/dev/socket/micshm': File exists
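
Since the lxc-wait check quoted above passes even when init exits shortly afterwards, one option is to re-check the container PID after a short grace period before letting the boot continue. A sketch only; the function name and the default grace period are hypothetical:

```shell
# Succeeds only if the process is still alive after the grace period.
# kill -0 sends no signal; it only checks that the PID exists and can be
# signalled by the caller.
pid_survives() {
    ps_pid=$1
    ps_grace=${2:-5}
    [ -n "$ps_pid" ] || return 1
    sleep "$ps_grace"
    kill -0 "$ps_pid" 2>/dev/null
}
```

In the job this could follow the existing `containerpid="$(lxc-info -n android -p -H)"` line, for example as `pid_survives "$containerpid" 5 || exit 1`, so that a container whose init dies right away is treated as a failure.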

Serge Hallyn (serge-hallyn) wrote :

Assuming this is running upstart (as it looks like), try adding the debug and verbose flags as shown in the upstart cookbook?

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in lxc (Ubuntu):
status: New → Confirmed
TarotChen (tarotchen) wrote :

The latest arale-devel image-r303 is still broken and fails to boot the phone.

Łukasz Zemczak (sil2100) wrote :

Can we get some traction on this? devel-proposed has been unbootable for too long. With xenial going into Final Freeze it would be nice to get this working around release-time. In touch we're not following the standard Ubuntu cycles, but it would be nice to have a bootable baseline when using xenial.

Ondrej Kubik (ondrak) wrote : Re: [Bug 1551150] Re: devel-proposed - android lxc container fails to start

So I used the new script for LXC.
Now I can see that the Android init process is started, but it fails quite soon,
without many logs indicating the actual issue.
This is just speculation, but is it perhaps still an issue with how /proc and /dev are
mounted into the container?
Here are all the logs I get from Android init:

Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.107370] init: cannot find '/system/etc/install-recovery.sh', disabling 'flash_recovery'
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.107523] init: cannot find '/system/bin/ubuntuappmanager.disabled', disabling 'ubuntuappmanager'
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.112345] binder: 1064:1064 transaction failed 29189, size 0-0
Jan 1 02:02:38 ubuntu-phablet dbus[862]: [system] Successfully activated service 'org.freedesktop.login1'
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.190172] pil pil4: gss: Failed to locate gss.mdt
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.214710] pil pil4: gss: Failed to locate gss.mdt
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.220051] pil pil4: gss: Failed to locate gss.mdt
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.231191] pil pil4: gss: Failed to locate gss.mdt
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.243003] init: using deprecated syntax for specifying property 'ro.serialno', use ${name} instead
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.243277] init: using deprecated syntax for specifying property 'ro.product.manufacturer', use ${name} instead
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.243552] init: using deprecated syntax for specifying property 'ro.product.model', use ${name} instead
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.250572] usbcore: registered new interface driver rmnet_usb
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.257042] rmnet usb ctrl Initialized.
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.347413] init: property 'sys.powerctl' doesn't exist while expanding '${sys.powerctl}'
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.347627] init: powerctl: cannot expand '${sys.powerctl}'
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.347840] init: property 'sys.sysctl.extra_free_kbytes' doesn't exist while expanding '${sys.sysctl.extra_free_kbytes}'
Jan 1 02:02:38 ubuntu-phablet kernel: [ 10.348023] init: cannot expand '${sys.sysctl.extra_free_kbytes}' while writing to '/proc/sys/vm/extra_free_kbytes'

On Fri, Apr 15, 2016 at 2:25 PM, Łukasz Zemczak <email address hidden>
wrote:

> Can we get some traction on this? devel-proposed has been unbootable for
> too long. With xenial going into Final Freeze it would be nice to get
> this working around release-time. In touch we're not following the
> standard Ubuntu cycles, but it would be nice to have a bootable baseline
> when using xenial.

Jean-Baptiste Lallement (jibel) wrote :

 On devel-proposed/mako r487 the device boots with:
- lxc reverted to 2.0.0~rc2-0ubuntu2
- lxc-android-config.conf from comment 21.

Changed in canonical-devices-system-image:
milestone: backlog → xenial
Łukasz Zemczak (sil2100) wrote :

As requested, the lxc and lxc-android-config packages have been re-reverted, so the lxc container once again fails to start. It would be good if we could get this properly fixed.

Stéphane Graber (stgraber) wrote :

Can this somehow be reproduced in a VM?

I don't have hardware (nor want any) and debugging over the bug tracker doesn't seem to be working very well :)

Stéphane Graber (stgraber) wrote :

So it's really unclear to me what the actual problem is, but I have a fix for the situation that I've tested on a mako here.

Stéphane Graber (stgraber) wrote :

Basically it looks like "something" doesn't like the cgroups being mounted on the phone. Immediately after udev starts, every process gets a SIGKILL and the phone goes down.

We do need the cgroups filesystems to be mounted for LXC to be happy though, but we don't need them visible after that (since cgmanager is running anyway).

So my workaround is to tweak lxc-android-config.conf a bit. First, it starts the container in the background, which gets us useful debug output in the upstart log should it fail again. Second, it shuffles the cgroup init code around a bit (based on what I pasted in this bug before). Third, it adds a bit of code which unmounts all the cgroup filesystems right before emitting the android event.

New lxc-android-config.conf is attached. Tested on clean mako using ubuntu-touch/devel-proposed channel.
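
The unmount step described above might look roughly like this. It is a sketch under stated assumptions and not the content of the attached lxc-android-config.conf: the function names and the UMOUNT override are hypothetical, and only mounts of filesystem type cgroup are touched.

```shell
# Print the mount points of all cgroup filesystems, most recently mounted
# first, so that later mounts are released before earlier ones. The
# optional argument is a saved copy of /proc/mounts.
cgroup_mount_points() {
    awk '$3 == "cgroup" { mp[n++] = $2 }
         END { for (i = n - 1; i >= 0; i--) print mp[i] }' \
        "${1:-/proc/mounts}"
}

# Unmount every cgroup filesystem. Set UMOUNT="echo umount" for a dry run
# that only prints what would be done.
unmount_cgroups() {
    for uc_mp in $(cgroup_mount_points "$1"); do
        ${UMOUNT:-umount} "$uc_mp" || return 1
    done
}
```

In the job, `unmount_cgroups` would be called once the container is up and right before the android event is emitted.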

Stéphane Graber (stgraber) wrote :
Changed in lxc (Ubuntu):
status: Confirmed → Invalid
Stéphane Graber (stgraber) wrote :

Marking the LXC task as Invalid since there's nothing inherently wrong in LXC.

Jean-Baptiste Lallement (jibel) wrote :

I confirm that krillin and mako running ubuntu-touch/staging (xenial) boot with lxc-android-config.conf provided by Stéphane. Thanks!

Changed in canonical-devices-system-image:
status: Confirmed → In Progress
Changed in lxc-android-config (Ubuntu):
status: New → In Progress
importance: Undecided → Critical
Changed in lxc-android-config (Ubuntu):
assignee: nobody → Ondrej Kubik (w-ondra)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc-android-config - 0.230+16.10.20160616-0ubuntu1

---------------
lxc-android-config (0.230+16.10.20160616-0ubuntu1) yakkety; urgency=medium

  [ Ondrej Kubik ]
  * Updating lxc-android-config to work with new lxc v2.0. This fixes
    boot of devel-proposed (Xenial) channel (LP: #1551150)

 -- Łukasz Zemczak <email address hidden> Thu, 16 Jun 2016 13:38:18 +0000

Changed in lxc-android-config (Ubuntu):
status: In Progress → Fix Released
Changed in canonical-devices-system-image:
status: In Progress → Fix Released