upstart can race with cgmanager when using remove-on-empty

Bug #1357252 reported by Timo Jyrinki on 2014-08-15
26
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Ubuntu Application Launcher
Invalid
Critical
Ted Gould
upstart
Undecided
James Hunt
cgmanager (Ubuntu)
Undecided
Unassigned
cgmanager (Ubuntu RTM)
Undecided
James Hunt
systemd-shim (Ubuntu)
Undecided
Stéphane Graber
upstart (Ubuntu)
Undecided
James Hunt
upstart (Ubuntu RTM)
Undecided
Unassigned

Bug Description

After the ubuntu-app-launch landing in #188 (http://people.canonical.com/~ogra/touch-image-stats/188.changes), UI Toolkit autopilot tests have started failing, see http://ci.ubuntu.com/smokeng/utopic/touch/mako/190:20140815:20140811.1/9676/ubuntuuitoolkit/ and the #188 results too. It's different tests each time, but a bit above 10 of them tend to fail on each full run.

Reproducable locally, where I got 11 such failures.

To reproduce:
[terminal 1]
1. flash/update device if needed
2. adb shell
3. apt install ubuntu-ui-toolkit-autopilot
4. powerd-cli display on bright # leave running, also unlock the screen so that lenses are shown
[terminal 2]
5. phablet-config autopilot --dbus-probe enable # wait until finishes
6. phablet-test-run ubuntuuitoolkit

A full test suite run will take around 0.5h and should produce a bunch of 'Application failed to start.':s.

Related branches

Ted Gould (ted) on 2014-08-15
Changed in ubuntu-app-launch:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Ted Gould (ted)
Ted Gould (ted) wrote :

I think this might be related to an error that Chipaca was seeing where Upstart wasn't able to make the cgroup for the task. Putting an Upstart task on this bug.

Ted Gould (ted) wrote :

The really odd thing to me as, even though there is a billion application log files, there's never one for the jobs that failed. Which means it never got to the point of trying to run.

James Hunt (jamesodhunt) wrote :

Currently, there is a minor bug whereby if upstart is running in debug mode and trying to start a job that uses the 'cgroup' stanza will dump cgroup debug details into the jobs log [1].

This could be useful here though since it should be possible to either boot the session init in debug mode and see what turns up in ~/.cache/upstart/*.log. I've attached a patch to enable debug mode at session init startup but you can always enable at any time using 'initctl log-priority debug'.

It would also be worth enabling debug mode for cgmanager by tweaking /etc/init/cgmanager.conf and setting 'cgmanager_opts="--debug"' in that file as shown (or simply creating /etc/init/cgmanager.override containing 'env cgmanager_opts="--debug"'.

[1] - MP to resolve this is: https://code.launchpad.net/~jamesodhunt/upstart/make-cgroups-quiet-in-debug-mode/+merge/230762

James Hunt (jamesodhunt) wrote :

I'm getting extra debug added to cgmanager but it would be useful if we had some logs as "application failed to start" is pretty high-level".

James Hunt (jamesodhunt) wrote :

@Timo - could cgmanager be run in debug mode in the autopilot environment? Then /var/log/upstart/cgmanager.log could be usefully added as a test artifact. Same logic for the session init such that ~/.xsession-errors could also be added.

James Hunt (jamesodhunt) wrote :

Created pull request on upstream cgmanager project to add further debug details in case of cgroup setup failure:

https://github.com/cgmanager/cgmanager/pull/15

Timo Jyrinki (timo-jyrinki) wrote :

I've asked psivaa to try out the configuration changes on a mako that produces dashboard results, to possibly capture more logs.

description: updated
description: updated
Timo Jyrinki (timo-jyrinki) wrote :

UITK tests give the failure most trustworthily (possibly because they are so wide in scope and launch things a lot), but this has also been seen elsewhere:

http://ci.ubuntu.com/smokeng/utopic/touch/mako/188:20140814:20140811.1/9648/music_app/1530204/
http://ci.ubuntu.com/smokeng/utopic/touch/mako/189:20140814.1:20140811.1/9674/messaging_app/1532527/
http://ci.ubuntu.com/smokeng/utopic/touch/mako/190:20140815:20140811.1/9676/messaging_app/1534759/

Similar to UITK, the test failing varies from run to run. Looking further into the history of music & messaging app logs, there were no such failures before #188.

tags: added: patch
tags: added: lt-blocker lt-category-visible lt-prio-high
Ted Gould (ted) wrote :

Marking as incomplete for UAL as it appears to be an Upstart bug. Will still watch though for updates :-)

Changed in ubuntu-app-launch:
status: Triaged → Incomplete
summary: - "Application failed to start." during autopilot tests after the newest
- ubuntu-app-launch landing
+ Upstart jobs specifying cgroup fail to start occasionally

The debug enabled logs for cgmanager (from /var/log/upstart/cgmanager.lg) is http://paste.ubuntu.com/8080145/plain/

James Hunt (jamesodhunt) on 2014-08-18
Changed in upstart:
assignee: nobody → James Hunt (jamesodhunt)
importance: Undecided → Critical
James Hunt (jamesodhunt) wrote :

@psivaa: thanks - please can you attach the full log as this seems to be just the error lines?

Stéphane Graber (stgraber) wrote :

On an affected system, can you please post the following:
 - cat /proc/mounts
 - cat /proc/cgroups
 - cat /proc/self/cgroup
 - cat /proc/$(pidof cgmanager)/cgroup
 - cat /proc/$(pidof cgmanager)/mounts
 - find /proc/$(pidof cgmanager)/root/run/cgmanager/

James Hunt (jamesodhunt) wrote :

I have so far been unable to recreate any cgroup failures at the job level: Upstart is correctly handling the requests and passing them to cgmanager resulting in job processes being contained as expected (even under heavy load).

I've just run 'phablet-test-run ubuntuuitoolkit' and this resulted in no 'Invalid path' cgmanager or upstart issues. Will try re-running...

Serge Hallyn (serge-hallyn) wrote :

In the logfile the cgroup name character string seems to always be cut off at 127. I tested cgmanager and cgproxy with creating and moving tasks into cgroups with longer names, but that succeeded.

Is it possible that upstart is cutting off the name at 127?

Serge Hallyn (serge-hallyn) wrote :

@James - sorry, didn't see comment #13 (it takes more than 2 mins to post a comment from here :)

So I would suspect UAL is truncating a path.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cgmanager (Ubuntu):
status: New → Confirmed

I reproduced this issue with 0.4+14.10.20140808.2.is.0.4+14.10.20140808.2-0ubuntu1

The output of the 'cat' commands are attached in _proc_1357252.out

When it happens the following lines are displayed in:
= unity8-dash.log=
Trying to open "application:///dialer-app.desktop"

= unity8.log =
qtmir.applications: ApplicationManager::onProcessFailed - appId= "dialer-app" duringStartup= true
ApplicationManager::onProcessFailed - upstart reports failure of application "dialer-app" that AppManager is not managing
qtmir.applications: ApplicationManager::onProcessStopped - appId= "dialer-app"
ApplicationManager::onProcessStopped reports stop of appId= "dialer-app" which AppMan is not managing, ignoring the event

= /var/log/upstart/cgmanager.log =
got creds pid 26429 (32011:32011)
cgmanager:do_create_main: pid 26429 (uid 32011 gid 32011) may not create under /run/cgmanager/fs/freezer

Timo Jyrinki (timo-jyrinki) wrote :
James Hunt (jamesodhunt) wrote :

The 'may not create under /run/cgmanager/fs/freezer' errors are caused by the client (session init) cgmanager request "falling through" may_access() in cgmanager.

Here's some debug first from upstart attempting to start a cgroup job by first creating the cgroup:

init:cgroup_create:1172: cgmanager_create_sync(controller='freezer', path='upstart/cgroup-ted-ted') returned -1

This operation failed. Details from the cgmanager server for the cgmanager_create_sync() request:

do_create_main: do_create_main:263: controller='freezer', cgroup='upstart/cgroup-ted-ted', ucred p=(pid=1426, uid=0, gid=0), ucred r=(pid=3786, uid=32011, gid=32011)
do_create_main: do_create_main:315: r.pid=3786, r.uid=32011, r.gid=32011, dirpath='/run/cgmanager/fs/freezer'
may_access: may_access:925: uid=32011, gid=32011, path='/run/cgmanager/fs/freezer', mode=2
may_access: may_access:933
may_access: may_access:939
may_access: may_access:951
may_access: may_access:963
may_access: may_access:975
may_access: may_access:980
may_access: may_access:985
may_access: may_access:990
cgmanager:do_create_main: pid 3786 (uid 32011 gid 32011) may not create under /run/cgmanager/fs/freezer

James Hunt (jamesodhunt) wrote :

Patch for serge that applies to lp:ubuntu/cgmanager and which was used to collect output in #20.

tags: removed: patch
tags: added: lt-whitelisted
James Hunt (jamesodhunt) wrote :

There is definitely an upstart bug here as I have a child that has failed to setup cgroups successfully and is spinning, attempting to notify the parent via JOB_PROCESS_ERROR_CGROUP_SETUP. The only reason it is spinning could be that the other end of the pipe is closed, but that shouldn't be possible.

This needs further investigation, but is a side-effect of the fact that cgroup setup is failing due to the cgmanger denial shown in #20.

tags: added: patch
Serge Hallyn (serge-hallyn) wrote :

Hi James,

regarding comment #20,

1. access is being denied for uid 32011 to get O_RDWR to /run/cgmanager/fs/freezer. I'd expect that to be denied.

2. In either case, could you add to your may_access debug statements a printout of the sb.st_uid, sb.st_gid, and st.mode?

Leo Arias (elopio) on 2014-08-20
tags: added: qa-daily-testing regression
James Hunt (jamesodhunt) wrote :

Hi Serge,

Attached are upstart session job output and cgmanager output showing "good" (desktop utopic in kvm) and "bad" (device) runs:

Note the following:

- The session init in both cases is making the call on cgmanager as a
  non-priv user in an identical way:

    cgmanager_create_sync(controller='freezer', path='upstart_cgroup')

- However, once that gets across the wire, we have differences:

The good run shows:

do_create_main:263: controller='freezer', cgroup='upstart_cgroup', ucred p=(pid=2948, uid=1000, gid=1000), ucred r=(pid=2948, uid=1000, gid=1000)
do_create_main:318: r.pid=2948, r.uid=1000, r.gid=1000, dirpath='/run/cgmanager/fs/freezer/user.slice/user-1000.slice/session-c2.scope'

The bad run shows:

do_create_main:263: controller='freezer', cgroup='upstart_cgroup', ucred p=(pid=638, uid=0, gid=0), ucred r=(pid=3554, uid=32011, gid=32011)
do_create_main:318: r.pid=3554, r.uid=32011, r.gid=32011, dirpath='/run/cgmanager/fs/freezer'

= Differences =

== Path ==

As you picked up previously, upstart never requested that cgmanager
create '/run/cgmanager/fs/freezer', but for some reason that's the
request that arrives. The correct path should be
'/run/cgmanager/fs/freezer/user.slice/user-1000.slice/session-c2.scope'

== uid ==

Bad run shows "uid=0" and "gid=0" - this looks incorrect.

= Possible culprits =

== cgproxy ==

This is running on the device along with cgmanager since the kernel
version is 3.4.67.

However, even when running in debug mode, the entire contents of
/var/log/upstart/cgproxy.log is:

--------------------
Connection from private client
Create: Client fd is: 8 (pid=3554, uid=32011, gid=32011)
cgmanager_create: returning -1; existed is -1
Disconnected from private client
--------------------

== Other possibilities ==

- kernel
- logind?

James Hunt (jamesodhunt) wrote :

The simplest way to see the problem (the following works as expected on a desktop system, creating the cgroup and resulting in no error output):

$ id
uid=32011(phablet) gid=32011(phablet) groups=32011(phablet),4(adm),5(tty),20(dialout),24(cdrom),27(sudo),30(dip),44(video),46(plugdev),114(autopilot),1001(radio),1002(bluetooth),1003(android_graphics),1004(android_input),1005(audio),1013(android_media),1015(sdcard_rw),1021(gps),2001(android_cache),3002(android_net3),3003(android_net),3004(android_net2),9997(android_nvram)
$ cgm create freezer freeze-me
call to cgmanager_create_sync failed: invalid request

And here's the cgmanager.log entry for the above request:

got creds pid 12565 (32011:32011)
do_create_main: do_create_main:263: controller='freezer', cgroup='freeze-me', ucred p=(pid=638, uid=0, gid=0), ucred r=(pid=12565, uid=32011, gid=32011)
do_create_main: do_create_main:273
do_create_main: do_create_main:279
do_create_main: do_create_main:288
do_create_main: do_create_main:318: r.pid=12565, r.uid=32011, r.gid=32011, dirpath='/run/cgmanager/fs/freezer'
may_access: may_access:925:uid=32011, gid=32011, path='/run/cgmanager/fs/freezer', mode=2
may_access: may_access:931: (O_RDONLY=0, O_RDWR=2, O_WRONLY=1)
may_access: may_access:944:sb.st_uid=0, sb.st_gid=0, sb.st_mode=16877 (0x41ed)
may_access: may_access:950
may_access: may_access:962
may_access: may_access:988
may_access: may_access:1014
may_access: may_access:1019
may_access: may_access:1024
may_access: may_access:1029
cgmanager.jodh:do_create_main: pid 12565 (uid 32011 gid 32011) may not create under /run/cgmanager/fs/freezer

James Hunt (jamesodhunt) on 2014-08-20
Changed in upstart:
status: New → In Progress

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1357252

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
James Hunt (jamesodhunt) on 2014-08-20
no longer affects: linux (Ubuntu)

We definitely have a race here. I can now reliably create either a broken system or a working system simply by delaying the start of cgmanager slightly:

- With the cgmanager startup delay, delay everything works correctly.

- With no cgmanager startup delay, the session init ends up in the root cgroups rather than the correct logind-generated
  "user.slice/user-$uid.slice/session-$session.scope".

tags: removed: lt-whitelisted
James Hunt (jamesodhunt) wrote :

This still needs further investigation bug I think what's happening is that cgmanager is racing with dbus. And since other jobs refer to cgmanager they end up starting too early, resulting in the possibility that the session init starts too early (before it's been cgroup-constrained).

If correct, he solution to this problem is simple - change cgmanagers 'start on' condition to:

    start on started dbus

Currently, cgmanager *may* start before D-Bus because the 'start on' condition is optimised to start cgmanager as soon as *its* requirements are met. However, this seems nonsensical in this scenario because although cgmanager is starting as early as it can, at the point it starts D-Bus may not be running so clients cannot actually make use of cgmanager yet!

Please could those affected by this problem test on their hardware to see if they can still recreate the problem after running the following:

$ sudo mount -oremount,rw /
$ echo "start on started dbus" | sudo tee -a /etc/init/cgmanager.override
$ sudo reboot

On a N4, I did (quoted from irc by jodh)

 'adb shell', su - phablet, create ~/.config/upstart/cgroup.conf
 containing http://paste.ubuntu.com/8108361/ and try 'start cgroup' and
 see what happens

it appears correct. I also added a 'start on started cgmanager', rebooted
the phone, then .cache/upstart/cgroup.log contained:

phablet@ubuntu-phablet:~$ cat .cache/upstart/cgroup.log
5:name=systemd:/user.slice/user-32011.slice/session-c2.scope
4:freezer:/user.slice/user-32011.slice/session-c2.scope/upstart_cgroup
3:debug:/user.slice/user-32011.slice/session-c2.scope
2:cpuacct:/user.slice/user-32011.slice/session-c2.scope
1:cpu:/user.slice/user-32011.slice/session-c2.scope

Also just to clarify some of the above comments.

cgmanager does offer a dbus API but it does that on a private socket, it DOESN'T use the dbus system bus or dbus-daemon in any way, so there's absolutely no reason why cgmanager should ever wait for dbus to be started.

James Hunt (jamesodhunt) wrote :

After a lot of testing, I think I've found the cause - when lightdm starts the session init, logind is supposed to setup the cgroups. But that isn't happening.

Normally, logind would do that by asking systemd, but in our case it talks to systemd-shim. sytemd-shim then asks cgmanager to create the cgroups. However, currently systemd-shim does not fail if cgmanager cannot be contacted. So I think the proper fix is to change systemd-shim to either block or fail hard if cgmanager is not contactable or unable to create the requested cgroups. It would also be beneficial to arrange for systemd-shim output to be captured since currently all stdout and stderr is discarded when it is run via /usr/share/dbus-1/system-services/org.freedesktop.systemd1.service.

A "quick fix" however is to change dbus's 'start on' condition to ensure it doesn't start until cgmanager is ready. The simplest fix being:

$ echo "start on cgmanager-ready" | sudo tee -a /etc/init/dbus.override

James Hunt (jamesodhunt) on 2014-08-22
Changed in upstart:
status: In Progress → Invalid
importance: Critical → Undecided
James Hunt (jamesodhunt) on 2014-08-22
Changed in systemd-shim (Ubuntu):
assignee: nobody → Stéphane Graber (stgraber)
summary: - Upstart jobs specifying cgroup fail to start occasionally
+ systemd-shim fails to handle cgmanager being unavailable

This bug was fixed in the package cgmanager - 0.30-1ubuntu1

---------------
cgmanager (0.30-1ubuntu1) utopic; urgency=medium

  * cgmanager.upstart: start on || starting dbus, to make sure dbus clients
    start after cgmanager is ready (LP: #1357252)
 -- Serge Hallyn <email address hidden> Tue, 26 Aug 2014 12:16:51 -0500

Changed in cgmanager (Ubuntu):
status: Confirmed → Fix Released
Martin Pitt (pitti) on 2014-08-27
Changed in systemd-shim (Ubuntu):
status: New → Invalid
Changed in ubuntu-app-launch:
status: Incomplete → Invalid
James Hunt (jamesodhunt) on 2014-08-29
summary: - systemd-shim fails to handle cgmanager being unavailable
+ upstart can race with cgmanager when using remove-on-empty
Changed in upstart:
status: Invalid → In Progress
James Hunt (jamesodhunt) wrote :

From @teds findings:

"The Upstart job creates the cgroup as it goes through each phase in the job, but it sets the cgroup to be destroyed when all of its PIDs are gone. When starting the second or later task (what ever that might be, seems to be post-stop a lot though) there is then a race between starting that task and the cgroup being destroyed if all of the PIDs in the group have exited with the main job. On entering the function to create the task the group is created, and if right then cgmanager realizes the PIDs are gone it deletes it before the new PID can be added."

The solution is to not mark created cgroups 'remove-on-empty' immediately after they are created, but instead to request cgmanager mops up the cgroups after the last job processes cgroups have been created.

Note that this approach works correctly for system jobs which specify the cgroup stanza. However, it is not currently reliable for session jobs since because logind does not clean up cgroups itself, systemd-shim arranges for this to happen on behalf of the users session by marking the logind-created cgroups 'remove-on-empty' and letting cgmanager clean up.... which brings us back to the problem which the upstart changes are designed to fix.

So, further works is required to resolve this problem by enhancing systemd-shim to implement 'abandoncgroup' and 'stopsession' so that logind works as expected.

James Hunt (jamesodhunt) wrote :

Bug 1363134 raised for the systemd-shim work.

James Hunt (jamesodhunt) on 2014-09-05
Changed in upstart:
status: In Progress → Fix Released
Changed in upstart (Ubuntu):
assignee: nobody → James Hunt (jamesodhunt)
status: New → Fix Released
Steve Langasek (vorlon) on 2014-09-06
Changed in cgmanager (Ubuntu RTM):
status: New → Triaged
assignee: nobody → James Hunt (jamesodhunt)
Changed in upstart (Ubuntu RTM):
status: New → Triaged
Changed in cgmanager (Ubuntu RTM):
status: Triaged → Fix Released
Steve Langasek (vorlon) wrote :

Note that these fixes need to get into the Ubuntu RTM distro as well. cgmanager 0.30-1ubuntu1 is already there (https://launchpad.net/ubuntu-rtm/+source/cgmanager), but upstart is currently at version 1.13.1-0ubuntu3 vs. 1.13.2-0ubuntu1 in utopic.

James, please get the necessary changes landed into ubuntu-rtm/14.09. If you have questions on how to do this, check with a member of the landing team (e.g. ogra).

Colin Watson (cjwatson) wrote :

Can we close the ubuntu-rtm/upstart task now? https://launchpad.net/ubuntu-rtm/+source/upstart shows 1.13.2~rtm-0ubuntu1 (a spurious version due to CI Train thinking it owns upstart upstream releases, but the effect should be the same).

Steve Langasek (vorlon) wrote :

Yes, closing.

Changed in upstart (Ubuntu RTM):
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers