Unprivileged LXC containers don't work under systemd

Bug #1346734 reported by Martin Pitt on 2014-07-22
This bug affects 3 people
Affects: systemd (Ubuntu)
Status: Fix Released
Importance: Medium
Assigned to: Martin Pitt
Milestone: ubuntu-14.12

Bug Description

With systemd 208, unprivileged containers stop working when running under systemd (they work fine under upstart with cgmanager). Quoting Stéphane Graber:

In this setup, things don't work nearly as well. On login I'm only
placed into the name=systemd cgroup and not in any of the others, which
means that unprivileged LXC isn't usable.

Martin suggested setting JoinControllers in /etc/systemd/system.conf but
upon closer inspection, this isn't at all what we want. This setting is
used to tell systemd which controllers to co-mount; by default it is
set to cpu,cpuacct (which caused the earlier cgmanager breakage).

Even though this option isn't helpful for what we want (i.e. setting the
list of cgroup controllers the first PID of a user session should be
added to), we should nonetheless set it to an empty string which should
instruct systemd not to co-mount any controller, therefore giving us a
more reliable behavior (identical to what we have in the upstart world
and unlikely to confuse lxc and other stuff doing direct cgroup access).

Additionally, we need to find an equivalent to our good old
"Controllers" logind.conf option, or re-introduce it or just patch
logind so that it will always join all the controllers (similar to what
the shim does).

== Actions ==
 * Update /etc/systemd/system.conf to set JoinControllers to an empty value.
 * Make it so new user sessions are joined to all the available
   controllers by doing one of the following:
   - Find the magic undocumented config variable
   - Re-introduce the "Controllers" option in logind.conf
   - Patch logind to have it always join all available controllers
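
A minimal sketch of how the first action could be checked on a given machine. The function name join_controllers_setting is hypothetical. It assumes JoinControllers is written as a plain "JoinControllers=value" line (in the [Manager] section of system.conf) and reports the last uncommented assignment. An empty printed value with exit status 0 means the option is explicitly set to the empty string.

```shell
# Hypothetical helper: print the value of the last uncommented JoinControllers=
# assignment in a system.conf-style file. Exits non-zero if there is none.
# $1: path to the config file (default: /etc/systemd/system.conf).
join_controllers_setting() {
    awk '/^[ \t]*JoinControllers=/ {
             # Keep everything after the first "=", which may be empty.
             value = substr($0, index($0, "=") + 1)
             found = 1
         }
         END { if (!found) exit 1; print value }' "${1:-/etc/systemd/system.conf}"
}
```

For example, "join_controllers_setting && echo explicitly-set" only prints "explicitly-set" when the option is present and uncommented, whatever its value.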

Martin Pitt (pitti) on 2014-07-22
tags: added: systemd-boot
Changed in systemd (Ubuntu):
status: New → Triaged
Martin Pitt (pitti) wrote :

I have an unprivileged container setup in my test VM now, and they continue to work with 208. However, LXC under systemd currently requires some work (bug 1312532 and bug 1350947), so this should land first so that system-level containers work under systemd. Then I'll look into the cgroups issue.

Stéphane, can I check this without LXC somehow? I think my session processes already are in all cgroups:

$ cat /proc/$$/cgroup
10:hugetlb:/
9:perf_event:/
8:blkio:/
7:net_cls,net_prio:/
6:freezer:/
5:devices:/
4:memory:/
3:cpu,cpuacct:/
2:cpuset:/
1:name=systemd:/user.slice/user-1000.slice/session-c2.scope

$ grep $$ /sys/fs/cgroup/*/cgroup.procs
/sys/fs/cgroup/blkio/cgroup.procs:2898
/sys/fs/cgroup/cpuacct/cgroup.procs:2898
/sys/fs/cgroup/cpu/cgroup.procs:2898
/sys/fs/cgroup/cpu,cpuacct/cgroup.procs:2898
/sys/fs/cgroup/cpuset/cgroup.procs:2898
/sys/fs/cgroup/devices/cgroup.procs:2898
/sys/fs/cgroup/freezer/cgroup.procs:2898
/sys/fs/cgroup/hugetlb/cgroup.procs:2898
/sys/fs/cgroup/memory/cgroup.procs:2898
/sys/fs/cgroup/net_cls/cgroup.procs:2898
/sys/fs/cgroup/net_cls,net_prio/cgroup.procs:2898
/sys/fs/cgroup/net_prio/cgroup.procs:2898
/sys/fs/cgroup/perf_event/cgroup.procs:2898

Or do I misunderstand this?
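
One way to read the output above: every controller except name=systemd reports the path "/", which is the root cgroup of that hierarchy, where any process sits by default. That is different from being placed in a per-session cgroup. A small sketch that lists those controllers follows; the function name root_cgroup_controllers is hypothetical, and it only assumes the "id:controllers:path" line format shown in /proc/$$/cgroup above.

```shell
# Hypothetical helper: print the controller lists for which a process is in
# the root cgroup ("/") of the hierarchy.
# $1: file in /proc/<pid>/cgroup format (default: /proc/self/cgroup).
root_cgroup_controllers() {
    # Each line reads "hierarchy-id:controller-list:cgroup-path".
    awk -F: '$3 == "/" { print $2 }' "${1:-/proc/self/cgroup}"
}
```

Run against the listing above, this would print every controller line except name=systemd, which suggests the session is not in a dedicated cgroup for any of them.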

Martin Pitt (pitti) wrote :
Changed in systemd (Ubuntu):
importance: Undecided → Medium
Martin Pitt (pitti) wrote :

For my own notes: No hints from upstream. My current theory is that the best place to hook this in would be service_spawn() in src/core/service.c: after a successful exec_spawn(), if the unit is a *.scope, also put it into all other cgroup controllers (cg_create() and cg_attach()).
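
The "create and attach in every controller" idea can be illustrated at the filesystem level. This is a sketch only, not systemd's code and not the eventual patch. The function name join_all_controllers is hypothetical, and it assumes a cgroup v1 layout with one directory per hierarchy, aliases such as cpu being symlinks, and the name=systemd hierarchy mounted at "systemd".

```shell
# Hypothetical helper: create <cgpath> in every controller hierarchy below
# <root> and move <pid> into it.
# $1: cgroup root (e.g. /sys/fs/cgroup), $2: cgroup path, $3: PID.
join_all_controllers() {
    root=$1 cgpath=$2 pid=$3
    for hier in "$root"/*; do
        # Only real hierarchy directories; skip symlinked aliases (assumption).
        if [ ! -d "$hier" ] || [ -L "$hier" ]; then
            continue
        fi
        # The name=systemd hierarchy is already managed by systemd itself.
        if [ "${hier##*/}" = systemd ]; then
            continue
        fi
        mkdir -p "$hier/$cgpath" || return 1
        # Writing a PID to cgroup.procs moves that process into the cgroup.
        echo "$pid" > "$hier/$cgpath/cgroup.procs" || return 1
    done
}
```

On a real system this needs root, for example: join_all_controllers /sys/fs/cgroup user.slice/user-1000.slice/session-1.scope "$$".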

Martin Pitt (pitti) on 2014-11-19
Changed in systemd (Ubuntu):
milestone: none → ubuntu-14.12
assignee: nobody → Martin Pitt (pitti)
Martin Pitt (pitti) wrote :

I created a per-user container "t1", and can confirm that it starts under upstart/cgmanager and doesn't under systemd. I now have a preliminary patch for putting the user slices into all cgroup controllers, plus some hand-crafted "chown ubuntu" for all the user-1000.slice cgroup directories so that they become writable (this part still needs to be added to the patch). I understand that this should now be sufficient:

ubuntu@ulxc$ cat /proc/$$/cgroup
10:devices:/user.slice/user-1000.slice
9:memory:/user.slice/user-1000.slice
8:cpuset:/
7:hugetlb:/user.slice/user-1000.slice
6:blkio:/user.slice/user-1000.slice
5:cpu,cpuacct:/user.slice/user-1000.slice
4:freezer:/user.slice/user-1000.slice
3:perf_event:/user.slice/user-1000.slice
2:net_cls,net_prio:/user.slice/user-1000.slice
1:name=systemd:/user.slice/user-1000.slice/session-1.scope

ubuntu@ulxc:~$ ls -ld /sys/fs/cgroup/*/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/blkio/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/cpuacct/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/cpu,cpuacct/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/cpuset/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/cpu/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/devices/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/freezer/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/hugetlb/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/memory/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/net_cls,net_prio/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/net_cls/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/net_prio/user.slice/user-1000.slice/
drwxr-xr-x 2 ubuntu root 0 Nov 26 10:41 /sys/fs/cgroup/perf_event/user.slice/user-1000.slice/
drwxr-xr-x 4 root root 0 Nov 26 10:33 /sys/fs/cgroup/systemd/user.slice/user-1000.slice/

I'm not sure why my login shell isn't in "cpuset"; I'll still debug that. But I chown'ed /sys/fs/cgroup/cpuset/ to "ubuntu" as well.
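
The hand-crafted chown step could look roughly like the following sketch. The function name delegate_user_slice is hypothetical. It changes only the slice directories themselves, as in the listing above, and leaves the systemd hierarchy alone because that one stays owned by root there.

```shell
# Hypothetical helper: give <user> ownership of the per-user slice directory
# in every controller hierarchy below <root>.
# $1: cgroup root, $2: slice path (e.g. user.slice/user-1000.slice), $3: user.
delegate_user_slice() {
    root=$1 slice=$2 user=$3
    for dir in "$root"/*/"$slice"; do
        # Skip hierarchies in which the slice does not exist.
        [ -d "$dir" ] || continue
        case $dir in
            # Leave the name=systemd hierarchy owned by root.
            "$root"/systemd/*) continue ;;
        esac
        chown "$user" "$dir" || return 1
    done
}
```

On a real system this would be run as root, for example: delegate_user_slice /sys/fs/cgroup user.slice/user-1000.slice ubuntu.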

But still lxc-start fails:

$ lxc-start -n t1 -F
lxc-start: cgfs.c: lxc_cgroupfs_create: 849 Could not set clone_children to 1 for cpuset hierarchy in parent cgroup.
lxc-start: cgfs.c: cgroup_rmdir: 207 Permission denied - cgroup_rmdir: failed to delete /sys/fs/cgroup/devices/user.slice/user-1000.slice
lxc-start: cgfs.c: cgroup_rmdir: 207 Permission denied - cgroup_rmdir: failed to delete /sys/fs/cgroup/memory/user.slice/user-1000.slice
lxc-start: cgfs.c: cgroup_rmdir: 207 Permission denied - cgroup_rmdir: failed to delete /sys/fs/cgroup/cpuset//user.slice/user-1000.slice
lxc-start: cgfs.c: cgroup_rmdir: 207 Permission denied - cgroup_rmdir: failed to delete /sys/fs/cgroup/cpuset//user.slice
lxc-start: cgfs.c: cgroup_rmdir: 207 Read-only file system - cgroup_rmdir: failed to delete /sys/fs/cgroup/cpuset/
lxc-start: cgfs.c: cgroup_rmdir: 207 Permission denied -...


no longer affects: lxc (Ubuntu)
Martin Pitt (pitti) wrote :

Ah, never mind; it wanted to write /sys/fs/cgroup/cpuset//cgroup.clone_children, which is probably an artifact of cpuset not being included in the "join all controllers" bits.

Martin Pitt (pitti) on 2014-11-26
Changed in systemd (Ubuntu):
status: Triaged → In Progress
Martin Pitt (pitti) wrote :

Got it working now, with the patch set on http://people.canonical.com/~pitti/tmp/systemd-unpriv-lxc/

Martin Pitt (pitti) wrote :

The above patches are included in https://launchpad.net/ubuntu/+source/systemd/215-6ubuntu2, but they still don't work quite right: they seem to work well through VT logins and ssh, but not through lightdm. There's a race condition somewhere which removes PIDs from the session cgroup controllers again and moves them back to either / or /user.slice.

Martin Pitt (pitti) wrote :
Changed in systemd (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 217-2ubuntu1

---------------
systemd (217-2ubuntu1) vivid; urgency=medium

  * Merge with Debian unstable. See 217-1ubuntu1 for remaining Ubuntu changes.
  * Put session scopes into all cgroup controllers instead of their parent
    user slices. This works better with killing sessions and is consistent
    with the "systemd" controller.
  * Do not realize and migrate cgroups multiple times, in particular
    "-.slice". This fixes PIDs in non-systemd cgroup controllers to be
    randomly migrated back to /. (LP: #1346734)
  * boot-and-services autopkgtest: Give test apparmor job some time to
    actually finish.

systemd (217-2) experimental; urgency=medium

  * Re-enable journal forwarding to syslog, until Debian's sysloggers
    can/do all read from the journal directly.
  * Fix hostnamectl exit code on success.
  * Fix "diff failed with error code 1" spew with systemd-delta.
    (Closes: #771397)
  * Re-enable systemd-resolved. This wasn't meant to break the entire
    networkd, just disable the new NSS module. Remove that one manually
    instead. (Closes: #771423, LP: #1397361)
  * Import v217-stable patches (up to commit bfb4c47 from 2014-11-07).
  * Disable AppArmor again. This first requires moving libapparmor to /lib
    (see #771667). (Closes: #771652)
  * systemd.bug-script: Capture stderr of systemd-{delta,analyze}.
    (Closes: #771498)
 -- Martin Pitt <email address hidden> Mon, 01 Dec 2014 17:17:30 +0100

Changed in systemd (Ubuntu):
status: Fix Committed → Fix Released