systemd-sysusers cannot mount /dev in privileged containers (to pass credentials)

Bug #1950787 reported by Lukas Märdian
Affects: lxd (Ubuntu), systemd (Ubuntu)

Bug Description

systemd-sysusers.service fails to start in privileged containers because systemd (see systemd.exec) cannot mount /dev while setting up the credentials passed to the service. This is caused by the following config in the .service unit:
# Optionally, pick up a root password and shell for the root user from a
# credential passed to the service manager. This is useful for importing this
# data from nspawn's --set-credential= switch.

$ lxc profile set default security.privileged "true"
$ lxc launch ubuntu-daily:jammy test
$ lxc exec test bash
# add-apt-repository ppa:ci-train-ppa-service/4704
# apt install systemd # install systemd 249.5-2ubuntu1
# systemctl restart systemd-sysusers
# systemctl status systemd-sysusers
# systemctl --state=failed
$ lxc profile set default security.privileged "false"

A workaround is to disable the credential loading via a drop-in override:
$ cat /etc/systemd/system/systemd-sysusers.service.d/override.conf
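
The contents of the override file are not shown in this report. As a minimal sketch only, assuming the failure comes from LoadCredential= lines in the unit and that an empty LoadCredential= assignment resets that list, a drop-in could be written like this. The function name and its optional root prefix argument are hypothetical and exist only so the snippet can be tried outside a real system:

```shell
#!/bin/sh
# Sketch: write a drop-in that clears LoadCredential= for systemd-sysusers.service.
# Assumption: an empty LoadCredential= assignment resets the credential list.
# The optional first argument is a hypothetical root prefix (default: the real /).
write_sysusers_override() {
    root="${1:-}"
    dir="$root/etc/systemd/system/systemd-sysusers.service.d"
    mkdir -p "$dir"
    cat > "$dir/override.conf" <<'EOF'
[Service]
LoadCredential=
EOF
}

# Inside the container, as root:
#   write_sysusers_override
#   systemctl daemon-reload
#   systemctl restart systemd-sysusers
```

With the credential list empty, systemd should have no reason to set up the credentials mount for this unit, which is the step that fails in the logs.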

Interesting logs:
Nov 12 12:09:44 test systemd[1]: systemd-journald.service: Added fd 42 (n/a) to fd store.
Nov 12 12:09:44 test systemd[431]: Mounting /dev (MS_REC|MS_SLAVE "")...
Nov 12 12:09:44 test systemd[431]: Failed to mount n/a (type n/a) on /dev (MS_REC|MS_SLAVE ""): Permission denied
Nov 12 12:09:44 test systemd[430]: (sd-mkdcreds) failed with exit status 1.
Nov 12 12:09:44 test systemd[430]: systemd-sysusers.service: Failed to set up credentials: Protocol error
Nov 12 12:09:44 test systemd[430]: systemd-sysusers.service: Failed at step CREDENTIALS spawning

Lukas Märdian (slyon) wrote :
description: updated
Lukas Märdian (slyon) wrote :

This commit seems to be related: But why does it not work in privileged containers?

Stéphane Graber (stgraber) wrote :

Privileged containers have a much stricter AppArmor policy applied than unprivileged containers.
That's because unprivileged containers primarily rely on the user namespace to prevent breakout and takeover of the host, whereas privileged containers rely entirely on AppArmor.

As AppArmor isn't particularly good at dealing with mounts, especially with mount namespaces, there is no safe way for us to allow this operation in privileged containers.

As you point out above, we've recently started using a systemd generator to dynamically generate unit overrides based on the environment, letting us disable specific features that interfere with container security.

This is used in all of the community images, so in this case you could try it by using "images:ubuntu/jammy" instead of "ubuntu-daily:jammy". We've been considering getting the generator into the lxd-agent-loader package, which is included in all Ubuntu images, though so far we've found it to be too volatile for that (we were updating it up to twice a week for a while...).
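
To illustrate the generator idea, here is a rough sketch and not the actual LXD generator. It assumes that an identity UID mapping covering the full 32-bit range means no user namespace is in use, and that clearing LoadCredential= is the override wanted for systemd-sysusers.service. The function names and the drop-in file name are hypothetical:

```shell
#!/bin/sh
# Sketch of a systemd generator; NOT the generator shipped in the community images.
# systemd runs generators with three output directories; $1 is the normal-priority one.

# Hypothetical check: an identity mapping "0 0 4294967295" means no user
# namespace, which this sketch treats as "privileged". A real generator would
# also need to confirm it is running inside a container at all.
is_privileged() {
    uid_map="${1:-/proc/self/uid_map}"
    grep -Eq '^[[:space:]]*0[[:space:]]+0[[:space:]]+4294967295[[:space:]]*$' "$uid_map"
}

# Write a drop-in for systemd-sysusers.service, but only when privileged.
generate_override() {
    out_dir="$1"
    uid_map="${2:-/proc/self/uid_map}"
    is_privileged "$uid_map" || return 0
    dropin="$out_dir/systemd-sysusers.service.d"
    mkdir -p "$dropin"
    printf '[Service]\nLoadCredential=\n' > "$dropin/zzz-lxc.conf"
}

# When installed as a generator, the entry point would be:
#   generate_override "$1"
```

Because the drop-in is generated at boot from the detected environment, unprivileged containers keep the stock unit behaviour and only the privileged case is altered.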

Stéphane Graber (stgraber) wrote :

If this only fails in privileged containers, then I probably wouldn't worry about it too much. Those aren't the default and a LOT of things break in privileged containers, so I don't think it's worth doing distro changes to accommodate this, assuming the container otherwise still boots.

For cases like this one, it's usually been hard to make a solid case for a change of behavior in upstream systemd. There are a few places like the devices cgroup where permission errors are considered non-fatal which then accommodates containers quite well, but the same isn't true with the isolation security features which this one ties into.

In an ideal world, AppArmor would allow us to craft a policy which:
 - Allows for mount namespaces
 - Allows for bind-mounts of restricted paths
 - Applies the parent's policy onto the bind-mount target
 - Properly supports mount propagation flags in a way that can't be abused to allow all mounts

But as it stands, AppArmor is entirely path-based, so a policy that applies to /proc will not apply to /proc bind-mounted to /blah/proc (which is effectively what systemd does), and so all confinement becomes bypassable. Additionally, there are (or were in some versions at least) issues with processing those mount propagation flags you see in your log (shared/slave/...), and allowing a bind-mount to be marked using one of those flags would incorrectly cause the parser or the kernel (not quite sure which) to allow ALL mounts...
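
The path-based problem can be illustrated with a toy model. This is not AppArmor syntax and does not reproduce its matching engine. It only shows how a rule keyed on the path string stops matching once the same file is reachable under another path. The rule and the function names are hypothetical:

```shell
#!/bin/sh
# Toy model of a path-based deny rule. NOT real AppArmor policy or behaviour;
# it only demonstrates matching on the path string.
is_denied() {
    case "$1" in
        /proc/sysrq-trigger|/proc/sys/kernel/*) return 0 ;;  # hypothetical deny rule
        *) return 1 ;;
    esac
}

check() {
    if is_denied "$1"; then
        echo "denied:  $1"
    else
        echo "allowed: $1"
    fi
}

check /proc/sysrq-trigger        # the original path matches the rule
check /blah/proc/sysrq-trigger   # the same file via a bind mount at /blah/proc does not
```

The second path would refer to the same kernel file after a bind mount, yet the string-based rule no longer applies to it.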

Stéphane Graber (stgraber) wrote :

Closing the LXD task as there's not really anything we can do there.

The options here are pretty much:
 - Do nothing; if it's just privileged containers, it's usually not a big deal
 - Significantly rework AppArmor mount handling logic and policies so this can be safely allowed
 - Ship unit overrides, either through lxd-agent-loader, through a systemd patch or a similar distro mechanism

Closing the LXD task as there currently isn't any change we can make to our policies to safely allow this.

Changed in lxd (Ubuntu):
status: New → Invalid
Lukas Märdian (slyon) wrote :

I've implemented the workaround in systemd's debian/test/tests-in-lxd.

Changed in systemd (Ubuntu):
status: New → Fix Committed