Services (apparmor, snapd.seeded, ...?) fail to start in nested lxd container

Bug #1905493 reported by Ian Johnson
This bug affects 2 people
Affects           Status     Importance  Assigned to  Milestone
AppArmor          New        Undecided   Unassigned
autopkgtest       New        Undecided   Unassigned
cloud-init        Invalid    Undecided   Unassigned
snapd             Confirmed  Low         Unassigned
dbus (Ubuntu)     Confirmed  Undecided   Unassigned
lxd (Ubuntu)      New        Undecided   Unassigned
systemd (Ubuntu)  Invalid    Undecided   Unassigned

Bug Description

When booting a nested lxd container inside another lxd container (just a normal container, not a VM, i.e. just L2), running `cloud-init status --wait` just prints "." indefinitely and never returns.
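For reference, reproducing this looks roughly like the following (the container names and the ubuntu:20.04 image are illustrative assumptions, not taken from the report):

# on the host: first-level (L1) container, with nesting enabled so it can run LXD itself
$ lxc launch ubuntu:20.04 l1 -c security.nesting=true
# inside L1: initialise LXD and start the second-level (L2) container
$ lxc exec l1 -- lxd init --auto
$ lxc exec l1 -- lxc launch ubuntu:20.04 l2
# inside L2: this prints "." forever and never returns
$ lxc exec l1 -- lxc exec l2 -- cloud-init status --wait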

Revision history for this message
Dan Watkins (oddbloke) wrote :

Hi Ian,

I've just launched such a container and I see a bunch of non-cloud-init errors in the log. When I examine `systemctl list-jobs`, I see that the two running jobs are systemd-logind.service and snapd.seeded.service:

root@certain-cod:~# systemctl list-jobs
JOB UNIT TYPE STATE
114 cloud-final.service start waiting
125 snapd.autoimport.service start waiting
143 systemd-update-utmp-runlevel.service start waiting
116 cloud-config.service start waiting
1 graphical.target start waiting
691 systemd-logind.service start running
99 unattended-upgrades.service start waiting
110 cloud-init.target start waiting
115 snapd.seeded.service start running
2 multi-user.target start waiting

10 jobs listed.

Examining the journal, I see that systemd-logind.service is in a restart loop:

root@certain-cod:~# journalctl -u systemd-logind.service | grep Failed\ w
Dec 01 22:37:43 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:39:13 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:40:44 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:42:14 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:43:44 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:45:14 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:46:45 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:48:15 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:49:45 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.

This is blocking boot before cloud-init's later stages run, so cloud-init is correctly indicating that it hasn't yet completed; I'm marking this Invalid for cloud-init. I'll add a systemd task instead, as that looks to be the source of the issue.

Cheers,

Dan

Changed in cloud-init:
status: New → Invalid
Revision history for this message
Dan Streetman (ddstreet) wrote :

The systemd-logind problem is due to dbus defaulting to apparmor mode 'enabled', but apparmor can't do much of anything inside a container so it fails to start, and dbus can't contact it.

In the 2nd level container, create a file like '/etc/dbus-1/system.d/no-apparmor.conf' with content:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE busconfig PUBLIC
 "-//freedesktop//DTD D-BUS Bus Configuration 1.0//EN"
 "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
<busconfig>
  <apparmor mode="disabled"/>
</busconfig>

Then restart the 2nd-level container and recheck systemd-logind, which should now work.

Of course, a proper fix would be for dbus to be a bit smarter and only disable its use of apparmor when it's inside a container.
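As a sketch of what that could look like (purely illustrative; dbus does not do this today, and a smarter check would be whether apparmor is actually usable, since a later comment shows it does work in a first-level container), a boot-time script could write the override above only when it detects it is inside a container:

#!/bin/sh
# illustrative only: apply the comment #2 workaround, but only inside a container;
# systemd-detect-virt --quiet --container exits 0 when running in any container
if systemd-detect-virt --quiet --container; then
    cat > /etc/dbus-1/system.d/no-apparmor.conf <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE busconfig PUBLIC
 "-//freedesktop//DTD D-BUS Bus Configuration 1.0//EN"
 "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
<busconfig>
  <apparmor mode="disabled"/>
</busconfig>
EOF
fi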

However, cloud-init status --wait still hangs after systemd-logind starts up, so that wasn't the original problem (or at least wasn't the only problem).

Changed in systemd (Ubuntu):
status: New → Invalid
Changed in cloud-init:
status: Invalid → New
Revision history for this message
Dan Watkins (oddbloke) wrote :

Given that the logind issue is an AppArmor issue and, per my previous comment, "the two running jobs are systemd-logind.service and snapd.seeded.service", I suspect that we'll find that snapd is running into similar sorts of issues. I'll take a quick look now.

Revision history for this message
Ian Johnson (anonymouse67) wrote :

FWIW, I know what the snapd issue is: snapd does not and will not work in a nested LXD container. We need to add code to make snapd.seeded.service die/exit gracefully in this situation.
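Until snapd grows such a check, a blunt local workaround inside the L2 container (a suggestion only, not something snapd ships) is to keep the seeding unit from holding up boot:

# inside the L2 container
$ systemctl mask snapd.seeded.service
$ reboot
# after the reboot, units ordered after snapd.seeded.service should no longer wait on it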

Changed in snapd:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
Dan Watkins (oddbloke) wrote :

Yep, that's what I've found; cloud-init is just waiting for its later stages to run, which are blocked on snapd.seeded.service exiting.
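For anyone who wants to confirm that ordering from inside the affected container, it's visible with systemctl (diagnostic only; the exact After= list depends on the image):

# does cloud-final.service order itself after snapd.seeded.service?
$ systemctl show -p After cloud-final.service | tr ' ' '\n' | grep snapd
# and is that unit the one still "running" in the job list?
$ systemctl list-jobs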

Changed in cloud-init:
status: New → Invalid
Revision history for this message
Dan Streetman (ddstreet) wrote :

It's interesting that apparmor appears to work OK in the first-level container but fails in the nested container, e.g.:

$ lxc shell lp1905493-f
root@lp1905493-f:~# systemctl status apparmor
● apparmor.service - Load AppArmor profiles
     Loaded: loaded (/lib/systemd/system/apparmor.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2021-03-17 18:17:44 UTC; 2h 53min ago
       Docs: man:apparmor(7)
             https://gitlab.com/apparmor/apparmor/wikis/home/
    Process: 118 ExecStart=/lib/apparmor/apparmor.systemd reload (code=exited, status=0/SUCCESS)
   Main PID: 118 (code=exited, status=0/SUCCESS)

Mar 17 18:17:44 lp1905493-f systemd[1]: Starting Load AppArmor profiles...
Mar 17 18:17:44 lp1905493-f apparmor.systemd[118]: Restarting AppArmor
Mar 17 18:17:44 lp1905493-f apparmor.systemd[118]: Reloading AppArmor profiles
Mar 17 18:17:44 lp1905493-f apparmor.systemd[129]: Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
Mar 17 18:17:44 lp1905493-f systemd[1]: Finished Load AppArmor profiles.
root@lp1905493-f:~# lxc shell layer2
root@layer2:~# systemctl status apparmor
● apparmor.service - Load AppArmor profiles
     Loaded: loaded (/lib/systemd/system/apparmor.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2021-03-17 18:40:16 UTC; 2h 31min ago
       Docs: man:apparmor(7)
             https://gitlab.com/apparmor/apparmor/wikis/home/
   Main PID: 105 (code=exited, status=1/FAILURE)

Mar 17 18:40:15 layer2 apparmor.systemd[147]: /sbin/apparmor_parser: Unable to replace "nvidia_modprobe". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:15 layer2 apparmor.systemd[157]: /sbin/apparmor_parser: Unable to replace "/usr/bin/man". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:15 layer2 apparmor.systemd[164]: /sbin/apparmor_parser: Unable to replace "/usr/sbin/tcpdump". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[150]: /sbin/apparmor_parser: Unable to replace "/usr/lib/NetworkManager/nm-dhcp-client.action". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[161]: /sbin/apparmor_parser: Unable to replace "mount-namespace-capture-helper". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[161]: /sbin/apparmor_parser: Unable to replace "/usr/lib/snapd/snap-confine". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[105]: Error: At least one profile failed to load
Mar 17 18:40:16 layer2 systemd[1]: apparmor.service: Main process exited, code=exited, status=1/FAILURE
Mar 17 18:40:16 layer2 systemd[1]: apparmor.service: Failed with result 'exit-code'.
Mar 17 18:40:16 layer2 systemd[1]: Failed to start Load AppArmor profiles.
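A quick way to compare the two layers (a diagnostic sketch, not taken from the logs above) is to look at how the current task is confined; the "attempted to load a profile while confined?" errors suggest processes in the nested container are already running under an lxd-generated profile:

# run in both the L1 and L2 containers and compare
$ aa-status                      # summary of loaded profiles (needs the apparmor package installed)
$ cat /proc/self/attr/current    # shows which profile, if any, confines the current task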

Revision history for this message
Dan Streetman (ddstreet) wrote :

I wonder if this is actually a problem with the specific apparmor profile that's created by lxd; maybe it doesn't provide enough permissions to allow the container's lxd to correctly pass the apparmor profile down to the nested container. That would be similar to how lxd locks down containers a bit too tightly by default and requires enabling 'security.nesting' just to be able to create a nested container.
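For context, that knob is set on the first-level container from the host, e.g. (container name taken from the output above; a restart makes sure it takes effect):

$ lxc config set lp1905493-f security.nesting true
$ lxc restart lp1905493-f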

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in dbus (Ubuntu):
status: New → Confirmed
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Due to a ping on IRC I wanted to summarize the situation here, as it seems this still affects people.

In nested LXD containers we seem to have multiple issues:
- apparmor service failing to start (might need to work with LXD to sort out why and how to fix it)
  - if it doesn't work, it should at least fail to start more gracefully (see the drop-in sketch after this list)
  - comment 2 has a workaround to make dbus not insist on apparmor, but that is not a real fix we could generally apply

- snapd's snapd.seeded.service needs code to die/exit gracefully in this situation (as it won't work)
  - see comment #7; this might have changed since then, but it's worth a revisit
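As a purely illustrative sketch of the "fail more gracefully" point above (a local drop-in for discussion, not something the apparmor package ships), systemd could be told to treat the parser's exit status 1 as success, so an L2 container isn't left with a permanently failed apparmor.service:

# /etc/systemd/system/apparmor.service.d/nested-container.conf (hypothetical path)
[Service]
# apparmor.systemd exits 1 when profiles fail to load; count that as success
# so the unit is not left in the 'failed' state inside the nested container
SuccessExitStatus=1

followed by a `systemctl daemon-reload`.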

summary: - cloud-init status --wait hangs indefinitely in a nested lxd container
+ Services (apparmor, snapd.seeded, ...?) fail to start in nested lxd
+ container
Revision history for this message
Jose Manuel Santamaria Lema (panfaust) wrote (last edit ):

Hi there,

thanks for the update. Just in case anyone else here is interested in a temporary workaround, this is what I did for my use case:
- create a config file for dbus like the one mentioned in comment #2
- apt remove apparmor
- reboot

After that "runlevel", "systemctl is-system-running" and "cloud-init -status --wait" should work.

Last but not least, I would like to mention that this issue affects running autopkgtests in nested containers (autopkgtest uses "runlevel" to detect whether the container has started properly). I had an interesting conversation about this here (thanks a lot ddstreet for the help!):
https://irclogs.ubuntu.com/2022/06/13/%23ubuntu-devel.html#t15:05

Revision history for this message
James Falcon (falcojr) wrote :