cloud-init status --wait hangs indefinitely in a nested lxd container

Bug #1905493 reported by Ian Johnson on 2020-11-25
Affects            Importance   Assigned to
AppArmor           Undecided    Unassigned
cloud-init         Undecided    Unassigned
snapd              Low          Unassigned
dbus (Ubuntu)      Undecided    Unassigned
systemd (Ubuntu)   Undecided    Unassigned

Bug Description

When booting a nested lxd container inside another lxd container (just a normal container, not a VM, i.e. just L2) and running cloud-init status --wait, the "." is printed indefinitely and the command never returns.
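
Until the cause of the hang is resolved, a caller can at least avoid blocking forever by putting a limit on the wait. This is only a minimal sketch, assuming the coreutils `timeout` command is available; the helper name `bounded_wait` and the 300 second limit are illustrative and not part of cloud-init:

```shell
# Hypothetical helper: run a command, but give up after LIMIT seconds.
bounded_wait() {
    limit="$1"
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    # coreutils timeout exits with status 124 when the limit is reached.
    if [ "$rc" -eq 124 ]; then
        echo "gave up after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Example (the limit is an arbitrary choice):
#   bounded_wait 300 cloud-init status --wait
```

An exit status of 124 then tells the caller that the wait was cut short, as opposed to cloud-init reporting a result of its own.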

Dan Watkins (oddbloke) wrote :

Hi Ian,

I've just launched such a container and I see a bunch of non-cloud-init errors in the log and when I examine `systemctl list-jobs`, I see that the two running jobs are systemd-logind.service and snapd.seeded.service:

root@certain-cod:~# systemctl list-jobs
JOB UNIT                                 TYPE  STATE
114 cloud-final.service                  start waiting
125 snapd.autoimport.service             start waiting
143 systemd-update-utmp-runlevel.service start waiting
116 cloud-config.service                 start waiting
  1 graphical.target                     start waiting
691 systemd-logind.service               start running
 99 unattended-upgrades.service          start waiting
110 cloud-init.target                    start waiting
115 snapd.seeded.service                 start running
  2 multi-user.target                    start waiting

10 jobs listed.
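
For a longer job list, the running jobs can be picked out with a small filter. This is a rough sketch that assumes the four-column layout shown above (JOB, UNIT, TYPE, STATE); the helper name `running_jobs` is made up for illustration:

```shell
# Hypothetical helper: print the units whose STATE column is "running".
# Expects `systemctl list-jobs` output on stdin; the header line and the
# "N jobs listed." trailer are skipped because their 4th field is not
# "running".
running_jobs() {
    awk '$4 == "running" { print $2 }'
}

# Example:
#   systemctl list-jobs --no-legend | running_jobs
```

Fed the listing above, this prints systemd-logind.service and snapd.seeded.service, the two jobs everything else is waiting on.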

Examining the journal, I see that systemd-logind.service is in a restart loop:

root@certain-cod:~# journalctl -u systemd-logind.service | grep Failed\ w
Dec 01 22:37:43 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:39:13 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:40:44 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:42:14 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:43:44 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:45:14 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:46:45 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:48:15 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:49:45 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.

This is blocking boot before cloud-init's later stages run, so cloud-init is correctly indicating that it hasn't yet completed. I'm marking this Invalid for cloud-init and adding a systemd task instead, as that looks to be the source of the issue.

Cheers,

Dan

Changed in cloud-init:
status: New → Invalid
Dan Streetman (ddstreet) wrote :

The systemd-logind problem is due to dbus defaulting to apparmor mode 'enabled', but apparmor can't do much of anything inside a container so it fails to start, and dbus can't contact it.

In the 2nd level container, create a file like '/etc/dbus-1/system.d/no-apparmor.conf' with content:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE busconfig PUBLIC
 "-//freedesktop//DTD D-BUS Bus Configuration 1.0//EN"
 "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
<busconfig>
  <apparmor mode="disabled"/>
</busconfig>

Then restart the 2nd level container and recheck systemd-logind, which should now work.
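
The workaround can also be scripted. The following is a sketch only: the function name `write_dbus_no_apparmor` and its optional directory argument are illustrative additions, while the file name and XML content are the ones given above:

```shell
# Hypothetical helper: write the override into DIR
# (default: /etc/dbus-1/system.d, the location given above).
write_dbus_no_apparmor() {
    dir="${1:-/etc/dbus-1/system.d}"
    mkdir -p "$dir" || return 1
    cat > "$dir/no-apparmor.conf" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE busconfig PUBLIC
 "-//freedesktop//DTD D-BUS Bus Configuration 1.0//EN"
 "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
<busconfig>
  <apparmor mode="disabled"/>
</busconfig>
EOF
}

# Example, run as root inside the 2nd level container:
#   write_dbus_no_apparmor
# After restarting the container, check the result with:
#   systemctl is-active systemd-logind.service
```

The directory argument is there so the helper can be tried against a scratch directory before touching /etc.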

Of course, a proper fix in dbus should be a bit smarter than this and disable its use of apparmor only if it's inside a container.

However, cloud-init status --wait still hangs after systemd-logind starts up, so that wasn't the original problem (or at least wasn't the only problem).

Changed in systemd (Ubuntu):
status: New → Invalid
Changed in cloud-init:
status: Invalid → New
Dan Watkins (oddbloke) wrote :

Given that the logind issue is an AppArmor issue and, per my previous comment, "the two running jobs are systemd-logind.service and snapd.seeded.service", I suspect that we'll find that snapd is running into similar sorts of issues. I'll take a quick look now.

Ian Johnson (anonymouse67) wrote :

FWIW I know what the snapd issue is: snapd does not and will not work in a nested LXD container, so we need to add code to make snapd.seeded.service die/exit gracefully in this situation.

Changed in snapd:
status: New → Confirmed
importance: Undecided → Low
Dan Watkins (oddbloke) wrote :

Yep, that's what I've found; cloud-init is just waiting for its later stages to run, which are blocked waiting for snapd.seeded.service to exit.

Changed in cloud-init:
status: New → Invalid
Dan Streetman (ddstreet) wrote :

It's interesting that apparmor appears to work ok in the first-level container, but fails in the nested container, e.g.:

$ lxc shell lp1905493-f
root@lp1905493-f:~# systemctl status apparmor
● apparmor.service - Load AppArmor profiles
     Loaded: loaded (/lib/systemd/system/apparmor.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2021-03-17 18:17:44 UTC; 2h 53min ago
       Docs: man:apparmor(7)
             https://gitlab.com/apparmor/apparmor/wikis/home/
    Process: 118 ExecStart=/lib/apparmor/apparmor.systemd reload (code=exited, status=0/SUCCESS)
   Main PID: 118 (code=exited, status=0/SUCCESS)

Mar 17 18:17:44 lp1905493-f systemd[1]: Starting Load AppArmor profiles...
Mar 17 18:17:44 lp1905493-f apparmor.systemd[118]: Restarting AppArmor
Mar 17 18:17:44 lp1905493-f apparmor.systemd[118]: Reloading AppArmor profiles
Mar 17 18:17:44 lp1905493-f apparmor.systemd[129]: Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
Mar 17 18:17:44 lp1905493-f systemd[1]: Finished Load AppArmor profiles.
root@lp1905493-f:~# lxc shell layer2
root@layer2:~# systemctl status apparmor
● apparmor.service - Load AppArmor profiles
     Loaded: loaded (/lib/systemd/system/apparmor.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2021-03-17 18:40:16 UTC; 2h 31min ago
       Docs: man:apparmor(7)
             https://gitlab.com/apparmor/apparmor/wikis/home/
   Main PID: 105 (code=exited, status=1/FAILURE)

Mar 17 18:40:15 layer2 apparmor.systemd[147]: /sbin/apparmor_parser: Unable to replace "nvidia_modprobe". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:15 layer2 apparmor.systemd[157]: /sbin/apparmor_parser: Unable to replace "/usr/bin/man". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:15 layer2 apparmor.systemd[164]: /sbin/apparmor_parser: Unable to replace "/usr/sbin/tcpdump". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[150]: /sbin/apparmor_parser: Unable to replace "/usr/lib/NetworkManager/nm-dhcp-client.action". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[161]: /sbin/apparmor_parser: Unable to replace "mount-namespace-capture-helper". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[161]: /sbin/apparmor_parser: Unable to replace "/usr/lib/snapd/snap-confine". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[105]: Error: At least one profile failed to load
Mar 17 18:40:16 layer2 systemd[1]: apparmor.service: Main process exited, code=exited, status=1/FAILURE
Mar 17 18:40:16 layer2 systemd[1]: apparmor.service: Failed with result 'exit-code'.
Mar 17 18:40:16 layer2 systemd[1]: Failed to start Load AppArmor profiles.
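
To get a plain list of the profiles that were refused, the names can be pulled out of the log lines. This is an illustrative sketch that assumes the 'Unable to replace "NAME"' message format seen above; the helper name `failed_profiles` is made up:

```shell
# Hypothetical helper: extract the quoted profile name from each
# 'Unable to replace "NAME"' line read on stdin. Other lines are ignored.
failed_profiles() {
    sed -n 's/.*Unable to replace "\([^"]*\)".*/\1/p'
}

# Example, run inside the nested container:
#   journalctl -u apparmor.service --no-pager | failed_profiles
```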

Dan Streetman (ddstreet) wrote :

I wonder if this is actually a problem with the specific apparmor profile that's created by lxd; maybe it doesn't provide enough permissions to allow the container's lxd to correctly pass the apparmor profile down to the nested container. That would be similar to how lxd locks down containers a bit too tightly by default and requires enabling 'security.nesting' just to be able to create a nested container.
