Nested LXD install fails with snapd 2.42.4 (current stable core snap)

Bug #1855355 reported by Stéphane Graber
This bug affects 2 people
Affects    Status         Importance   Assigned to   Milestone
AppArmor   Confirmed      Undecided    Unassigned
snapd      Fix Released   Critical     Unassigned

Bug Description

The LXD daily cross-architecture testing failed today for all tracks, channels and architectures. The result is: https://jenkins.linuxcontainers.org/job/lxd-test-snap-architectures/914/

Unfortunately, those test instances test using the stable core snap, unlike our non-nested snapd enabled test runners which use the candidate core snap, so this wasn't detected until the new snapd hit stable. Also, the snapd test for lxd never attempts to install a nested lxd, so that didn't catch the issue either.

I tracked this down to upstream commit e7afbc34b1d630aeae4a7d20c34da75f4cb67546, specifically the addition of "deny unix," which, likely due to an apparmor parser or kernel bug, is causing the execution of apparmor_parser by "snap run" to fail with EACCES.

So far the only affected snap I could find is the LXD snap, but that does mean that anyone who uses the snap to install nested lxd is currently broken, with no obvious way to fix things: a "snap revert" of lxd won't do anything. You need to revert the core snap and make sure that snapd is restarted and the apparmor profile is loaded from the reverted core.

A reproducer is:
 - lxc launch ubuntu:18.04 c1 -c security.nesting=true
 - lxc exec c1 -- snap install lxd

The issue was confirmed by Jamie, where he noted:
"""
I downgraded to 8159 (2.42.2) and it does not have the unix rule. I then added 'deny unix,' to /var/lib/snapd/apparmor/snap-confine/foo, then did apparmor_parser -r /var/lib/snapd/apparmor/profiles/snap-confine*, then tried to install lxd and it failed.
"""

He also added:
"""
fyi, this is a more specific rule (ie, to address the thing that prompted the PR in the first place): deny unix (receive, send) type=stream addr=none peer=(addr=none), and it also causes lxd to fail
"""

We'd appreciate it if this particular rule could be removed from the affected apparmor profile and the stable snap updated with the fix ASAP, so as not to leave our users of nested LXD affected.

It's also unclear to me why only LXD is affected, but trying a small selection of similar snaps, I couldn't find another one which failed. I wonder if this is somehow tied to LXD's use of socket activation, though the socket isn't accessed at all in that part of the startup process, so that'd still be odd.

Jamie Strandboge (jdstrand) wrote :
Changed in snapd:
status: New → Confirmed
milestone: none → 2.43
Changed in snapd:
importance: Undecided → Critical
status: Confirmed → In Progress
Jamie Strandboge (jdstrand) wrote :

FYI, this is in 2.42.5, currently in candidate. I'm told it will be released to stable this week.

Changed in snapd:
status: In Progress → Fix Committed
Jamie Strandboge (jdstrand) wrote :

FYI, field saw this denial with 2.42.4:

[2405094.359882] audit: type=1400 audit(1575989789.008:11249): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-juju-cc7269-0-lxd-16_<var-lib-lxd>" profile="snap.prometheus-openstack-exporter.prometheus-openstack-exporter" name="/apparmor/.null" pid=1721084 comm="snap-exec" requested_mask="wr" denied_mask="wr" fsuid=100000 ouid=0

This was tracked down to be /run/systemd/journal/stdout being the open FD (a *named* socket) with the 'deny unix,' rule causing the file_inherit denial.

Upgrading to 2.42.5 (ie, with 'deny unix,' no longer present in the policy), things work again.
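
As a rough aid for triaging denials like the one quoted above, here is a minimal Python sketch that splits such an audit line into its key=value fields. The function name parse_apparmor_denial is hypothetical, and the sketch assumes only the key=value layout visible in the quoted log line, not any documented audit format.

```python
import re

# Matches key="quoted value" or key=bare_value pairs (assumed layout,
# taken from the denial line quoted in this bug).
_FIELD = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_apparmor_denial(line):
    """Return the key/value fields of an AppArmor audit line as a dict of strings."""
    fields = {}
    for key, quoted, bare in _FIELD.findall(line):
        # Exactly one of the two groups is filled; bare is empty when the
        # value was quoted.
        fields[key] = quoted if bare == "" else bare
    return fields
```

With the line above, the "operation", "profile", "name" and "denied_mask" entries are the ones that identify this particular failure (file_inherit on /apparmor/.null with "wr" denied).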

Pedro Guimarães (pguimaraes) wrote :

Hi, we've been seeing a similar situation with snap prometheus-openstack-exporter. This is deployed using a charm, which sets up a service: https://jaas.ai/prometheus-openstack-exporter/10

The service itself should manage a webserver. We were seeing two things: (1) we could telnet to the server, and even wget it, but we were getting empty responses; and (2) after stopping the service and running it manually, everything worked fine.

The conclusions of this issue can be found here: https://bugs.launchpad.net/prometheus-openstack-exporter-charm/+bug/1855865, including straces comparing runs with and without the service running.

Eventually, we figured out that writing to systemd's stdout was causing the issue. Adding an apparmor rule resolved it.

I ran a test with snap core 2.42.5 and it worked fine in my case. No need to set up rules by hand.

Jamie Strandboge (jdstrand) wrote :

https://bugs.launchpad.net/snapd/+bug/1856057 is another manifestation of this bug.

Jamie Strandboge (jdstrand) wrote :

Ok, I spent some time on this and came up with several sets of instructions to reproduce (and one that shows things work while not stacked):

* https://people.canonical.com/~jamie/bug1855355/REPRODUCER-lxd-orig.txt (this bug's original report)
* https://people.canonical.com/~jamie/bug1855355/REPRODUCER-lxd-no-nesting.txt (reproducer without nested lxd)
* https://people.canonical.com/~jamie/bug1855355/REPRODUCER-lxd-simplest.txt (reproducer without snapd in lxd)
* https://people.canonical.com/~jamie/bug1855355/REPRODUCER-works-without-lxd.txt (REPRODUCER-lxd-simplest.txt without lxd. No bug)

John, REPRODUCER-lxd-simplest.txt is probably the place to start, but it unfortunately is still a complicated setup that involves installing lxd in a vm, then configuring a systemd unit that runs a command pipeline similar to how snapd sets up a daemon unit (ie, unconfined thing runs a confined setuid launcher which aa_change_onexec()s to another thing).

I could not readily reproduce this without systemd and it should be noted that under systemd's ExecStart, /proc/self/fd/0 points to /dev/null, /proc/self/fd/1 points to a socket and /proc/self/fd/2 points to another socket. These sockets are presumably journald (though I didn't confirm it was /run/systemd/journal/stdout specifically).
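
To check what the standard file descriptors point to from inside a unit, something like the following Python sketch could be run as the ExecStart command. It is only an illustration: describe_fd is a hypothetical helper, and it assumes a Linux /proc filesystem is mounted.

```python
import os
import stat


def describe_fd(fd):
    """Return (link target, kind) for an open file descriptor of this process."""
    try:
        # On Linux, sockets show up here as "socket:[inode]".
        target = os.readlink(f"/proc/self/fd/{fd}")
        mode = os.fstat(fd).st_mode
    except OSError as exc:
        return (None, f"unavailable: {exc.strerror}")
    if stat.S_ISSOCK(mode):
        kind = "socket"
    elif stat.S_ISCHR(mode):
        kind = "character device"
    elif stat.S_ISFIFO(mode):
        kind = "pipe"
    elif stat.S_ISREG(mode):
        kind = "regular file"
    else:
        kind = "other"
    return (target, kind)


if __name__ == "__main__":
    for fd in (0, 1, 2):
        target, kind = describe_fd(fd)
        print(f"fd {fd}: {target} ({kind})")
```

Note that this only tells you that a descriptor is a socket; it does not by itself confirm which socket path (for example /run/systemd/journal/stdout) is on the other end.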

Also worth noting is that /run/systemd/journal/stdout is a *named* socket which is being affected by the 'deny unix,' rule.

The REPRODUCER-works-without-lxd.txt steps show there seems to be no bug when not using stacked profiles.

I wanted to distill this down further with just some stacked profiles, and maybe a small binary that calls clone() as needed to take lxd out of the equation, but ran out of time.

Changed in apparmor:
status: New → Confirmed
Michael Vogt (mvo)
Changed in snapd:
status: Fix Committed → Fix Released