apparmor uses excessive memory leading to oom kill

Bug #1830502 reported by Ian Johnson
32
This bug affects 6 people
Affects Status Importance Assigned to Milestone
AppArmor
Confirmed
Undecided
Unassigned
apparmor (Ubuntu)
Confirmed
Undecided
Unassigned

Bug Description

When attempting to load the profile from comment #7, apparmor uses excessive amounts of memory leading to being killed by the OOM killer and thus the apparmor.service failing.

Original bug description:

On Ubuntu 18.04.2 LTS Desktop, after running out of space on my disk, my system was unable to finish booting and I had to go into recovery mode and remove a number of files before the system would boot. After doing so I discovered that now the apparmor.service systemd unit always fails to start. I see this in dmesg:

[ 1066.975360] Out of memory: Kill process 6799 (apparmor_parser) score 796 or sacrifice child
[ 1066.975364] Killed process 6799 (apparmor_parser) total-vm:15057348kB, anon-rss:15046148kB, file-rss:0kB, shmem-rss:0kB
[ 1067.406595] oom_reaper: reaped process 6799 (apparmor_parser), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Whenever apparmor.service is attempted to be started by systemd, i.e. either on boot, or later with `systemctl start apparmor`.

The log from journalctl doesn't show any actual issues with any profiles just this:

-- Reboot --
May 25 17:00:58 systemd[1]: Starting AppArmor initialization...
May 25 17:00:58 apparmor[1521]: * Starting AppArmor profiles
May 25 17:00:58 apparmor[1521]: Skipping profile in /etc/apparmor.d/disable: usr.bin.firefox
May 25 17:00:58 apparmor[1521]: Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
May 25 17:01:40 apparmor[1521]: ...fail!
May 25 17:01:40 systemd[1]: apparmor.service: Main process exited, code=exited, status=123/n/a
May 25 17:01:40 systemd[1]: apparmor.service: Failed with result 'exit-code'.
May 25 17:01:40 systemd[1]: Failed to start AppArmor initialization.
May 25 17:04:53 systemd[1]: Starting AppArmor initialization...
May 25 17:04:53 apparmor[4747]: * Starting AppArmor profiles
May 25 17:04:53 apparmor[4747]: Skipping profile in /etc/apparmor.d/disable: usr.bin.firefox
May 25 17:04:53 apparmor[4747]: Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
May 25 17:05:25 apparmor[4747]: ...fail!
May 25 17:05:25 systemd[1]: apparmor.service: Main process exited, code=exited, status=123/n/a
May 25 17:05:25 systemd[1]: apparmor.service: Failed with result 'exit-code'.
May 25 17:05:25 systemd[1]: Failed to start AppArmor initialization.

I can see that apparmor profiles are active after doing this (using aa-status), but it's still troubling that apparmor runs into an issue without actually saying what the error is.

Revision history for this message
Ian Johnson (anonymouse67) wrote :
Download full text (5.4 KiB)

FWIW this could be a snapd bug, because while my system was unable to boot, I disabled all the snaps I had installed except the core snap, and then after being able to reboot I now re-enable all the snaps and see some warnings:

May 25 17:32:16 systemd[1]: Starting AppArmor initialization...
May 25 17:32:16 apparmor[21005]: * Starting AppArmor profiles
May 25 17:32:16 apparmor[21005]: Skipping profile in /etc/apparmor.d/disable: usr.bin.firefox
May 25 17:32:16 apparmor[21005]: Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
May 25 17:32:16 apparmor[21005]: Warning from /var/lib/snapd/apparmor/profiles/snap.lxd.benchmark (/var/lib/snapd/apparmor/profiles/snap.lxd.benchmark line 485): Unconfined exec qualifier (ux) allows some dangerous environment variables to be passed to the unconfined process; 'man 5 apparmor.d' for details.
May 25 17:32:16 apparmor[21005]: Warning from /var/lib/snapd/apparmor/profiles/snap.lxd.activate (/var/lib/snapd/apparmor/profiles/snap.lxd.activate line 485): Unconfined exec qualifier (ux) allows some dangerous environment variables to be passed to the unconfined process; 'man 5 apparmor.d' for details.
May 25 17:32:16 apparmor[21005]: Warning from /var/lib/snapd/apparmor/profiles/snap.lxd.buginfo (/var/lib/snapd/apparmor/profiles/snap.lxd.buginfo line 485): Unconfined exec qualifier (ux) allows some dangerous environment variables to be passed to the unconfined process; 'man 5 apparmor.d' for details.
May 25 17:32:16 apparmor[21005]: Warning from /var/lib/snapd/apparmor/profiles/snap.lxd.daemon (/var/lib/snapd/apparmor/profiles/snap.lxd.daemon line 533): Unconfined exec qualifier (ux) allows some dangerous environment variables to be passed to the unconfined process; 'man 5 apparmor.d' for details.
May 25 17:32:16 apparmor[21005]: Warning from /var/lib/snapd/apparmor/profiles/snap.lxd.check-kernel (/var/lib/snapd/apparmor/profiles/snap.lxd.check-kernel line 485): Unconfined exec qualifier (ux) allows some dangerous environment variables to be passed to the unconfined process; 'man 5 apparmor.d' for details.
May 25 17:32:16 apparmor[21005]: Warning from /var/lib/snapd/apparmor/profiles/snap.lxd.lxc (/var/lib/snapd/apparmor/profiles/snap.lxd.lxc line 485): Unconfined exec qualifier (ux) allows some dangerous environment variables to be passed to the unconfined process; 'man 5 apparmor.d' for details.
May 25 17:32:16 apparmor[21005]: Warning from /var/lib/snapd/apparmor/profiles/snap.lxd.migrate (/var/lib/snapd/apparmor/profiles/snap.lxd.migrate line 485): Unconfined exec qualifier (ux) allows some dangerous environment variables to be passed to the unconfined process; 'man 5 apparmor.d' for details.
May 25 17:32:16 apparmor[21005]: Warning from /var/lib/snapd/apparmor/profiles/snap.lxd.lxd (/var/lib/snapd/apparmor/profiles/snap.lxd.lxd line 485): Unconfined exec qualifier (ux) allows some dangerous environment variables to be passed to the unconfined process; 'man 5 apparmor.d' for details.
May 25 17:32:34 apparmor[21005]: Warning from /var/lib/snapd/apparmor/profiles/snap.lxd.activate (/var/lib/snapd/apparmor/profiles/snap.lxd.activate line 485): Unconfined exec qualifier (ux) allows some dangerous env...

Read more...

Revision history for this message
John Johansen (jjohansen) wrote :

No the warnings won't cause apparmor to fail, however the kernel killing the apparmor_parser will, and that won't report a parse error.

The Ubuntu apparmor.service calls the apparmor_parser once for each profile. Which means you are getting a some profiles loaded but not all of them.

Can you check your kernel messages/dmesg to see for what reason the kernel is killing the apparmor_parser.

Revision history for this message
Ian Johnson (anonymouse67) wrote :

Well I tried restarting AppArmor using `systemctl start apparmor` while running `dmesg -w -k` and got the following log: https://pastebin.ubuntu.com/p/98zXMsr6Sy/ I don't see a stack trace for apparmor itself, just for chrome and pulseaudio.

Is there anyway to have apparmor.service show what profiles it's trying to load to see what profile it might be trying to load when it is killed?

Revision history for this message
John Johansen (jjohansen) wrote :

I'm not aware of any way to get the apparmor.service to print out what profile it is working on without actually modifying the service

however your dmesg does show the reason for the failure, it looks like the apparmor_parser is being killed by the oom killer

[ 5986.338089] [13520] 0 13520 3056587 3053749 24539136 0 0 apparmor_parser
[ 5986.338090] Out of memory: Kill process 13520 (apparmor_parser) score 646 or sacrifice child
[ 5986.338095] Killed process 13520 (apparmor_parser) total-vm:12226348kB, anon-rss:12214996kB, file-rss:0kB, shmem-rss:0kB

we should be able to narrow down which profile is causing the problem by comparing the set of profiles being reported as loaded to those that are on the system.

We can then manually run the apparmor_parser to see which profile is using some much memory to compile

Revision history for this message
Ian Johnson (anonymouse67) wrote :

How would you recommend I go about checking which profiles are actually loaded and which profiles are reported as loaded? I have this from aa-status: https://pastebin.ubuntu.com/p/c2FbrndDzs/

Revision history for this message
John Johansen (jjohansen) wrote :

We can get a diff of loaded vs. expected profiles

for a straight list of loaded profiles names, you can do
  $ sudo cat /sys/kernel/security/apparmor/profiles
  /snap/core/6964/usr/lib/snapd/snap-confine (enforce)
  /snap/core/6964/usr/lib/snapd/snap-confine//mount-namespace-capture-helper (enforce)
  firefox (enforce)
  firefox//sanitized_helper (enforce)
  firefox//lsb_release (enforce)
  ...

we can then get a list of profile names from apparmor_parser without doing a compile using
  $ sudo apparmor_parser -N /etc/apparmor.d/ /var/lib/snapd/apparmor/profiles/
  udm-extractor
  ubuntu-printing-app
  /usr/sbin/tcpdump
  ...

so a quick and dirty script to get the diff
  $ sudo cat /sys/kernel/security/apparmor/profiles | awk '{ print $1 }' > /tmp/foo ; sudo apparmor_parser -N /etc/apparmor.d/ /var/lib/snapd/apparmor/profiles/ >> /tmp/foo ; sort /tmp/foo | uniq -c | grep -e ' 1 '

  Skipping profile in /etc/apparmor.d/disable: usr.lib.libreoffice.program.oosplash
  Ignoring: 'usr.bin.firefox~'
      1 /etc/apparmor.d/usr.bin.firefox
      1 libvirt-79eb4c35-23a7-44bb-8894-aa97ca616850
  ...

basically anything with that doesn't show up in both gets a count of 1.

We can further distinguish profiles that have been loaded based on time if we need to with
  $ ls -l /sys/kernel/security/apparmor/policy/profiles/
  total 0
  drwxr-xr-x 2 root root 0 May 21 23:16 content-hub-clipboard.1
  drwxr-xr-x 2 root root 0 May 21 23:16 content-hub-peer-picker.2
  drwxr-xr-x 2 root root 0 May 21 23:16 default.0
  drwxr-xr-x 2 root root 0 May 21 23:16 etc.apparmor.d.skype.6
  ...

and we can try to load any of the profiles we find that failed to load individually with
  $ apparmor_parser -r $profile

or if need be one by one via shell scripting (sadly the parser is missing a direct way to dump which profile is being worked on when it is processing multiple dirs) and it can't do it when killed from the oom killer either.

with this we should be able to track down which profile is failing

Revision history for this message
Ian Johnson (anonymouse67) wrote :

So I ran your snippet to determine which profiles weren't loaded and the only one which wasn't loaded was:

```
$ sudo cat /sys/kernel/security/apparmor/profiles | awk '{ print $1 }' > /tmp/foo ; sudo apparmor_parser -N /etc/apparmor.d/ /var/lib/snapd/apparmor/profiles/ >> /tmp/foo ; sort /tmp/foo | uniq -c | grep -e ' 1 '
Skipping profile in /etc/apparmor.d/disable: usr.bin.firefox
Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
      1 snap-update-ns.layouts-test
```

which is a local snap I was developing quite some time ago. I will attach the associated apparmor profile that was generated, but curiously, when I try to load that profile manually with apparmor_parser it succeeds:

```
$ sudo apparmor_parser -r /var/lib/snapd/apparmor/profiles/snap-update-ns.layouts-test
$ echo $?
0
```

with the following output in the system journal indicating that the load was successful:

```
May 29 11:23:22 audit[21275]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="snap-update-ns.layouts-test" pid=21275 comm="apparmor_parser"
May 29 11:23:22 kernel: audit: type=1400 audit(1559147002.259:420): apparmor="STATUS" operation="profile_load" profile="unconfined" name="snap-update-ns.layouts-test" pid=21275 comm="apparmor_parser"
May 29 11:23:22 sudo[21273]: pam_unix(sudo:session): session closed for user root
```

and no kernel messages regarding apparmor_parser being killed from the OOM killer.

After doing this, then there is not a diff between the expected loaded profiles and the actual loaded profiles using your snippet, but if I try again to start apparmor.service it still gets killed by the OOM killer with similar output as above.

Revision history for this message
Ian Johnson (anonymouse67) wrote :

Ah actually, if I move that profile out of the way, then `systemctl start apparmor` starts immediately. So the issue must be with that profile being too large (and indeed it is 4-5 MB).

Revision history for this message
John Johansen (jjohansen) wrote :

So yes that does appear to be part of it. I pulled your profile and tested just a compile

time apparmor_parser -QT -D dfa-stats /tmp/layouts-test-1.txt

Created dfa: states 16780 proto { cache: size=16780 dups=36386 longest=1244 avg=6 }, nnodes { cache: size=16761 dups=36405 longest=1243 avg=5 }, anodes { cache: size=11 dups=35437 longest=2 avg=1 }
Minimized dfa: final partitions 699 (accept 73) init 8 (accept 7)
Created dfa: states 34473 proto { cache: size=34473 dups=21674 longest=598 avg=7 }, nnodes { cache: size=34468 dups=21679 longest=598 avg=7 }, anodes { cache: size=6 dups=4992 longest=2 avg=1 }
Minimized dfa: final partitions 27273 (accept 2095) init 4 (accept 3)

real 0m27.084s
user 0m26.735s
sys 0m0.348s

which is quite slow, but can happen for big profiles. With Valgrind --tool=massif reporting a peak heap usage of 884.5MB

However Ubuntu defaults to using -O no-expr-simplify because it can speed up small profiles. With that I get

time apparmor_parser -QT -D dfa-stats -O no-expr-simplify /tmp/layouts-test-1.txt

Created dfa: states 40915 proto { cache: size=40915 dups=83997 longest=4870 avg=9 }, nnodes { cache: size=40633 dups=84279 longest=4869 avg=8 }, anodes { cache: size=11 dups=82787 longest=2 avg=1 }
Minimized dfa: final partitions 699 (accept 73) init 8 (accept 7)
Created dfa: states 44769 proto { cache: size=44769 dups=28309 longest=33583 avg=226 }, nnodes { cache: size=44495 dups=28583 longest=33583 avg=226 }, anodes { cache: size=6 dups=8500 longest=2 avg=1 }
Minimized dfa: final partitions 27273 (accept 2095) init 4 (accept 3)

real 0m45.947s
user 0m39.770s
sys 0m6.166s

with valgrind --tool=massif reporting a peak usage of 15.4 GB

ouch

and that isn't the worst of it, because the initscripts run multiple compiles in parallel. Mind you most compiles only take a few MB, but still all of that happening at the same time puts a lot of pressure on the system.

Revision history for this message
John Johansen (jjohansen) wrote :

Once you can get a profile to compile apparmor can cache the compile for you, so ideally the compile only needs to happen once per kernel.

But I completely get even then, with this profile that is a problem.

Can I keep the profile, and add it to a test suite, to look into reducing the compilers memory usage?

Revision history for this message
Ian Johnson (anonymouse67) wrote :

Yes, certainly use the profile for whatever you can use it for. Would you like me to edit the description on this bug to reflect the actual underlying cause here or should I just close this and file a new bug for the memory usage of this profile? I'm no expert here but I think 15.4 GB memory usage seems a tad high and thus seems buggy :-)

Revision history for this message
Jamie Strandboge (jdstrand) wrote :

@Ian - how did you generate this profile? Is this something that snapd generated (it doesn't look like typical snap-update-ns profiles...)? If it did, can you attach the snap.yaml (this seems like atypical usage of the layouts feature)?

Revision history for this message
Ian Johnson (anonymouse67) wrote :

@Jamie yes this was generated by snapd, the original snapcraft.yaml is attached.

And also yes I fully understand this was an atypical usage of layouts, I was experimenting with using layouts to make it seem to a snap application that an additional package was installed in the base snap. I generated the yaml section using a bit of code that I seem to have lost now, but the gist of it was to find all files in a given deb package (python2.7 in this case) and create a layout symlink for every file to point to the real files underneath $SNAP. I actually remember abandoning this originally specifically because apparmor crashed when attempting to install the snap and forgot to file a bug here, so here's the bug finally...

Revision history for this message
John Johansen (jjohansen) wrote :

@Ian - renaming this bug wfm

summary: - apparmor fails to start with no parser errors
+ apparmor uses excessive memory leading to oom kill
description: updated
Revision history for this message
Sergio Cazzolato (sergio-j-cazzolato) wrote :
Download full text (8.2 KiB)

Hi, running the snapd tests on i386 I see the following error that could be related to this. It is failing with pc-kernel on beta and candidate but I coun't reproduce it with the version on stable.

> snap list
Name Version Rev Tracking Publisher Notes
core 16-2.39.1 7122 beta canonical* core
pc 16.04-0.10 34 candidate canonical* gadget
pc-kernel 4.4.0-149.175 219 candidate canonical* kernel

Logs:
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: vmap allocation for size 73728 failed: use vmalloc=<size> to increase size.
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: vmalloc: allocation failure: 68497 bytes
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: apparmor_parser: page allocation failure: order:0, mode:0x24000c2
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: CPU: 1 PID: 23415 Comm: apparmor_parser Not tainted 4.4.0-149-generic #175-Ubuntu
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: c1b15967 70e2bd9c 00000286 d0cf7e00 c13c12ef c1a1ce6c 00000001 d0cf7e30
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: c11821b6 c1a1b220 f479a700 00000000 024000c2 d0cf7e44 c1a1ce6c d0cf7e18
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: 70e2bd9c 00010b91 00000000 d0cf7e60 c11ba86f 024000c2 00000000 c1a1ce6c
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: Call Trace:
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: [<c13c12ef>] dump_stack+0x58/0x79
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: [<c11821b6>] warn_alloc_failed+0xd6/0x110
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: [<c11ba86f>] __vmalloc_node_range+0x1ef/0x210
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: [<c11ba8f6>] __vmalloc_node+0x66/0x70
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: [<c134f978>] ? __aa_kvmalloc+0x28/0x60
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: [<c11ba978>] vmalloc+0x38/0x40
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: [<c134f978>] ? __aa_kvmalloc+0x28/0x60
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: [<c134f978>] __aa_kvmalloc+0x28/0x60
May 30 03:57:34 localhost.localdomain snapd[23208]: May 30 03:57:33 localhost.localdomain kernel: [<c134a7f4>] aa_simple_write_to_buffer+0x34/0x90
May 30 03:57:34 localhost...

Read more...

Revision history for this message
John Johansen (jjohansen) wrote :

@Sergio: your issue is different. It is being killed during a kernel operation (sys_write) due to a vmalloc failure, where this bug is occurring during a userspace compile. Please open a new bug

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in apparmor (Ubuntu):
status: New → Confirmed
Revision history for this message
Ivan Baldo (ibaldo) wrote :

The biggest problem is that it isn't easily disabled because it is hardcoded in the script instead of being in /etc/apparmor/parser.conf.
Instead of hardcoding, it would had been better to just update that conffile and let dpkg update it if the user didn't change it (which is highly likely) or on new installs.

In /lib/apparmor/functions we have this:
# LP: #1383858 - expr tree simplification is too slow for some
# policy on 32bit ARM, so disable it for now
cache_extra_args=
if [ -d "$PROFILES_CACHE_VAR" ] && [ "$pdir" = "$PROFILES_VAR" ]; then
    cache_extra_args="-O no-expr-simplify"
fi

So now for machines with 2Gib of RAM and Snaps, Ubuntu 18.04 has become unusable.
I know the minimum requirements are 4Gib but it did actually work fine with 2Gib before so it is a bit sad to loose that capability.

My suggestion is to just revert that change and do it in parser.conf instead, so at least we have the option to easily modify it to retain some 2Gib support.

Thanks a lot for considering this!!!

Revision history for this message
Jamie Strandboge (jdstrand) wrote :

@Ivan, we are going to fix snapd for the excessive memory usage. AppArmor upstream already uses expr-simplify by default and newer release of Ubuntu use parser.conf to set -O no-expr-simplify so users can manage the setting like any other conffile.

Changed in apparmor:
status: New → Confirmed
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.