Online profile replacement appears to cause severe CPU contention (likely RCU-related?)

Bug #1856256 reported by Bill F on 2019-12-12
This bug affects 1 person
Affects Status Importance Assigned to Milestone

Bug Description

First of all, I would like to loudly disclaim that this report is very qualitative in nature, and I cannot reliably reproduce the problem. However, my hope is that a clear description of the observed behavior will be sufficient to indicate potential underlying causes to someone with the right knowledge.

That said, here is the behavior:

While "apparmor_parser -r" is running, on a large (> 50 cores) host with ~50 apache server processes (which use AA subprofiles, primarily for file system isolation), the volume of requests handled drops precipitously (from ~thousands/minute to <100/minute), while both apparmor_parser and the apache processes appear to be consuming high amounts of CPU continuously (apparmor_parser, specifically, was consuming at least a full core in all samples observed).

Furthermore, and this is the interesting part, *every 10 seconds*, two things occur:

1) AA logs one message of the form below, where the sequence of $profile_name values (again, one every 10 seconds) represents a lexicographically-ordered iteration over all of the relevant AA profiles on this host.

$timestamp $host kernel: [...] audit: type=1400 audit(...): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="$profile_name" pid=<...> comm="apparmor_parser"

2) A burst of requests are successfully processed.

Unfortunately, someone rebooted the host before I was able to capture any additional information (or attempt any finer-grained remediation measures), and the host came back up in a nominal state.

Again, I realize all of this is very vague/high-level, but two things stand out to me as particularly significant here:

1) The behavior described above began *immediately* after the invocation of "appamor_parser -r", and that "make some progress every 10 seconds" periodicity strongly implies some sort of resource contention. And, for whatever it's worth from a random person on the internet, a reasonably thorough forensic analysis has uncovered absolutely no other potential sources of that contention/periodicity.

2) From "If a CPU is still holding out 10 seconds into the grace period, RCU will invoke resched_cpu() on it regardless of its nohz_full state."

My initial thought was that this behavior smelled like some kind of spinlock contention -- but I simply could not explain the consistency of the 10-second intervals, at which both apparmor_parser and the apache processes (and, for that matter, various other processes running on the host) were able to make progress.

The only thing I can think of that might explain this magical "every 10 seconds" behavior would be some kind of an RCU-related bug (e.g. an infinite loop inside of a read-side critical section).

One other bit of nuance that might be worth sharing (which I elided in the description above in the hopes of increasing the odds of someone reading this far :)) is that the system degradation occurred in two phases:

a) Immediately after apparmor_parser started running, the degradation (in particular, in terms of request-processing throughput) began, but after about ~15 seconds, the request volume briefly appeared to recover.

b) However, about ~10 seconds later, during the same second that the very first "same as current profile, skipping" message was logged, the degradation (and the unmistakeable 10-second periodicity) began in earnest and continued until reboot.

Kernel info:
$ uname -a
Linux my-host 5.0.0-1025-gcp #26~18.04.1-Ubuntu SMP Mon Nov 11 13:09:18 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Package info:
$ dpkg -l | grep -i apparmor
ii apparmor 2.12-4ubuntu5.1 amd64 user-space parser utility for AppArmor
ii apparmor-profiles 2.12-4ubuntu5.1 all experimental profiles for AppArmor security policies
ii apparmor-utils 2.12-4ubuntu5.1 amd64 utilities for controlling AppArmor
ii libapparmor1:amd64 2.12-4ubuntu5.1 amd64 changehat AppArmor library
ii python3-apparmor 2.12-4ubuntu5.1 amd64 AppArmor Python3 utility library
ii python3-libapparmor 2.12-4ubuntu5.1 amd64 AppArmor library Python3 bindings

Bill F (bill-f-666) on 2019-12-12
summary: Online profile replacement appears to cause severe CPU contention
+ (likely RCU-related?)
Bill F (bill-f-666) on 2019-12-13
description: updated
John Johansen (jjohansen) wrote :

How many profiles and subprofiles are there?

Can you provide the profile set used in this bug report to test against?

How many processors does the system have?

How much memory does the system have? And how loaded is the system?

So first I will provide a generic overview of profile replacement and then I will try to get to your individual points.

AppArmor does make use of rcu to try to avoid take locks as much as possible during mediation. This is a deliberate design decision as profile replacement should take place far less than the actual mediation checks. If there are a lot of objects flooding the rcu queue, rcu is known to regress and cause its own slow downs. Exactly how will depend on the kernel.

Profile replacement is also can be a costly operation in that locking and cache overheads come into play.

Profile replacement takes place in phases. Most of them can overlap with concurrent replacements.

1. Compile - there is check done to see if there is a current previously compiled version of the profile to load. If there is the apparmor_parser will load that into the kernel. If not a compile will be done which can be cpu intensive (use 100% of the cpu), and can use multiple cpus (depending on how --jobs is set, the default is the number of cpus).

Currently the cache check and compile is done for each profile file. So for multiple profiles this step can overlap the following steps.

2. Profile load. The profile is loaded into the kernel and the kernel unpacks and verifies the profile being loaded.

3. Duplicate elimination check. The loaded profile is checked against the current profiles if it is found to be an exact duplicate. A message is logged that the profile is a duplicate and the rest of the steps are skipped.

4. Profile replacement. This involves taking locks, that will block other profile replacements (steps 1-3 can happen currently) and will block certain tasks like the creation of new labels. So depending on the operation done this could cause a slow down due to lock contention.

The live profile is replaced with the new profile and the label tree is walked to replace all labels with references to the old profile to the new profile.

The locks are released and profile replacement is done from a userspace perspective.

5. Task profile replacement. Individual tasks detect that their confinement label is stale and replace it with the new label generated in step 4. Putting their references to the old label.

6. Once all references are put freeing of labels use rcu to put the label on rcu call back.

Profile replacement will also cause cache hits to miss forcing revalidations which means fast past mediation will be lost until the cache is replaced. Hence mediation in general will be slowed down for a time due to replacement. The exact amount of time is dependent on the application access patterns and kernel version.

John Johansen (jjohansen) wrote :
Download full text (3.2 KiB)

0. the apparmor_parser consuming a core or more during compile is expected. See step 1 above.

  Are you seeing cache messages being logged?
  Would you be willing to add a flag to parser.conf that would log more info about caching of the complile?

1. This message is logged during the duplicate elimination check. So the profile being loaded is being dropped. It is possible this will cause the profile just loaded to use rcu, to free the profile or some of the components allocated during unpack.

  Does the exact same $profile_name get logged multiple times? If so this would indicate userspace is swamping the kernel with multiple replacement requests when not necessary.

2. Agreed this does sound like resource contention. The goal is to track down what exactly is the bottle neck, so it can be fixed.

The apparmor_parser when invoked will immediately consume resources.

2.1 Compiles can be cpu and memory intensive so it would be good to get access to the policy on the machine to test what kind of resources are being. Compiles however do not cause locking issues, nor rcu problems.

2.2. The majority of the load phases 1-4 are attributed to the apparmor_parser. phase 1 in userspace and 2-4 in the kernel.

2.3. phase 4 does take spinlocks, so this could certainly be part of the problem. The load is structured to do as much of the work as possible before the locks are taken, and they are released, and then re-aquired where possible to allow for pre-empt points.

2.4 a) replacement will immediately degrade performance, whether due to the compile consuming resources or due the load and locking.

I would assume the first part of the recovery was due to some of the replacement being done. Eg. maybe the phase 1 compile. This would free up cpu resources and reduce some contention.

2.5 b) the periodic 10s slowdown does sound like rcu. The message being logged is odd in that this is not usually considered the most resource intensive part of a replacement. Indeed the duplicate elimination is done to avoid some of the locking and resource, cache churn that happen due to phase 4-6.

Duplicate eliminate will do hash comparisons first. And only if the hash is an exact match a full memory compare. The full memory compare does take some cpu time but is generally very fast.

With that said this phase is done under a mutex lock, which will block other replacements behind it. If there is a lot of resource contention this could account for the behavior seen in that each outstanding replacement will be queued up waiting on the mutex lock. One will be processed at a time, and kernel scheduling will determine when they wake up.

If enough of these apparmor_parser tasks are queued up waiting to perform a replacement they could consume a significant amount of memory resources just sitting there waiting because in general userspace programs only return memory to the system when they exist.

So for example if there are 50 processes queued up, and each used 50 MB for a compile we are looking at 2.5 GB of memory not available for caches etc.

This does not explain all the behavior you have seen but is a potential source of problems. This phase could p...


John Johansen (jjohansen) wrote :

If attaching policy to this bug is not acceptable (completely understandable), would you be willing to email me the policy?

Bill F (bill-f-666) wrote :

Thank you for your response!

> How many profiles and subprofiles are there?
There are not many top-level profiles here, but this particular incident did involve replacing a profile with nearly 200 sub-profiles.

> Can you provide the profile set used in this bug report to test against?
Sorry, I'd rather not provide the profiles, which may contain (at least implicitly) PII. But please let me know if there are any specific (or abstract) characteristics of the profiles that you have in mind, etc, and I'd be happy to share anonymized info.

How many processors does the system have?
60 (hyperthreads).

How much memory does the system have?
Approximately 200G (please let me know if a more precise figure would be helpful here and I can find it)

And how loaded is the system?
Do you have any specific measures in mind? Generally speaking (in terms of memory, CPU, IO, etc), the system was under significant, but manageable load prior to this incident.

Bill F (bill-f-666) wrote :

Sorry, I’m new to this forum and clearly used suboptimal markdown in my previous comment...and forgot to use quote markdown altogether for the last few questions 😞. But hopefully that didn’t substantively reduce the clarity of the comment.

Bill F (bill-f-666) wrote :

Sorry #2: I began writing my response(s) before your last couple comments had appeared in my browser window, and it’s just about my bedtime now. Will respond to your other inquiries first thing in the morning.

And thanks again for your responses! There are certainly limits to how much info I’ll be able to share in this case, but I’d definitely like to work to provide you with whatever info you need to understand this behavior — and I suspect I can quickly do some scripting to obfuscate any PII in my config before sharing it with you.

Bill F (bill-f-666) wrote :

Ok, less sleepy than expected.

> Are you seeing cache messages being logged?
No, but then we weren't passing any flags to apparmor_parser aside from -r, and the only option we had enabled in parser.conf was: Optimize=no-expr-simplify. FWIW, I did try running apparmor_parser with cache-logging flags on a non-production host (with generally similar config to the host from the incident described in my OP), and it indicated cache misses (via STDOUT) for each of the apache subprofiles...but there weren't any "same as current profile, skipping" messages, so I'm not convinced these ended up on the same codepaths.

> Would you be willing to add a flag to parser.conf that would log more info about caching of the complile?
Absolutely -- in fact, I plan to do so! But to be clear, the host on which the original incident occurred is not the kind of host where I could run apparmor_parser just for the sake of collecting no guarantees regarding the timing of the next AA parser run (on that particular host).

> Does the exact same $profile_name get logged multiple times? If so this would indicate userspace is swamping the kernel with multiple replacement requests when not necessary.
Nope! Each profile is logged exactly once, and the order in which they are logged is lexicographic. That is, the logs look like this:

${time1} operation="profile_replace" info="same as current profile, skipping" logged for profile "a"
${time1_plus10_seconds} operation="profile_replace" info="same as current profile, skipping" logged for profile "b"
${time1_plus20_seconds} operation="profile_replace" info="same as current profile, skipping" logged for profile "c"

John Johansen (jjohansen) wrote :

looking into this more. Profiles with hats, are what will trigger this because the list of children is transferred whole sale from the old parent to the new. This forces an rcu synchronization on the profile replacement. But this is inside of a mutex lock blocking other replacement. Hence one profile gets replaced, and then there is a pause waiting for the rcu synch, and then the lock is released and the next profile replacement happens and blocks in the same way.

This part of the profile replacement is going to need to be redesigned. Because rcu is used to provide a lockless read side. This probably means creating new children instead of moving the children. This isn't as bad as it sounds as they can share the same state machines etc (the majority of the profile) via reference counting.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers