AppArmor causing the kernel to spend 30% of its time in a spin-lock

Bug #1993330 reported by Thomas Mangin
This bug affects 1 person
Affects: AppArmor
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi,

We run some proxies, written in Go. pprof showed that all the time was spent in network I/O writes.
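
For reference, a CPU profile like the one above can be collected from a running Go binary with go tool pprof; the :6060 pprof endpoint below is an assumption about how the proxy is built, not something stated in this report.

# Collect a 30-second CPU profile from the running proxy (assumes the binary
# imports net/http/pprof and serves it on localhost:6060 -- an assumption).
go tool pprof -seconds 30 http://localhost:6060/debug/pprof/profile
# In the interactive prompt, `top` and `web` show where the samples landed.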

Looking into the issue with:

echo "function_graph" > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace > trace
echo "00000000,00000000,01000000" > /sys/kernel/debug/tracing/tracing_cpumask

The kernel function do_syscall_64 is used to perform the I/O calls. The data indicated that most calls returned within a perfectly reasonable time, but that some calls inexplicably took tens to hundreds of milliseconds, matching what we had already observed in the perf output.
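
As an illustrative sketch (not commands from this report), the function_graph tracer can also be narrowed to the syscall entry path so that only do_syscall_64 and its children are recorded; the paths assume tracefs is mounted under /sys/kernel/debug/tracing as above.

# Trace only do_syscall_64 and everything it calls (illustrative sketch)
echo do_syscall_64 > /sys/kernel/debug/tracing/set_graph_function
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace > trace
echo 0 > /sys/kernel/debug/tracing/tracing_on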

Further analysis showed an interesting pattern that occurred in "slow" calls:

kprobe_ftrace_handler() {
 93) 0.140 us | get_kprobe();
 93) | pre_handler_kretprobe() {
 93) 0.150 us | _raw_spin_lock_irqsave();
 93) 0.150 us | _raw_spin_unlock_irqrestore();
 93) 0.150 us | arch_prepare_kretprobe();
 93) | _raw_spin_lock_irqsave() {
 93) # 1459.327 us | native_queued_spin_lock_slowpath();
 93) # 1460.008 us | }
 93) | _raw_spin_unlock_irqrestore() {
and

 79) | trampoline_handler() {
 79) 0.140 us | kprobe_busy_begin();
 79) | kretprobe_hash_lock() {
 79) 0.150 us | _raw_spin_lock_irqsave();
 79) 0.411 us | }
 79) 0.140 us | percpu_array_map_lookup_elem();
 79) 0.140 us | percpu_array_map_lookup_elem();
 79) 0.140 us | bpf_get_current_pid_tgid();
 79) | __htab_map_lookup_elem() {
 79) 0.150 us | lookup_nulls_elem_raw();
 79) 0.652 us | }
 79) | htab_map_update_elem() {
 79) | _raw_spin_lock_irqsave() {
 79) # 1433.058 us | native_queued_spin_lock_slowpath();
 79) # 1434.020 us | }

The CPU core was busy acquiring a spinlock for about a third of the available time. The proportion of 'wasted' time grew as traffic increased: the busier the machine, the more likely threads were to contend for the shared resource, and the worse the problem became.
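
For anyone trying to reproduce this, one way to quantify how much CPU time goes into the spinlock slow path (a sketch, not part of the original investigation) is a system-wide perf sample:

# Sample all CPUs for 10 seconds and look for the spinlock slow path
sudo perf record -a -g -- sleep 10
sudo perf report --stdio | grep -B2 -A2 native_queued_spin_lock_slowpath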

Once the server reached 2 Gbps of traffic, the machine was unable to handle the load any more.

Removing AppArmor and rebooting the server fixed the issue.
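
The report does not say exactly how AppArmor was removed; two common ways to take it out of the picture on Ubuntu are sketched below (both require a reboot, and the choice between them is an assumption, not part of the report).

sudo aa-status              # confirm AppArmor is currently loaded and enforcing
sudo apt purge apparmor     # remove the userspace tools and profiles
# or disable the LSM at boot: add apparmor=0 to GRUB_CMDLINE_LINUX in
# /etc/default/grub, then regenerate the boot configuration:
sudo update-grub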

Revision history for this message
John Johansen (jjohansen) wrote :

What kernel version?

How many cpus on the system?

A rough guess of the number of tasks/threads would be nice, if it's even possible to get an estimate.

Revision history for this message
Thomas Mangin (thomas-mangin) wrote :

We saw the issue with AMD Epyc, 1st, 2nd and 3rd generation, with 16 or 24 cores per CPU and hyperthreading enabled.

Go has its own runtime: it runs one OS thread per core (GOMAXPROCS), and the parallelism (goroutines) is then scheduled by Go itself (a sort of cooperative multitasking).

Handling 2 Gbps of HTTP/HTTPS traffic maxes out the CPUs assigned to Go (half of them) and brings the service to a crawl. I am not sure how many TCP connections there are, but a lot; we run multiple goroutines per connection (at least 3; again, I have not looked at the code for years).
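
As a rough illustration of that arrangement (the binary name and the 50% split are assumptions for illustration, not values from the report), limiting the Go runtime to half of the logical CPUs looks like this:

# Run the proxy with the Go runtime limited to half the logical CPUs
# (./proxy and the 50% split are illustrative assumptions).
GOMAXPROCS=$(( $(nproc) / 2 )) ./proxy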

We are running Ubuntu 20.04.5 LTS and saw the issue with the last two kernels (sorry, I am not in charge of the system and no longer have SSH access, so I cannot tell you the exact revision).

Revision history for this message
Thomas Mangin (thomas-mangin) wrote (last edit ):

The "normal" time spend in "system" is normally very small, most of it is caused by the issue.

This information was part of a blog entry we wrote about the issue. I will check whether we have more information, but this may be all I can provide.

summary: - AppArmor causing the kernel to send 90% of its time in a spin-lock
+ AppArmor causing the kernel to spend 90% of its time in a spin-lock
Revision history for this message
Thomas Mangin (thomas-mangin) wrote : Re: AppArmor causing the kernel to spend 90% of its time in a spin-lock

And sorry, "90% of its time in a spin-lock" may have been overstating the fact; 30% is probably more accurate. I will rename the subject.

description: updated
summary: - AppArmor causing the kernel to spend 90% of its time in a spin-lock
+ AppArmor causing the kernel to spend 30% of its time in a spin-lock
Revision history for this message
Thomas Mangin (thomas-mangin) wrote :

The problem was traced to kernel probes. It would also have been triggered by netdata when its eBPF monitoring was enabled, but removing AppArmor did help the situation at the time.
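
For anyone trying to confirm the same thing, the kprobes and eBPF programs active on a machine can be listed as follows (a sketch, not commands from this report):

# List kprobes currently registered in the kernel
sudo cat /sys/kernel/debug/kprobes/list
# List loaded eBPF programs (netdata's ebpf.plugin shows up here when active)
sudo bpftool prog list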

I have not been involved with the debugging of this issue, but the team did author two posts on the topic:
- https://technical.exa.net.uk/articles/2022-10-17-debuging-latency/
- https://technical.exa.net.uk/articles/2022-11-10-latency-redux/

As I do not know whether AppArmor inserts kernel probes, I will let someone with more expertise decide whether this issue should be closed.
