using perf can crash kernel with a stack overflow

Bug #1875941 reported by Colin Ian King
This bug affects 2 people
Affects          Status         Importance  Assigned to     Milestone
Linux            Confirmed      Unknown
linux (Ubuntu)   In Progress    High        Colin Ian King
  Focal          Fix Committed  Undecided   Unassigned

Bug Description

Running sudo stress-ng --perf --cpu 1 -t 10 causes the recent 5.4.0-25-generic kernel to lock up, with nothing on the console to show where it is stuck.

Bisected this back to:

commit d44d71bbb9618c526820b39fe1cd0673582dc8c4 (refs/bisect/bad)
Author: Joerg Roedel <email address hidden>
Date: Sat Mar 21 18:22:41 2020 -0700

    x86/mm: split vmalloc_sync_all()

    BugLink: https://bugs.launchpad.net/bugs/1869061

    commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream.

Changed in linux (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Colin Ian King (colin-king)
status: Triaged → In Progress
Revision history for this message
Colin Ian King (colin-king) wrote :

This commit can be reverted against the current focal head, and reverting it resolves the issue.

Revision history for this message
Colin Ian King (colin-king) wrote :

Tried this on 5.5.19, 5.6.7 and 5.7-rc3 and it's still an issue.

Revision history for this message
Colin Ian King (colin-king) wrote :

The tlb/tlb_flush perf event triggers this bug; with that event disabled, the issue does not occur.

Revision history for this message
Colin Ian King (colin-king) wrote :

Attaching strace; managed to capture this crash report before the hang.

Revision history for this message
Colin Ian King (colin-king) wrote :

OK, this seems to only happen when using exceptions/page_fault_user, exceptions/page_fault_kernel and tlb/tlb_flush together.

One can reproduce this with perf:

sudo perf record -e exceptions:page_fault_user,exceptions:page_fault_kernel,tlb:tlb_flush sleep 1

Revision history for this message
In , colin.king (colin.king-linux-kernel-bugs) wrote :

Originally I triggered this with stress-ng V0.11.08 by running sudo stress-ng --perf --cpu 1 -t 10

I've since pushed a commit so that stress-ng does not use the TLB flush event, to avoid this issue for the moment.

I've worked through all the perf event combinations and found that the kernel panic occurs with the following events:

sudo perf record -e exceptions:page_fault_user,exceptions:page_fault_kernel,tlb:tlb_flush sleep 1
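
As an illustration of working through event combinations, the following is a minimal sketch that enumerates every non-empty subset of the three tracepoints named in this report and builds the matching perf record command line. It is not the procedure used for this report. perf_commands is a hypothetical helper, and the script only prints the commands rather than running them, since the failing combination hangs the machine.

```python
from itertools import combinations

# Tracepoints named in this report; a wider search would start from a longer list.
EVENTS = [
    "exceptions:page_fault_user",
    "exceptions:page_fault_kernel",
    "tlb:tlb_flush",
]


def perf_commands(events, workload="sleep 1"):
    """Return one `perf record` command line per non-empty subset of events.

    Hypothetical helper for illustration only; it builds strings and does
    not execute anything.
    """
    cmds = []
    for size in range(1, len(events) + 1):
        for subset in combinations(events, size):
            cmds.append(f"sudo perf record -e {','.join(subset)} {workload}")
    return cmds


if __name__ == "__main__":
    for cmd in perf_commands(EVENTS):
        print(cmd)
```

The last command produced is the full three-event combination quoted above; running the shorter ones first is a way to confirm that no smaller subset triggers the crash.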

Bisecting the kernel I found that this issue occurred when the following commit landed in the kernel:

commit 763802b53a427ed3cbd419dbba255c414fdd9e7c
Author: Joerg Roedel <email address hidden>
Date: Sat Mar 21 18:22:41 2020 -0700

    x86/mm: split vmalloc_sync_all()

This is a 100% reproducer; it always happens on x86-64, both in a VM and on hardware.

Revision history for this message
In , colin.king (colin.king-linux-kernel-bugs) wrote :

Created attachment 288837
full stack dump

Top of stack dump (attached) shows it's a stack overflow

[ 22.163398] BUG: stack guard page was hit at (____ptrval____) (stack is (____ptrval____)..(____ptrval____))
[ 22.165204] kernel stack overflow (double-fault): 0000 [#1] SMP PTI
[ 22.166729] CPU: 3 PID: 935 Comm: perf Not tainted 5.4.0-28-generic #32-Ubuntu
[ 22.168813] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
[ 22.171263] RIP: 0010:perf_trace_x86_exceptions+0x44/0xf0
[ 22.172769] Code: 83 ec 18 48 8b 5f 78 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 65 48 03 1d 00 0c f9 68 48 8b 87 80 00 00 00 48 85 c0 75 08 <48> 8b 03 48 85 c0 74 74 bf 24 00 00 00 48 8d 55 c4 48 8d 75 c8 e8
[ 22.176573] RSP: 0018:ffff978f00838020 EFLAGS: 00010046
[ 22.177569] RAX: 0000000000000000 RBX: ffffb78effdcab70 RCX: 0000000000000000
[ 22.178800] RDX: ffff978f008380b8 RSI: ffffb78effdcab70 RDI: ffffffff9863e620
[ 22.179993] RBP: ffff978f00838060 R08: 0000000000000000 R09: 0000000000000000
[ 22.181188] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9863e620
[ 22.182698] R13: 0000000000000000 R14: ffffb78effdcab70 R15: ffff978f008380b8
[ 22.184019] FS: 00007ff4818af780(0000) GS:ffff892b7db80000(0000) knlGS:0000000000000000
[ 22.185592] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 22.186732] CR2: ffff978f00837ff8 CR3: 000000007d5d8000 CR4: 00000000000006e0
[ 22.188100] Call Trace:
[ 22.188689] do_page_fault+0xca/0xe0
[ 22.189493] do_async_page_fault+0x39/0x70
[ 22.190388] async_page_fault+0x34/0x40
[ 22.191233] RIP: 0010:perf_trace_x86_exceptions+0x44/0xf0
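
The register values in this dump can be checked against the stack overflow diagnosis with a little arithmetic. The sketch below parses RSP (the stack pointer) and CR2 (the faulting address) from two lines of the dump and compares them. The 4 KiB page size is an assumption (the usual x86-64 base page size), and parse_registers is a hypothetical helper, not part of any kernel tooling.

```python
import re

# Two lines copied from the dump above.
OOPS = """\
[ 22.176573] RSP: 0018:ffff978f00838020 EFLAGS: 00010046
[ 22.186732] CR2: ffff978f00837ff8 CR3: 000000007d5d8000 CR4: 00000000000006e0
"""

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86-64 base page size


def parse_registers(text):
    """Extract RSP (stack pointer) and CR2 (faulting address) as integers."""
    rsp = int(re.search(r"RSP: [0-9a-f]{4}:([0-9a-f]{16})", text).group(1), 16)
    cr2 = int(re.search(r"CR2: ([0-9a-f]{16})", text).group(1), 16)
    return rsp, cr2


rsp, cr2 = parse_registers(OOPS)
page_base = rsp & ~(PAGE_SIZE - 1)  # start of the page RSP points into

print(f"RSP - CR2       = {rsp - cr2} bytes")
print(f"page base       = {page_base:#x}")
print(f"page base - CR2 = {page_base - cr2} bytes")
```

With these values, CR2 lies 40 bytes below RSP and 8 bytes below the start of the page that RSP points into. That is consistent with a write running off the bottom of the stack into the guard page, as the "stack guard page was hit" line reports.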

Revision history for this message
Colin Ian King (colin-king) wrote :

Appears to be a stack overflow. The top of the stack dump (attached) shows that the stack guard page was hit, followed by a double fault.

Revision history for this message
In , colin.king (colin.king-linux-kernel-bugs) wrote :

Finally got a full stack dump:

Revision history for this message
In , colin.king (colin.king-linux-kernel-bugs) wrote :

Still occurs on 5.7-rc2 and today's linux-next tip.

Changed in linux (Ubuntu):
importance: Critical → High
summary: - using perf as root locks up focal kernel
+ using perf can crash kernel with a stack overflow
Changed in linux:
status: Unknown → Confirmed
Revision history for this message
In , joro (joro-linux-kernel-bugs) wrote :

Please have a look here:

https://<email address hidden>/

and

https://<email address hidden>/

This is likely the same issue.

Revision history for this message
Colin Ian King (colin-king) wrote :

https://<email address hidden>/

Revision history for this message
In , colin.king (colin.king-linux-kernel-bugs) wrote :

Yep, it definitely is the same issue.

Revision history for this message
Colin Ian King (colin-king) wrote :

Fixed in upstream commit:

commit 11f5efc3ab66284f7aaacc926e9351d658e2577b
Author: Steven Rostedt (VMware) <email address hidden>
Date: Wed May 6 10:36:18 2020 -0400

    tracing: Add a vmalloc_sync_mappings() for safe measure

Revision history for this message
Colin Ian King (colin-king) wrote :
Changed in linux (Ubuntu Focal):
status: New → Fix Committed