Hard lockup in 2 CPUs due to deadlock in cpu_stoppers
Affects | Status | Importance | Assigned to | Milestone | ||
---|---|---|---|---|---|---|
linux (Ubuntu) | ||||||
Xenial |
Fix Released
|
Undecided
|
Unassigned | |||
Bionic |
Fix Released
|
Undecided
|
Unassigned |
Bug Description
[Impact]
* This problem hard locks up 2 CPUs in a deadlock, and this
soft locks up other CPUs as an effect; the system becomes
unusable.
* This is relatively rare / difficult to hit because it's a
corner case in scheduling/load balancing that needs timing
with CPU stopper code. And it needs SMP plus _NUMA_ system.
(but it can be hit with synthetic test case attached in LP.)
* Since SMP plus NUMA usually equals _servers_ it looks like
a good idea to prevent this bug / hard lockups / rebooting.
* The fix resolves the potential deadlock by removing one of
the calls required to deadlock from under the locked code.
[Test Case]
* There's a synthetic test case to reproduce this problem
(although without the stack traces - just a system hang)
attached to this LP bug.
* It uses kprobes/mdelay/cpu stopper calls to force the code
to execute and force the timing/locking condition to occur.
* $ sudo insmod kmod-stopper.ko
Some dmesg logging occurs, and systems either hangs or not.
See examples in comments.
[Regression Potential]
* These are patches to the cpu stop_machine.c code, and they
change a bit how it works; however, there are no upstream
fixes for these patches anymore and they are still the top
of the 'git log --oneline -- kernel/
* These patches have been verified with the synthetic test case
and 'stress-ng --class scheduler --sequential 0' (no regressions)
on guest with 2 CPUs and one physical system with 24 CPUs.
[Other Info]
* The patches are required on Xenial and later.
* There are 4 patches for Xenial, and 2 patches pending for Bionic.
* All patches are applied from Cosmic onwards.
[Original Description]
These 2 hard lockups happened all of a sudden in the logs, and many soft lockups occur after them as a fallout.
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.477086] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.483800] Modules linked in: <...>
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484066] CPU: 10 PID: 58 Comm: migration/10 Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484068] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484070] task: ffff883ff2a76200 ti: ffff883ff2110000 task.ti: ffff883ff2110000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484071] RIP: 0010:[<
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484079] RSP: 0000:ffff883ff2
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484080] RAX: 0000000000000101 RBX: 0000000000000086 RCX: 0000000000000001
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484081] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff881fff991ba8
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484083] RBP: ffff883ff2113c58 R08: 0000000000000101 R09: ffff883ff082e200
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484084] R10: 0000000000002e04 R11: 0000000000002e04 R12: ffff881fff997c60
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484085] R13: ffff881fff991ba8 R14: 0000000000000000 R15: ffff881fff997300
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484087] FS: 000000000000000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484088] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484090] CR2: 00007f7caaa23020 CR3: 0000001f46740000 CR4: 0000000000160670
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484091] Stack:
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484092] ffff883ff2113c68 ffffffff811870eb ffff883ff2113c80 ffffffff81819907
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484094] ffff881fff991ba0 ffff883ff2113cb0 ffffffff8111c600 ffff881fff997300
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484096] ffff881fff997c90 ffff881ff03dd400 0000000000000000 ffff883ff2113cc0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484098] Call Trace:
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484105] [<ffffffff81187
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484109] [<ffffffff81819
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484113] [<ffffffff8111c
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484116] [<ffffffff8111c
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484119] [<ffffffff810bb
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484122] [<ffffffff810bb
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484126] [<ffffffff81816
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484132] [<ffffffff810a2
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484134] [<ffffffff81816
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484136] [<ffffffff810a2
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484139] [<ffffffff8109f
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484141] [<ffffffff8109f
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484143] [<ffffffff81819
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484144] [<ffffffff8109f
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.644471] NMI watchdog: Watchdog detected hard LOCKUP on cpu 6
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651086] Modules linked in: <...>
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651342] CPU: 6 PID: 204932 Comm: ceph-osd Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651344] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651345] task: ffff881ff03dd400 ti: ffff883cda77c000 task.ti: ffff883cda77c000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651347] RIP: 0010:[<
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651353] RSP: 0000:ffff883cda
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651354] RAX: 0000000000000001 RBX: ffff883ff2a76200 RCX: 0000000000000000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651355] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff883ff2a768d4
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651356] RBP: ffff883cda77fab8 R08: 000000000000000a R09: ffff881ff03dd400
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651357] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000017300
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651359] R13: ffff883ff2a768d4 R14: 0000000000000046 R15: 0000000000000000
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651360] FS: 00007ff8ecbc970
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651362] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651363] CR2: 0000000014583550 CR3: 0000003d4ac96000 CR4: 0000000000160670
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651364] Stack:
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651365] 0000000000000202 ffff883cda77fa98 0000000000000003 0000000000000006
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651368] 000000000000000a ffff883cda77fb70 ffff883fff011ba0 ffff881fff991ba0
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651370] ffff883cda77fac8 ffffffff810ab035 ffff883cda77fbc8 ffffffff8111cc22
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651372] Call Trace:
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651375] [<ffffffff810ab
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651379] [<ffffffff8111c
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651382] [<ffffffff8111c
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651384] [<ffffffff810b5
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651386] [<ffffffff8111c
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651388] [<ffffffff810aa
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651390] [<ffffffff810ab
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651392] [<ffffffff810b4
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651394] [<ffffffff810b5
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651396] [<ffffffff810b9
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651400] [<ffffffff8175e
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651404] [<ffffffff811da
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651408] [<ffffffff811b7
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651411] [<ffffffff816f6
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651413] [<ffffffff811b9
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651417] [<ffffffff81069
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651419] [<ffffffff8106a
Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651423] [<ffffffff8181c
Changed in linux (Ubuntu Xenial): | |
status: | Confirmed → Fix Committed |
Changed in linux (Ubuntu Bionic): | |
status: | Confirmed → Fix Committed |
Analysis
--------
The 1st hard lockup is harder to get the interesting data out of, as apparently the registers with variables
related to the cpu number have been clobbered by more recent calls in the spinlock path.
Looking at the 2nd hard lockup:
addr2line + code shows us that try_to_wake_up() in line 1997 is indeed looping with IRQs disabled in line 1939 (thus a hard lockup):
$ addr2line -pifae ddeb-116. 140/usr/ lib/debug/ boot/vmlinux- 4.4.0-116- generic 0xffffffff810aacb6 10aacb6: try_to_wake_up at /build/ linux-lts- xenial- ozsla7/ linux-lts- xenial- 4.4.0/kernel/ sched/core. c:1997
0xffffffff8
1926 static int wake_up( struct task_struct *p, unsigned int state, int wake_flags) lock_irqsave( &p->pi_ lock, flags); unlock_ irqrestore( &p->pi_ lock, flags);
1927 try_to_
1928 {
...
1939 raw_spin_
...
1993 /*
1994 * If the owning (remote) cpu is still in the middle of schedule() with
1995 * this task as prev, wait until its done referencing the task.
1996 */
1997 while (p->on_cpu)
1998 cpu_relax();
...
2027 raw_spin_
2028
2029 return success;
2030 }
The objdump disassembly of try_to_wake_up() in vmlinux for the RIP instruction address (ffffffff810aacb6),
shows a while loop that just checks for non-zero 'p->on_cpu' and calls cpu_relax() (which translates to the 'pause' instruction):
ffffffff810 aacb1: f3 90 pause aacb3: 8b 43 28 mov 0x28(%rbx),%eax aacb6: 85 c0 test %eax,%eax aacb8: 75 f7 jne ffffffff810aacb1 <try_to_ wake_up+ 0x81>
ffffffff810
ffffffff810
ffffffff810
So, it checks for the value in pointer in RBX + offset 0x28, which according to the 'pahole' tool, is indeed the 'on_cpu' field:
$ pahole --hex -C task_struct ddeb-116. 140/usr/ lib/debug/ boot/vmlinux- 4.4.0-116- generic | grep on_cpu
int on_cpu; /* 0x28 0x4 */
So, the task_struct pointer is in RBX, which is:
RBX: ffff883ff2a76200
And that matches the other hard locked up task on CPU 10 (see its 'task:' field).
Per the stack trace in CPU 10, and the identical timestamp of the two hard lockup messages, and the fact both stack traces are cpu_stopper related,
it does look like CPU 10 is waiting on the spinlock of one of the 2 cpu stoppers held by CPU 6, which is exactly the scenario in the suggested patch.
The problem/fix has been verified with a synthetic test-case (attached).
commit 0b26351b910fb8f e6a056f8a1bbcca be50c0e19f
Author: Peter Zijlstra <email address hidden>
Date: Fri Apr 20 11:50:05 2018 +0200
stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock
Matt reported the following deadlock:
CPU0 CPU1
schedule( .prev=migrate/ 0) <fault> next_task( ) ...
idle_balance( ) migrate_swap()
active_ balance( ) stop_two_cpus()
...
pick_