Watchdog error about hard lockup
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
The Ubuntu-power-systems project | Fix Released | High | Frank Heimes |
linux (Ubuntu) | Fix Released | High | Canonical Kernel Team |
Bug Description
---Problem Description---
Got a message from Watchdog about self-detected hard LOCKUP
---uname output---
Linux power 5.0.0-23-generic #24~18.04.1-Ubuntu SMP Mon Jul 29 16:08:34 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
---Additional Hardware Info---
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 4
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 6
Model: 2.2 (pvr 004e 1202)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2300.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-63
NUMA node8 CPU(s): 64-127
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):
---
free
total used free shared buff/cache available
Mem: 1071807104 5110016 985192768 6229440 81504320 1056273664
Swap: 2097088 0 2097088
--
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 894.3G 0 disk
├─sda1 8:1 1 7M 0 part
└─sda2 8:2 1 894.3G 0 part /
sdb 8:16 1 894.3G 0 disk
nvme0n1 259:1 0 2.9T 0 disk /nvmdisk1
---
Machine Type = AC922, bare metal
---Steps to Reproduce---
I encountered this problem while running a customer workload: I switched the SMT level from SMT2 to SMT1 and got a
lockup error right away. This seems to be a different one... The postgresql DB daemon was running on the system.
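For orientation, the effect of the SMT switch on CPU numbering can be sketched as follows. This is a minimal illustration, not part of the original report. It assumes the hardware threads of each core are numbered consecutively (CPUs 4k to 4k+3 with 4 threads per core, as in the lscpu output above), and the helper names `online_cpus` and `offlined_by_switch` are hypothetical. Under that assumption, going from SMT2 to SMT1 takes CPUs 1, 5, 9, 13, ... offline, which matches the "no longer affine to CPU1/CPU5/CPU9/CPU13" IRQ messages at the end of the log.

```python
def online_cpus(total_cpus, threads_per_core, smt_level):
    """CPUs that stay online at a given SMT level.

    Assumes sibling threads of a core are numbered consecutively,
    so thread index within the core is cpu % threads_per_core.
    """
    return [cpu for cpu in range(total_cpus)
            if cpu % threads_per_core < smt_level]


def offlined_by_switch(total_cpus, threads_per_core, old_smt, new_smt):
    """CPUs that go offline when lowering the SMT level."""
    before = set(online_cpus(total_cpus, threads_per_core, old_smt))
    after = set(online_cpus(total_cpus, threads_per_core, new_smt))
    return sorted(before - after)


if __name__ == "__main__":
    # Values from the lscpu output: 128 CPUs, 4 threads per core.
    gone = offlined_by_switch(128, 4, old_smt=2, new_smt=1)
    print(gone[:4])   # first CPUs taken offline by SMT2 -> SMT1
    print(len(gone))  # number of CPUs taken offline
```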
Stack trace output:
[756383.688067] watchdog: CPU 53 self-detected hard LOCKUP @ _raw_spin_
[756383.688068] watchdog: CPU 53 TB:387344180861438, last heartbeat TB:387337108856720 (13812ms ago)
[756383.688069] Modules linked in: binfmt_misc veth ipt_MASQUERADE nf_conntrack_
[756383.688088] CPU: 53 PID: 119744 Comm: postgres Not tainted 5.0.0-23-generic #24~18.04.1-Ubuntu
[756383.688088] NIP: c000000000e0fcc4 LR: c00000000015fd90 CTR: c000000000600460
[756383.688089] REGS: c000007fffb3bd70 TRAP: 0900 Not tainted (5.0.0-23-generic)
[756383.688089] MSR: 9000000000009033 <SF,HV,
[756383.688091] CFAR: c000000000e0fcec IRQMASK: 1
[756383.688092] GPR00: c00000000015fd90 c000206f2cdf7970 c00000000185c700 c00020732ea49100
[756383.688093] GPR04: c000206f2cdf7a38 0000000000000000 c000206f2cdf7b00 0000000000000001
[756383.688095] GPR08: 0000000000000003 000000008000007d 0000000080000035 fffffffffffffffd
[756383.688096] GPR12: 0000000000002000 c000007ffffc5080 00007cde07504dd8 00000f495eee0d68
[756383.688097] GPR16: 00007fffc0eb2bd7 00007fffc0eb2aa0 00000f496c289088 00007fffc0eb2a74
[756383.688098] GPR20: 0000000000000000 0000000000000001 0000000000000001 0000000000000000
[756383.688099] GPR24: 0000000000000000 c000206f2cdf7a38 c000000001349100 000020732d700000
[756383.688100] GPR28: c000000001891c70 c000206f36d8b400 c000000001895c78 c00020732ea49100
[756383.688102] NIP [c000000000e0fcc4] _raw_spin_
[756383.688102] LR [c00000000015fd90] __task_
[756383.688102] Call Trace:
[756383.688103] [c000206f2cdf7970] [c000206f2cdf79d0] 0xc000206f2cdf79d0 (unreliable)
[756383.688103] [c000206f2cdf79a0] [c000007fd3847818] 0xc000007fd3847818
[756383.688104] [c000206f2cdf7a10] [c0000000001649c0] try_to_
[756383.688105] [c000206f2cdf7aa0] [c000000000164de0] wake_up_q+0x70/0xd0
[756383.688105] [c000206f2cdf7ae0] [c0000000005fab54] do_semtimedop+
[756383.688106] [c000206f2cdf7d60] [c0000000005fc634] ksys_semtimedop
[756383.688107] [c000206f2cdf7dc0] [c00000000060047c] sys_ipc+0x14c/0x470
[756383.688107] [c000206f2cdf7e20] [c00000000000b288] system_
[756383.688108] Instruction dump:
[756383.688108] 40c20010 7d40192d 40c2fff0 7c2004ac 2fa90000 4d9e0020 fbc1fff0 3fc20004
[756383.688110] 3bde9578 fbe1fff8 7c7f1b78 f821ffd1 <7c210b78> e93e0000 75290010 41820014
[756386.336267] watchdog: CPU 53 became unstuck TB:387345536789288
[756386.336292] CPU: 53 PID: 330 Comm: migration/53 Not tainted 5.0.0-23-generic #24~18.04.1-Ubuntu
[756386.336294] Call Trace:
[756386.336301] [c000007fed49fb40] [c000000000dea90c] dump_stack+
[756386.336307] [c000007fed49fb80] [c0000000000342dc] wd_smp_
[756386.336311] [c000007fed49fc30] [c00000000022909c] multi_cpu_
[756386.336313] [c000007fed49fc90] [c0000000002294bc] cpu_stopper_
[756386.336317] [c000007fed49fd40] [c000000000157d00] smpboot_
[756386.336321] [c000007fed49fdb0] [c000000000151608] kthread+0x1a8/0x1b0
[756386.336324] [c000007fed49fe20] [c00000000000b65c] ret_from_
[771875.432658] irq_migrate_
[771875.432660] IRQ 110: no longer affine to CPU1
[771875.432694] IRQ 194: no longer affine to CPU1
[771875.498115] IRQ 192: no longer affine to CPU5
[771875.498124] IRQ 193: no longer affine to CPU5
[771875.498133] IRQ 201: no longer affine to CPU5
[771875.551051] IRQ 153: no longer affine to CPU9
[771875.551073] IRQ 229: no longer affine to CPU9
[771875.551149] IRQ 543: no longer affine to CPU9
[771875.602160] IRQ 199: no longer affine to CPU13
[771875.602170] IRQ 226: no longer affine to CPU13
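The timebase (TB) values in the watchdog messages above can be cross-checked with a little arithmetic. The sketch below assumes a timebase frequency of 512 MHz, which is not stated anywhere in this report; the helper name `tb_delta_ms` is hypothetical. With that assumption, the gap between the last heartbeat and the lockup report reproduces the "13812ms ago" printed by the kernel, and the gap between the lockup report and the "became unstuck" message agrees with the roughly 2.65 s difference between the dmesg timestamps 756383.688 and 756386.336.

```python
TB_HZ = 512_000_000  # assumed timebase frequency, not taken from the report


def tb_delta_ms(tb_now, tb_then, tb_hz=TB_HZ):
    """Elapsed whole milliseconds between two timebase readings."""
    return (tb_now - tb_then) * 1000 // tb_hz


# Timebase values copied from the watchdog messages.
lockup_tb = 387344180861438
heartbeat_tb = 387337108856720
unstuck_tb = 387345536789288

print(tb_delta_ms(lockup_tb, heartbeat_tb))  # 13812
print(tb_delta_ms(unstuck_tb, lockup_tb))    # 2648
```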
== <email address hidden> ==
These false positives will also probably be fixed by the following commit,
which reads:
From 7ae3f6e130e8dc6
From: Nicholas Piggin <email address hidden>
Date: Tue, 9 Apr 2019 14:40:05 +1000
Subject: [PATCH] powerpc/watchdog: Use hrtimers for per-CPU heartbeat
Using a jiffies timer creates a dependency on the tick_do_timer_cpu
incrementing jiffies. If that CPU has locked up and jiffies is not
incrementing, the watchdog heartbeat timer for all CPUs stops and
creates false positives and confusing warnings on local CPUs, and
also causes the SMP detector to stop, so the root cause is never
detected.
Fix this by using hrtimer based timers for the watchdog heartbeat,
like the generic kernel hardlockup detector.
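The dependency described in the commit message can be illustrated with a toy model. This is only a sketch of the reasoning, not kernel code: the function `missed_heartbeats` and its parameters are hypothetical, and time is reduced to discrete steps. In "jiffies" mode a heartbeat fires only if the tick CPU is alive to advance jiffies; in "hrtimer" mode each CPU's heartbeat depends only on that CPU.

```python
def missed_heartbeats(n_cpus, tick_cpu, stuck_cpu, steps, timer):
    """Return the set of CPUs whose heartbeat stopped (toy model).

    timer == "jiffies": heartbeats fire only while jiffies advances,
                        which requires tick_cpu not to be stuck.
    timer == "hrtimer": each CPU's heartbeat is independent of jiffies.
    """
    last_beat = [0] * n_cpus
    jiffies_advancing = tick_cpu != stuck_cpu
    for now in range(1, steps + 1):
        for cpu in range(n_cpus):
            if cpu == stuck_cpu:
                continue  # a locked-up CPU never runs its heartbeat
            if timer == "hrtimer" or jiffies_advancing:
                last_beat[cpu] = now
    return {cpu for cpu in range(n_cpus) if last_beat[cpu] < steps}


if __name__ == "__main__":
    # The tick CPU itself locks up: with jiffies timers every CPU looks stuck.
    print(missed_heartbeats(4, tick_cpu=0, stuck_cpu=0, steps=10, timer="jiffies"))
    # With per-CPU hrtimers only the CPU that is really stuck is flagged.
    print(missed_heartbeats(4, tick_cpu=0, stuck_cpu=0, steps=10, timer="hrtimer"))
```

In this model the jiffies variant flags all four CPUs when the tick CPU is the one that locked up, which corresponds to the false positives on otherwise healthy CPUs that the commit message describes.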
tags: | added: architecture-ppc64le bugnameltc-180737 severity-high targetmilestone-inin18041 |
Changed in ubuntu: | |
assignee: | nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) |
affects: | ubuntu → linux (Ubuntu) |
Changed in ubuntu-power-systems: | |
assignee: | nobody → Canonical Kernel Team (canonical-kernel-team) |
importance: | Undecided → High |
Changed in linux (Ubuntu): | |
status: | New → Confirmed |
Changed in ubuntu-power-systems: | |
status: | New → Confirmed |
Changed in ubuntu-power-systems: | |
assignee: | Canonical Kernel Team (canonical-kernel-team) → Frank Heimes (frank-heimes) |
Changed in ubuntu-power-systems: | |
status: | Confirmed → Fix Committed |
Changed in linux (Ubuntu): | |
status: | Confirmed → Fix Committed |
Changed in linux (Ubuntu): | |
importance: | Undecided → High |
assignee: | Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team) |
Marked as "Fix Committed" as the patchset was picked up automatically by the latest 5.0 stable sync.