Watchdog error about hard lockup

Bug #1842465 reported by bugproxy on 2019-09-03
This bug affects 1 person
Affects                            Status        Importance  Assigned to            Milestone
The Ubuntu-power-systems project   Fix Released  High        Frank Heimes
linux (Ubuntu)                     Fix Released  High        Canonical Kernel Team

Bug Description

---Problem Description---
Got a message from Watchdog about self-detected hard LOCKUP

---uname output---
Linux power 5.0.0-23-generic #24~18.04.1-Ubuntu SMP Mon Jul 29 16:08:34 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 4
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 6
Model: 2.2 (pvr 004e 1202)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2300.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-63
NUMA node8 CPU(s): 64-127
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):
---
free
              total        used        free      shared  buff/cache   available
Mem:     1071807104     5110016   985192768     6229440    81504320  1056273664
Swap:       2097088           0     2097088
--
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 894.3G 0 disk
├─sda1 8:1 1 7M 0 part
└─sda2 8:2 1 894.3G 0 part /
sdb 8:16 1 894.3G 0 disk
nvme0n1 259:1 0 2.9T 0 disk /nvmdisk1
---

Machine Type = AC922, bare metal

---Steps to Reproduce---
I encountered this problem while running a customer workload. When I switched the SMT level from SMT2 to SMT1,
I got a lockup error right away. This seems to be a different one. A PostgreSQL DB daemon was running on the system.
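
For readers mapping the SMT switch to the log below: the "no longer affine to CPU1 / CPU5 / CPU9 / CPU13" messages at the end of the trace are consistent with the second thread of each core going offline. The following is a minimal sketch of that numbering, assuming the threads of a core are numbered consecutively and that SMT level n keeps the first n threads of each core online. The function name `cpus_taken_offline` is hypothetical and not part of any tool mentioned in this report.

```python
def cpus_taken_offline(total_cpus, threads_per_core, smt_from, smt_to):
    """Return the logical CPUs that go offline when lowering the SMT level.

    Assumption: CPU number modulo threads_per_core is the thread index
    within its core, and SMT level n keeps thread indices 0..n-1 online.
    """
    offline = []
    for cpu in range(total_cpus):
        thread = cpu % threads_per_core
        # Threads that were online at smt_from but are not kept at smt_to.
        if smt_to <= thread < smt_from:
            offline.append(cpu)
    return offline


# Values from the hardware info above: 128 CPUs, 4 threads per core.
offline = cpus_taken_offline(128, 4, smt_from=2, smt_to=1)
print(offline[:4])   # [1, 5, 9, 13]
print(len(offline))  # 32
```

Under these assumptions, one thread per core (32 CPUs in total) is taken offline by the SMT2 to SMT1 switch, and their interrupts have to be migrated to the remaining CPUs.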

Stack trace output:
 [756383.688067] watchdog: CPU 53 self-detected hard LOCKUP @ _raw_spin_lock+0x54/0xe0
[756383.688068] watchdog: CPU 53 TB:387344180861438, last heartbeat TB:387337108856720 (13812ms ago)
[756383.688069] Modules linked in: binfmt_misc veth ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter bpfilter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc aufs overlay vmx_crypto ofpart cmdlinepart powernv_flash ipmi_powernv opal_prd mtd ipmi_devintf at24 ibmpowernv ipmi_msghandler uio_pdrv_genirq uio sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core ast crct10dif_vpmsum i2c_algo_bit crc32c_vpmsum ttm mlx5_core drm_kms_helper syscopyarea nvme sysfillrect sysimgblt fb_sys_fops drm nvme_core ahci libahci tls mlxfw devlink tg3 drm_panel_orientation_quirks
[756383.688088] CPU: 53 PID: 119744 Comm: postgres Not tainted 5.0.0-23-generic #24~18.04.1-Ubuntu
[756383.688088] NIP: c000000000e0fcc4 LR: c00000000015fd90 CTR: c000000000600460
[756383.688089] REGS: c000007fffb3bd70 TRAP: 0900 Not tainted (5.0.0-23-generic)
[756383.688089] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28242824 XER: 00000000
[756383.688091] CFAR: c000000000e0fcec IRQMASK: 1
[756383.688092] GPR00: c00000000015fd90 c000206f2cdf7970 c00000000185c700 c00020732ea49100
[756383.688093] GPR04: c000206f2cdf7a38 0000000000000000 c000206f2cdf7b00 0000000000000001
[756383.688095] GPR08: 0000000000000003 000000008000007d 0000000080000035 fffffffffffffffd
[756383.688096] GPR12: 0000000000002000 c000007ffffc5080 00007cde07504dd8 00000f495eee0d68
[756383.688097] GPR16: 00007fffc0eb2bd7 00007fffc0eb2aa0 00000f496c289088 00007fffc0eb2a74
[756383.688098] GPR20: 0000000000000000 0000000000000001 0000000000000001 0000000000000000
[756383.688099] GPR24: 0000000000000000 c000206f2cdf7a38 c000000001349100 000020732d700000
[756383.688100] GPR28: c000000001891c70 c000206f36d8b400 c000000001895c78 c00020732ea49100
[756383.688102] NIP [c000000000e0fcc4] _raw_spin_lock+0x54/0xe0
[756383.688102] LR [c00000000015fd90] __task_rq_lock+0x80/0x150
[756383.688102] Call Trace:
[756383.688103] [c000206f2cdf7970] [c000206f2cdf79d0] 0xc000206f2cdf79d0 (unreliable)
[756383.688103] [c000206f2cdf79a0] [c000007fd3847818] 0xc000007fd3847818
[756383.688104] [c000206f2cdf7a10] [c0000000001649c0] try_to_wake_up+0x380/0x710
[756383.688105] [c000206f2cdf7aa0] [c000000000164de0] wake_up_q+0x70/0xd0
[756383.688105] [c000206f2cdf7ae0] [c0000000005fab54] do_semtimedop+0x474/0xcc0
[756383.688106] [c000206f2cdf7d60] [c0000000005fc634] ksys_semtimedop+0xd4/0xf0
[756383.688107] [c000206f2cdf7dc0] [c00000000060047c] sys_ipc+0x14c/0x470
[756383.688107] [c000206f2cdf7e20] [c00000000000b288] system_call+0x5c/0x70
[756383.688108] Instruction dump:
[756383.688108] 40c20010 7d40192d 40c2fff0 7c2004ac 2fa90000 4d9e0020 fbc1fff0 3fc20004
[756383.688110] 3bde9578 fbe1fff8 7c7f1b78 f821ffd1 <7c210b78> e93e0000 75290010 41820014
[756386.336267] watchdog: CPU 53 became unstuck TB:387345536789288
[756386.336292] CPU: 53 PID: 330 Comm: migration/53 Not tainted 5.0.0-23-generic #24~18.04.1-Ubuntu
[756386.336294] Call Trace:
[756386.336301] [c000007fed49fb40] [c000000000dea90c] dump_stack+0xb0/0xf4 (unreliable)
[756386.336307] [c000007fed49fb80] [c0000000000342dc] wd_smp_clear_cpu_pending+0x41c/0x430
[756386.336311] [c000007fed49fc30] [c00000000022909c] multi_cpu_stop+0x14c/0x210
[756386.336313] [c000007fed49fc90] [c0000000002294bc] cpu_stopper_thread+0xfc/0x1e0
[756386.336317] [c000007fed49fd40] [c000000000157d00] smpboot_thread_fn+0x270/0x2c0
[756386.336321] [c000007fed49fdb0] [c000000000151608] kthread+0x1a8/0x1b0
[756386.336324] [c000007fed49fe20] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
[771875.432658] irq_migrate_all_off_this_cpu: 91 callbacks suppressed
[771875.432660] IRQ 110: no longer affine to CPU1
[771875.432694] IRQ 194: no longer affine to CPU1
[771875.498115] IRQ 192: no longer affine to CPU5
[771875.498124] IRQ 193: no longer affine to CPU5
[771875.498133] IRQ 201: no longer affine to CPU5
[771875.551051] IRQ 153: no longer affine to CPU9
[771875.551073] IRQ 229: no longer affine to CPU9
[771875.551149] IRQ 543: no longer affine to CPU9
[771875.602160] IRQ 199: no longer affine to CPU13
[771875.602170] IRQ 226: no longer affine to CPU13
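
The timebase (TB) values in the watchdog lines above can be cross-checked by hand. The sketch below assumes a timebase frequency of 512 MHz, which is not stated in this report; the constant `TB_HZ` and the helper `tb_delta_ms` are illustrative names only.

```python
TB_HZ = 512_000_000  # assumed timebase frequency; not given in the report


def tb_delta_ms(tb_now, tb_then, tb_hz=TB_HZ):
    """Convert a difference between two timebase readings to whole milliseconds."""
    return (tb_now - tb_then) * 1000 // tb_hz


# "CPU 53 TB:387344180861438, last heartbeat TB:387337108856720"
stuck_ms = tb_delta_ms(387344180861438, 387337108856720)
print(stuck_ms)    # 13812

# "CPU 53 became unstuck TB:387345536789288"
unstuck_ms = tb_delta_ms(387345536789288, 387344180861438)
print(unstuck_ms)  # 2648
```

With that assumed frequency, the first result matches the "13812ms ago" printed by the watchdog, and the second matches the roughly 2.65 s between the kernel timestamps 756383.688068 and 756386.336267.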

== <email address hidden> ==
Also these false positives will probably be fixed by the commit

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7ae3f6e130e8dc6188b59e3b4ebc2f16e9c8d053

which reads
From 7ae3f6e130e8dc6188b59e3b4ebc2f16e9c8d053 Mon Sep 17 00:00:00 2001
From: Nicholas Piggin <email address hidden>
Date: Tue, 9 Apr 2019 14:40:05 +1000
Subject: [PATCH] powerpc/watchdog: Use hrtimers for per-CPU heartbeat

Using a jiffies timer creates a dependency on the tick_do_timer_cpu
incrementing jiffies. If that CPU has locked up and jiffies is not
incrementing, the watchdog heartbeat timer for all CPUs stops and
creates false positives and confusing warnings on local CPUs, and
also causes the SMP detector to stop, so the root cause is never
detected.

Fix this by using hrtimer based timers for the watchdog heartbeat,
like the generic kernel hardlockup detector.
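
The dependency described in the commit message can be illustrated with a toy model. This is not the kernel implementation, only a sketch under simplifying assumptions: time advances in discrete steps, one CPU is locked up from the start, and a jiffies-based heartbeat fires only while the tick CPU is still advancing jiffies. All names (`flagged_cpus`, `tick_cpu`, `locked_cpu`) are hypothetical.

```python
def flagged_cpus(num_cpus, tick_cpu, locked_cpu, steps, timeout, use_hrtimer):
    """Toy model: return the CPUs whose last heartbeat is older than `timeout` steps."""
    last_beat = [0] * num_cpus
    for now in range(1, steps + 1):
        # Jiffies only advance while the CPU responsible for the tick is alive.
        jiffies_advance = tick_cpu != locked_cpu
        for cpu in range(num_cpus):
            if cpu == locked_cpu:
                continue  # a locked-up CPU never runs its heartbeat
            # An hrtimer-style heartbeat fires independently of jiffies.
            if use_hrtimer or jiffies_advance:
                last_beat[cpu] = now
    return [cpu for cpu in range(num_cpus) if steps - last_beat[cpu] > timeout]


# The tick CPU itself locks up: with jiffies timers every CPU looks stuck.
print(flagged_cpus(4, tick_cpu=0, locked_cpu=0, steps=20, timeout=10, use_hrtimer=False))
# [0, 1, 2, 3]

# With per-CPU hrtimer-style heartbeats only the locked CPU is flagged.
print(flagged_cpus(4, tick_cpu=0, locked_cpu=0, steps=20, timeout=10, use_hrtimer=True))
# [0]
```

In this simplified model, the jiffies-based variant reports healthy CPUs as locked up (the false positives mentioned in the commit message), while the hrtimer-based variant isolates the CPU that is actually stuck.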

bugproxy (bugproxy) on 2019-09-03
tags: added: architecture-ppc64le bugnameltc-180737 severity-high targetmilestone-inin18041
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
Changed in linux (Ubuntu):
status: New → Confirmed
Changed in ubuntu-power-systems:
status: New → Confirmed
Changed in ubuntu-power-systems:
assignee: Canonical Kernel Team (canonical-kernel-team) → Frank Heimes (frank-heimes)
Changed in ubuntu-power-systems:
status: Confirmed → Fix Committed
Changed in linux (Ubuntu):
status: Confirmed → Fix Committed
Manoj Iyer (manjo) on 2019-09-12
Changed in linux (Ubuntu):
importance: Undecided → High
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)
Andrew Cloke (andrew-cloke) wrote :

Marked as "Fix Committed" as the patchset was picked up automatically by the latest 5.0 stable sync.

Frank Heimes (frank-heimes) wrote :

The patch has meanwhile landed in disco's release pocket, hence adjusting to Fix Released.

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released