kdump fails to take dump with smt set to 2, hmc dumpstart

Bug #1776211 reported by bugproxy
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
The Ubuntu-power-systems project | Invalid | High | Canonical Kernel Team |
linux (Ubuntu) | Invalid | High | Canonical Kernel Team |
  Artful | Invalid | High | Joseph Salisbury |
makedumpfile (Ubuntu) | Invalid | High | Canonical Kernel Team |
  Artful | Invalid | High | Canonical Kernel Team |

Bug Description

== SRU Justification ==
IBM has requested these three commits in Artful. In Artful, kdump fails to
capture a dump when SMT is set to 2 or off.

Including these three commits allows kdump to work properly.

== Fixes ==
4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path")
04b9c96eae72 ("powerpc/crash: Remove the test for cpu_online in the IPI callback")
4552d128c26e ("powerpc: System reset avoid interleaving oops using die synchronisation")

== Regression Potential ==
Low. Fixes are limited to powerpc.

== Test Case ==
A test kernel was built with these patches and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

---Problem Description---
kdump fails to take dump with smt set to 2, hmc dumpstart

---Issue observed---
[ 0.004111] Oops: Exception in kernel mode, sig: 4 [#1]
[ 0.004118] SMP NR_CPUS=2048
[ 0.004120] NUMA
[ 0.004125] pSeries
[ 0.004132] Modules linked in:
[ 0.004142] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-12-generic #13-Ubuntu
[ 0.004153] task: c000000046715900 task.stack: c000000046134000
[ 0.004162] NIP: c000000000006468 LR: c00000000801764c CTR: 00000000006cdc70
[ 0.004173] REGS: c000000047fe3ce0 TRAP: 0700 Not tainted (4.13.0-12-generic)
[ 0.004181] MSR: 8000000000081031 <SF,ME,IR,DR,LE>
[ 0.004193] CR: 88042222 XER: 20000003
[ 0.004204] CFAR: c000000000006454 SOFTE: 0
[ 0.004204] GPR00: c00000000801764c c000000047fe3f60 c0000000095e3000 0000000000000000
[ 0.004204] GPR04: 0000000000000001 0000000000000002 ffffffffffffffff ffffffffffffffdf
[ 0.004204] GPR08: 0000000000000000 0000000028042222 0000000000000002 0000000000000002
[ 0.004204] GPR12: 0000000000000000 c00000000fff0000 c000000046137f90 000000000b5452d8
[ 0.004204] GPR16: fffffffffffffffd 00000000089ffd10 0000000001360000 000000000b55d378
[ 0.004204] GPR20: 0000000000000060 000000001eca0000 000000000a6c0000 0000000000000007
[ 0.004204] GPR24: 0000000000000000 0000000000000000 c000000009621ed0 0000000000000000
[ 0.004204] GPR28: 0000000000000000 c000000046134000 c000000046137c80 c000000009105df8
[ 0.004328] NIP [c000000000006468] 0xc000000000006468
[ 0.004338] LR [c00000000801764c] __do_irq+0x4c/0x1c0
[ 0.004345] Call Trace:
[ 0.004354] [c000000047fe3f60] [c00000000801764c] __do_irq+0x4c/0x1c0 (unreliable)
[ 0.004368] [c000000047fe3f90] [c00000000802ab70] call_do_irq+0x14/0x24
[ 0.004380] [c000000046137bc0] [c00000000801785c] do_IRQ+0x9c/0x130
[ 0.004393] [c000000046137c10] [c000000008008ac4] hardware_interrupt_common+0x114/0x120
[ 0.004409] --- interrupt: 501 at arch_local_irq_restore+0x5c/0x90
[ 0.004409] LR = arch_local_irq_restore+0x40/0x90
[ 0.004423] [c000000046137f00] [0000000000000005] 0x5 (unreliable)
[ 0.004436] [c000000046137f20] [c000000008049824] start_secondary+0x324/0x350
[ 0.004450] [c000000046137f90] [c00000000800aa6c] start_secondary_prolog+0x10/0x14
[ 0.004460] Instruction dump:
[ 0.004467] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 0.004484] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 0.004506] ---[ end trace 3e5a2a9047ef3cd0 ]---
[ 0.004512]
[ 0.004518] Oops: Exception in kernel mode, sig: 4 [#2]
[ 0.004525] SMP NR_CPUS=2048
[ 0.004526] NUMA
[ 0.004532] pSeries
[ 0.004540] Modules linked in:
[ 0.004550] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G D 4.13.0-12-generic #13-Ubuntu
[ 0.004561] task: c000000009579f00 task.stack: c0000000095dc000
[ 0.004569] NIP: c000000000006460 LR: c0000000080b6e80 CTR: 0000000000000000
[ 0.004580] REGS: c0000000095dfb20 TRAP: 0700 Tainted: G D (4.13.0-12-generic)
[ 0.004589] MSR: 8000000000081031 <SF,ME,IR,DR,LE>
[ 0.004599] CR: 22002228 XER: 20000004
[ 0.004611] CFAR: c00000000000493c SOFTE: 0
[ 0.004611] GPR00: 0000000000000000 c0000000095dfda0 c0000000095e3000 0000000000000000
[ 0.004611] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.004611] GPR08: 0000000000000000 0000000022002228 000000007fffffff 0000000000000008
[ 0.004611] GPR12: 000000000000ffff c00000000fff0a80 c000000c7e137f90 0000000009980600
[ 0.004611] GPR16: 000000001ec70000 0000000000000001 0000000000000000 0000000000000000
[ 0.004611] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000007
[ 0.004611] GPR24: 0000000000000008 c000000008000000 0000000008000000 0000000000000000
[ 0.004611] GPR28: 0000000000000000 0000000000000008 c000000009621ed0 c000000009622354
[ 0.004729] NIP [c000000000006460] 0xc000000000006460
[ 0.004739] LR [c0000000080b6e80] pseries_lpar_idle+0x30/0x50
[ 0.004746] Call Trace:
[ 0.004756] [c0000000095dfda0] [c0000000095dfe90] init_thread_union+0x3e90/0x4000 (unreliable)
[ 0.004771] [c0000000095dfe00] [c00000000801e314] arch_cpu_idle+0x54/0x160
[ 0.004784] [c0000000095dfe30] [c000000008c6b92c] default_idle_call+0x4c/0x7c
[ 0.004798] [c0000000095dfe50] [c00000000815da14] do_idle+0x244/0x320
[ 0.004810] [c0000000095dfea0] [c00000000815dd28] cpu_startup_entry+0x38/0x50
[ 0.004823] [c0000000095dfed0] [c00000000800d2dc] rest_init+0xec/0x110
[ 0.004835] [c0000000095dff00] [c000000008fe40fc] start_kernel+0x584/0x5a4
[ 0.004848] [c0000000095dff90] [c00000000800ab7c] start_here_common+0x1c/0x520
[ 0.004857] Instruction dump:
[ 0.004864] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 0.004881] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 0.004899] ---[ end trace 3e5a2a9047ef3cd1 ]---
[ 0.004906]
[ 3.949808] Kernel panic - not syncing: Fatal exception in interrupt
[ 4.179808] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

When tried with maxcpus=1, the following is observed:

[ 3992.056997] Modules linked in: async_tx raid6_pq raid1 raid0 multipath linear ibmvscsi(+) crc32c_vpmsum
[ 3992.136992] CPU: 1 PID: 207 Comm: modprobe Not tainted 4.13.0-12-generic #13-Ubuntu
[ 3992.166991] task: c000000043719e00 task.stack: c0000000437c8000
[ 3992.206994] NIP: c0000000086d2530 LR: c0000000086d46f0 CTR: 0000000000000013
[ 3992.246996] REGS: c0000000437cb260 TRAP: 0901 Not tainted (4.13.0-12-generic)
[ 3992.276994] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
[ 3992.306995] CR: 24844442 XER: 20000000
[ 3992.366993] CFAR: c0000000086d2570 SOFTE: 1
[ 3992.366993] GPR00: ffffffffffffff68 c0000000437cb4e0 c0000000095e3000 c000000043c67e80
[ 3992.366993] GPR04: c000000043c67e80 c000000043c6bc00 ffffffffffffffed 39077b9925c55abe
[ 3992.366993] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000060
[ 3992.366993] GPR12: ffffffffffffff00 c00000000fac0a80
[ 3992.546994] NIP [c0000000086d2530] mpihelp_add_n+0x30/0x80
[ 3992.586990] LR [c0000000086d46f0] mpih_sqr_n+0x230/0x460
[ 3992.606991] Call Trace:
[ 3992.617082] [c0000000437cb4e0] [c0000000086d48c4] mpih_sqr_n+0x404/0x460 (unreliable)
[ 3992.636996] [c0000000437cb560] [c0000000086d4844] mpih_sqr_n+0x384/0x460
[ 3992.676996] [c0000000437cb5e0] [c0000000086d5778] mpi_powm+0x678/0xe50
[ 3992.716992] [c0000000437cb720] [c000000008619d40] _rsa_dec.isra.1+0x80/0xc0
[ 3992.746996] [c0000000437cb760] [c00000000861a094] rsa_verify+0x94/0x140
[ 3992.786994] [c0000000437cb7c0] [c00000000861af44] pkcs1pad_verify+0xd4/0x160
[ 3992.856995] [c0000000437cb800] [c000000008631510] public_key_verify_signature+0x240/0x4b0
[ 3992.896992] [c0000000437cb9a0] [c0000000086311d4] verify_signature+0x64/0x90
[ 3992.926997] [c0000000437cb9c0] [c000000008634690] pkcs7_validate_trust+0x190/0x2c0
[ 3992.976992] [c0000000437cba20] [c0000000082b2e30] verify_pkcs7_signature+0xc0/0x1f0
[ 3993.036993] [c0000000437cbad0] [c0000000081c8414] mod_verify_sig+0x94/0x100
[ 3993.076996] [c0000000437cbb40] [c0000000081c5054] load_module+0x264/0x1fc0
[ 3993.116992] [c0000000437cbd30] [c0000000081c70b4] SyS_finit_module+0xc4/0x130
[ 3993.176992] [c0000000437cbe30] [c00000000800b184] system_call+0x58/0x6c
[ 3993.226990] Instruction dump:
[ 3993.237018] 39400000 7cc600d0 7cc607b4 7cc930f8 78c01f24 79290020 7c0c0378 39290001
[ 3993.336994] 7d2903a6 60000000 60000000 60420000 <7d6c0050> 38c60001 7cc607b4 7d25582a
[ 4028.156997] xor: measuring software checksum speed
[ 4029.376998] 8regs : 16.000 MB/sec
[ 4030.676992] 8regs_prefetch: 16.000 MB/sec
[ 4031.716993] 32regs : 16.000 MB/sec
[ 4032.886994] 32regs_prefetch: 16.000 MB/sec
[ 4034.256993] altivec : 16.000 MB/sec
[ 4034.316994] xor: using function: altivec (16.000 MB/sec)
[ 4076.016995] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [modprobe:207]
[ 4076.046994] Modules linked in: xor async_tx raid6_pq raid1 raid0 multipath linear ibmvscsi(+) crc32c_vpmsum
[ 4076.126994] CPU: 1 PID: 207 Comm: modprobe Tainted: G L 4.13.0-12-generic #13-Ubuntu
[ 4076.186995] task: c000000043719e00 task.stack: c0000000437c8000
[ 4076.226993] NIP: c0000000086d224c LR: c0000000086d4404 CTR: 0000000000000008
[ 4076.256991] REGS: c0000000437cb190 TRAP: 0901 Tainted: G L (4.13.0-12-generic)
[ 4076.286994] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
[ 4076.326993] CR: 24884444 XER: 20000000
[ 4076.356998] CFAR: c0000000086d4400 SOFTE: 1
[ 4076.356998] GPR00: 5ebfd337ad53c297 c0000000437cb410 c0000000095e3000 c000000043c62910
[ 4076.356998] GPR04: c000000043c62800 fffffffffffffff8 00000000c68de1f2 0000000000000000
[ 4076.356998] GPR08: 761ab85da0153bf8 0000000000000008 0000000063cfb2b3 026231001e934591
[ 4076.356998] GPR12: 0000000000000038 c00000000fac0a80
[ 4076.556992] NIP [c0000000086d224c] mpihelp_addmul_1+0x4c/0xf0
[ 4076.596990] LR [c0000000086d4404] mpih_sqr_n_basecase+0xd4/0x190
[ 4076.607012] Call Trace:
[ 4076.636994] [c0000000437cb410] [0000000000000901] 0x901 (unreliable)
[ 4076.676992] [c0000000437cb460] [c0000000086d4644] mpih_sqr_n+0x184/0x460
[ 4076.736992] [c0000000437cb4e0] [c0000000086d4890] mpih_sqr_n+0x3d0/0x460
[ 4076.756995] [c0000000437cb560] [c0000000086d4844] mpih_sqr_n+0x384/0x460
[ 4076.816995] [c0000000437cb5e0] [c0000000086d5778] mpi_powm+0x678/0xe50
[ 4076.846996] [c0000000437cb720] [c000000008619d40] _rsa_dec.isra.1+0x80/0xc0
[ 4076.896992] [c0000000437cb760] [c00000000861a094] rsa_verify+0x94/0x140
[ 4076.946997] [c0000000437cb7c0] [c00000000861af44] pkcs1pad_verify+0xd4/0x160
[ 4076.976996] [c0000000437cb800] [c000000008631510] public_key_verify_signature+0x240/0x4b0
[ 4077.016993] [c0000000437cb9a0] [c0000000086311d4] verify_signature+0x64/0x90
[ 4077.046995] [c0000000437cb9c0] [c000000008634690] pkcs7_validate_trust+0x190/0x2c0
[ 4077.086997] [c0000000437cba20] [c0000000082b2e30] verify_pkcs7_signature+0xc0/0x1f0
[ 4077.136995] [c0000000437cbad0] [c0000000081c8414] mod_verify_sig+0x94/0x100
[ 4077.196996] [c0000000437cbb40] [c0000000081c5054] load_module+0x264/0x1fc0
[ 4077.236996] [c0000000437cbd30] [c0000000081c70b4] SyS_finit_module+0xc4/0x130
[ 4077.286997] [c0000000437cbe30] [c00000000800b184] system_call+0x58/0x6c
[ 4077.337015] Instruction dump:
[ 4077.366995] 7ca507b4 78c60020 7ca928f8 78bf1f24 79290020 7ffdfb78 39290001 38e00000
[ 4077.426994] 7d2903a6 7b9c83e4 60000000 60000000 <60420000> 7d9df850 38a50001 7ca507b4

Contact Information = <email address hidden>

---uname output---
Linux ltcalpine-lp9 4.13.0-12-generic #13-Ubuntu SMP Fri Sep 22 20:52:52 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux

Machine Type/Model = Power 8 pVM/8408-E8E

----Additional Info-----
# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinux-4.13.0-12-generic root=UUID=861097e8-43d3-4335-83d3-6db421e20564 ro crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
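
The crashkernel= value above uses the kernel's range syntax, in which each comma-separated entry maps a range of system RAM to a reservation size. As a rough illustration (not part of the original report), the sketch below picks the reservation for a given amount of RAM. It assumes the start of each range is inclusive and the end exclusive, as described in the kernel's kdump documentation. The helper names parse_size and crashkernel_reservation are hypothetical, and an optional @offset suffix is not handled.

```python
import re

# Size suffixes used in the cmdline above (binary multiples).
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}

# The range specification copied from /proc/cmdline in this report.
SPEC = "2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M"


def parse_size(text):
    """Convert a size such as '320M' or '4G' to bytes."""
    match = re.fullmatch(r"(\d+)([KMGT]?)", text.strip().upper())
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def crashkernel_reservation(spec, ram_bytes):
    """Return the bytes reserved for a machine with ram_bytes of RAM.

    Returns 0 when no range matches. Assumption: start is inclusive and
    end is exclusive; an empty end means 'no upper bound'.
    """
    for entry in spec.split(","):
        range_part, size_part = entry.split(":")
        start_text, _, end_text = range_part.partition("-")
        start = parse_size(start_text)
        end = parse_size(end_text) if end_text else None
        if ram_bytes >= start and (end is None or ram_bytes < end):
            return parse_size(size_part)
    return 0


if __name__ == "__main__":
    for gib in (1, 8, 200):
        reserved = crashkernel_reservation(SPEC, gib << 30)
        print(f"{gib} GiB RAM -> {reserved >> 20} MiB reserved")
```

Under these assumptions, a machine with less than 2G of RAM matches no range, so no crash kernel memory would be reserved and kdump could not load.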

---Steps to Reproduce---
1. Install linux-crashdump and the debug kernel.
2. Edit the crashkernel command line in kdump-tools.cfg to the value above.
3. Run update-grub.
4. Reboot once.
5. Make sure kdump is enabled.
6. Run ppc64_cpu --smt=2
7. Log in to the HMC and trigger dumpstart:
chsysstate -r lpar -m <Server-name> -n <lpar-name> -o dumprestart

A soft lockup is observed when maxcpus=1 is used in kdump instead of nr_cpus=1. The dump is not taken and the kernel boot stops.

The full log is attached.
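
When reading a console log like the one attached, it can help to tally the oops, panic, and soft-lockup lines before looking at individual traces. The following is a minimal sketch, not a tool used in this report: the function name summarize_console_log and the patterns are assumptions based only on the message formats quoted above, and the sample lines are copied from this description.

```python
import re
from collections import Counter

# Patterns derived from the message formats quoted in this report.
PATTERNS = {
    "oops": re.compile(r"Oops: .*\[#\d+\]"),
    "panic": re.compile(r"Kernel panic - not syncing"),
    "soft_lockup": re.compile(r"soft lockup - CPU#\d+"),
}

SAMPLE = [
    "[ 0.004111] Oops: Exception in kernel mode, sig: 4 [#1]",
    "[ 0.004518] Oops: Exception in kernel mode, sig: 4 [#2]",
    "[ 3.949808] Kernel panic - not syncing: Fatal exception in interrupt",
    "[ 4.179808] ---[ end Kernel panic - not syncing: Fatal exception in interrupt",
    "[ 4076.016995] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [modprobe:207]",
]


def summarize_console_log(lines):
    """Count oops, panic, and soft-lockup events in console log lines."""
    counts = Counter()
    for line in lines:
        # Trailer lines ("---[ end ...") repeat the event text; skip them
        # so each event is counted once.
        if "---[ end" in line:
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break
    return counts


if __name__ == "__main__":
    for name, count in sorted(summarize_console_log(SAMPLE).items()):
        print(f"{name}: {count}")
```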

Expected:
The dump is taken and the system boots back to the host kernel.

== Comment: #4 - Hari Krishna Bathini <email address hidden> - 2018-06-11 06:22:57 ==
The below upstream patches should resolve this issue:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=04b9c96eae72
("powerpc/crash: Remove the test for cpu_online in the IPI callback")

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4388c9b3a6ee
("powerpc: Do not send system reset request through the oops path")

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4552d128c26e
("powerpc: System reset avoid interleaving oops using die synchronisation")

Thanks
Hari

bugproxy (bugproxy) wrote : kdump-log

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-159691 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: ppc64el-kdump triage-g
affects: linux (Ubuntu) → makedumpfile (Ubuntu)
Manoj Iyer (manjo)
Changed in makedumpfile (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Hari mentions linux commits, so I don't think there is anything for us to do on makedumpfile side here. I'll ask Joe to produce a kernel with those commits.

Cascardo.

Changed in linux (Ubuntu Artful):
importance: Undecided → High
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Artful):
status: New → In Progress
Manoj Iyer (manjo)
Changed in makedumpfile (Ubuntu Artful):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in makedumpfile (Ubuntu Artful):
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in ubuntu-power-systems:
status: New → In Progress
Changed in linux (Ubuntu):
status: New → Incomplete
status: Incomplete → Invalid
Joseph Salisbury (jsalisbury) wrote :

I built an Artful test kernel with the three patches posted in the description. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1776211

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is older than 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.
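
The rule in the note above can be written down as a small lookup. This is only an illustrative sketch of that note: the function name packages_for_test_kernel is hypothetical, and the version string is assumed to start with "major.minor" as in the uname output earlier in this report.

```python
def packages_for_test_kernel(kernel_version):
    """Return the .deb package names to install for a test kernel.

    kernel_version is a string such as '4.13.0-12-generic'; only the
    leading major.minor pair is examined.
    """
    major, minor = (int(part) for part in kernel_version.split(".")[:2])
    if (major, minor) < (4, 15):
        # Kernels prior to 4.15 (Bionic).
        return ["linux-image", "linux-image-extra"]
    # 4.15 (Bionic) or newer.
    return ["linux-modules", "linux-modules-extra", "linux-image-unsigned"]


if __name__ == "__main__":
    print(packages_for_test_kernel("4.13.0-12-generic"))
```

For the Artful 4.13 kernel in this bug, the first branch applies.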

Thanks in advance!

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-06-12 11:38 EDT-------
FYI,
This fix is already included in Bionic via Launchpad bug 1758206.

Changed in makedumpfile (Ubuntu):
status: New → Incomplete
Changed in makedumpfile (Ubuntu Artful):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :
description: updated
bugproxy (bugproxy)
tags: added: targetmilestone-inin1804
removed: targetmilestone-inin---
Andrew Cloke (andrew-cloke) wrote :

Looking at the kernel SRU mail archives, this has been NACK'ed for Artful as Artful will EOL shortly.

Changed in linux (Ubuntu Artful):
status: In Progress → Invalid
Changed in makedumpfile (Ubuntu):
status: Incomplete → Invalid
Changed in makedumpfile (Ubuntu Artful):
status: Incomplete → Invalid
Changed in ubuntu-power-systems:
status: In Progress → Invalid
Brad Figg (brad-figg)
tags: added: cscc