kdump fails to take dump with smt set to 2, hmc dumpstart
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
The Ubuntu-power-systems project | Invalid | High | Canonical Kernel Team |
linux (Ubuntu) | Invalid | High | Canonical Kernel Team |
Artful | Invalid | High | Unassigned |
makedumpfile (Ubuntu) | Invalid | High | Canonical Kernel Team |
Artful | Invalid | High | Canonical Kernel Team |
Bug Description
== SRU Justification ==
IBM has requested these three commits in Artful. In Artful, kdump fails to
capture a dump when SMT is set to 2 or off.
Including these three commits allows kdump to work properly.
== Fixes ==
4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path")
04b9c96eae72 ("powerpc/crash: Remove the test for cpu_online in the IPI callback")
4552d128c26e ("powerpc: System reset avoid interleaving oops using die synchronisation")
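Backported commits usually get new hashes in a distribution tree, so one way to confirm that the three fixes are present is to match on their subject lines instead of the upstream hashes. The following is a minimal sketch, not the procedure used for this SRU: the function name `missing_fixes` is hypothetical, and it assumes it is fed the output of `git log --format=%s` from the kernel tree being checked.

```shell
#!/bin/sh
# Hypothetical helper: reads commit subjects (one per line) on stdin and
# prints each of the three fix subjects that is NOT present in the input.
missing_fixes() {
    subjects=$(cat)
    for want in \
        "powerpc: Do not send system reset request through the oops path" \
        "powerpc/crash: Remove the test for cpu_online in the IPI callback" \
        "powerpc: System reset avoid interleaving oops using die synchronisation"
    do
        # -F: fixed string, -x: whole-line match, -q: no output
        printf '%s\n' "$subjects" | grep -Fxq -- "$want" || printf '%s\n' "$want"
    done
}

# Usage (run inside the kernel git tree being checked):
#   git log --format=%s | missing_fixes
```

Empty output means all three subjects were found; any printed line names a fix that is still missing.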
== Regression Potential ==
Low. Fixes are limited to powerpc.
== Test Case ==
A test kernel was built with these patches and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.
---Problem Description---
kdump fails to capture a dump with SMT set to 2 when the dump is triggered with hmc dumpstart
---Issue observed---
[ 0.004111] Oops: Exception in kernel mode, sig: 4 [#1]
[ 0.004118] SMP NR_CPUS=2048
[ 0.004120] NUMA
[ 0.004125] pSeries
[ 0.004132] Modules linked in:
[ 0.004142] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-12-generic #13-Ubuntu
[ 0.004153] task: c000000046715900 task.stack: c000000046134000
[ 0.004162] NIP: c000000000006468 LR: c00000000801764c CTR: 00000000006cdc70
[ 0.004173] REGS: c000000047fe3ce0 TRAP: 0700 Not tainted (4.13.0-12-generic)
[ 0.004181] MSR: 8000000000081031 <SF,ME,IR,DR,LE>
[ 0.004193] CR: 88042222 XER: 20000003
[ 0.004204] CFAR: c000000000006454 SOFTE: 0
[ 0.004204] GPR00: c00000000801764c c000000047fe3f60 c0000000095e3000 0000000000000000
[ 0.004204] GPR04: 0000000000000001 0000000000000002 ffffffffffffffff ffffffffffffffdf
[ 0.004204] GPR08: 0000000000000000 0000000028042222 0000000000000002 0000000000000002
[ 0.004204] GPR12: 0000000000000000 c00000000fff0000 c000000046137f90 000000000b5452d8
[ 0.004204] GPR16: fffffffffffffffd 00000000089ffd10 0000000001360000 000000000b55d378
[ 0.004204] GPR20: 0000000000000060 000000001eca0000 000000000a6c0000 0000000000000007
[ 0.004204] GPR24: 0000000000000000 0000000000000000 c000000009621ed0 0000000000000000
[ 0.004204] GPR28: 0000000000000000 c000000046134000 c000000046137c80 c000000009105df8
[ 0.004328] NIP [c000000000006468] 0xc000000000006468
[ 0.004338] LR [c00000000801764c] __do_irq+0x4c/0x1c0
[ 0.004345] Call Trace:
[ 0.004354] [c000000047fe3f60] [c00000000801764c] __do_irq+0x4c/0x1c0 (unreliable)
[ 0.004368] [c000000047fe3f90] [c00000000802ab70] call_do_
[ 0.004380] [c000000046137bc0] [c00000000801785c] do_IRQ+0x9c/0x130
[ 0.004393] [c000000046137c10] [c000000008008ac4] hardware_
[ 0.004409] --- interrupt: 501 at arch_local_
[ 0.004409] LR = arch_local_
[ 0.004423] [c000000046137f00] [0000000000000005] 0x5 (unreliable)
[ 0.004436] [c000000046137f20] [c000000008049824] start_secondary
[ 0.004450] [c000000046137f90] [c00000000800aa6c] start_secondary
[ 0.004460] Instruction dump:
[ 0.004467] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 0.004484] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 0.004506] ---[ end trace 3e5a2a9047ef3cd0 ]---
[ 0.004512]
[ 0.004518] Oops: Exception in kernel mode, sig: 4 [#2]
[ 0.004525] SMP NR_CPUS=2048
[ 0.004526] NUMA
[ 0.004532] pSeries
[ 0.004540] Modules linked in:
[ 0.004550] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G D 4.13.0-12-generic #13-Ubuntu
[ 0.004561] task: c000000009579f00 task.stack: c0000000095dc000
[ 0.004569] NIP: c000000000006460 LR: c0000000080b6e80 CTR: 0000000000000000
[ 0.004580] REGS: c0000000095dfb20 TRAP: 0700 Tainted: G D (4.13.0-12-generic)
[ 0.004589] MSR: 8000000000081031 <SF,ME,IR,DR,LE>
[ 0.004599] CR: 22002228 XER: 20000004
[ 0.004611] CFAR: c00000000000493c SOFTE: 0
[ 0.004611] GPR00: 0000000000000000 c0000000095dfda0 c0000000095e3000 0000000000000000
[ 0.004611] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.004611] GPR08: 0000000000000000 0000000022002228 000000007fffffff 0000000000000008
[ 0.004611] GPR12: 000000000000ffff c00000000fff0a80 c000000c7e137f90 0000000009980600
[ 0.004611] GPR16: 000000001ec70000 0000000000000001 0000000000000000 0000000000000000
[ 0.004611] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000007
[ 0.004611] GPR24: 0000000000000008 c000000008000000 0000000008000000 0000000000000000
[ 0.004611] GPR28: 0000000000000000 0000000000000008 c000000009621ed0 c000000009622354
[ 0.004729] NIP [c000000000006460] 0xc000000000006460
[ 0.004739] LR [c0000000080b6e80] pseries_
[ 0.004746] Call Trace:
[ 0.004756] [c0000000095dfda0] [c0000000095dfe90] init_thread_
[ 0.004771] [c0000000095dfe00] [c00000000801e314] arch_cpu_
[ 0.004784] [c0000000095dfe30] [c000000008c6b92c] default_
[ 0.004798] [c0000000095dfe50] [c00000000815da14] do_idle+0x244/0x320
[ 0.004810] [c0000000095dfea0] [c00000000815dd28] cpu_startup_
[ 0.004823] [c0000000095dfed0] [c00000000800d2dc] rest_init+
[ 0.004835] [c0000000095dff00] [c000000008fe40fc] start_kernel+
[ 0.004848] [c0000000095dff90] [c00000000800ab7c] start_here_
[ 0.004857] Instruction dump:
[ 0.004864] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 0.004881] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 0.004899] ---[ end trace 3e5a2a9047ef3cd1 ]---
[ 0.004906]
[ 3.949808] Kernel panic - not syncing: Fatal exception in interrupt
[ 4.179808] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
When tried with maxcpus=1, the following is observed:
[ 3992.056997] Modules linked in: async_tx raid6_pq raid1 raid0 multipath linear ibmvscsi(+) crc32c_vpmsum
[ 3992.136992] CPU: 1 PID: 207 Comm: modprobe Not tainted 4.13.0-12-generic #13-Ubuntu
[ 3992.166991] task: c000000043719e00 task.stack: c0000000437c8000
[ 3992.206994] NIP: c0000000086d2530 LR: c0000000086d46f0 CTR: 0000000000000013
[ 3992.246996] REGS: c0000000437cb260 TRAP: 0901 Not tainted (4.13.0-12-generic)
[ 3992.276994] MSR: 800000000280b033 <SF,VEC,
[ 3992.306995] CR: 24844442 XER: 20000000
[ 3992.366993] CFAR: c0000000086d2570 SOFTE: 1
[ 3992.366993] GPR00: ffffffffffffff68 c0000000437cb4e0 c0000000095e3000 c000000043c67e80
[ 3992.366993] GPR04: c000000043c67e80 c000000043c6bc00 ffffffffffffffed 39077b9925c55abe
[ 3992.366993] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000060
[ 3992.366993] GPR12: ffffffffffffff00 c00000000fac0a80
[ 3992.546994] NIP [c0000000086d2530] mpihelp_
[ 3992.586990] LR [c0000000086d46f0] mpih_sqr_
[ 3992.606991] Call Trace:
[ 3992.617082] [c0000000437cb4e0] [c0000000086d48c4] mpih_sqr_
[ 3992.636996] [c0000000437cb560] [c0000000086d4844] mpih_sqr_
[ 3992.676996] [c0000000437cb5e0] [c0000000086d5778] mpi_powm+
[ 3992.716992] [c0000000437cb720] [c000000008619d40] _rsa_dec.
[ 3992.746996] [c0000000437cb760] [c00000000861a094] rsa_verify+
[ 3992.786994] [c0000000437cb7c0] [c00000000861af44] pkcs1pad_
[ 3992.856995] [c0000000437cb800] [c000000008631510] public_
[ 3992.896992] [c0000000437cb9a0] [c0000000086311d4] verify_
[ 3992.926997] [c0000000437cb9c0] [c000000008634690] pkcs7_validate_
[ 3992.976992] [c0000000437cba20] [c0000000082b2e30] verify_
[ 3993.036993] [c0000000437cbad0] [c0000000081c8414] mod_verify_
[ 3993.076996] [c0000000437cbb40] [c0000000081c5054] load_module+
[ 3993.116992] [c0000000437cbd30] [c0000000081c70b4] SyS_finit_
[ 3993.176992] [c0000000437cbe30] [c00000000800b184] system_
[ 3993.226990] Instruction dump:
[ 3993.237018] 39400000 7cc600d0 7cc607b4 7cc930f8 78c01f24 79290020 7c0c0378 39290001
[ 3993.336994] 7d2903a6 60000000 60000000 60420000 <7d6c0050> 38c60001 7cc607b4 7d25582a
[ 4028.156997] xor: measuring software checksum speed
[ 4029.376998] 8regs : 16.000 MB/sec
[ 4030.676992] 8regs_prefetch: 16.000 MB/sec
[ 4031.716993] 32regs : 16.000 MB/sec
[ 4032.886994] 32regs_prefetch: 16.000 MB/sec
[ 4034.256993] altivec : 16.000 MB/sec
[ 4034.316994] xor: using function: altivec (16.000 MB/sec)
[ 4076.016995] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [modprobe:207]
[ 4076.046994] Modules linked in: xor async_tx raid6_pq raid1 raid0 multipath linear ibmvscsi(+) crc32c_vpmsum
[ 4076.126994] CPU: 1 PID: 207 Comm: modprobe Tainted: G L 4.13.0-12-generic #13-Ubuntu
[ 4076.186995] task: c000000043719e00 task.stack: c0000000437c8000
[ 4076.226993] NIP: c0000000086d224c LR: c0000000086d4404 CTR: 0000000000000008
[ 4076.256991] REGS: c0000000437cb190 TRAP: 0901 Tainted: G L (4.13.0-12-generic)
[ 4076.286994] MSR: 800000000280b033 <SF,VEC,
[ 4076.326993] CR: 24884444 XER: 20000000
[ 4076.356998] CFAR: c0000000086d4400 SOFTE: 1
[ 4076.356998] GPR00: 5ebfd337ad53c297 c0000000437cb410 c0000000095e3000 c000000043c62910
[ 4076.356998] GPR04: c000000043c62800 fffffffffffffff8 00000000c68de1f2 0000000000000000
[ 4076.356998] GPR08: 761ab85da0153bf8 0000000000000008 0000000063cfb2b3 026231001e934591
[ 4076.356998] GPR12: 0000000000000038 c00000000fac0a80
[ 4076.556992] NIP [c0000000086d224c] mpihelp_
[ 4076.596990] LR [c0000000086d4404] mpih_sqr_
[ 4076.607012] Call Trace:
[ 4076.636994] [c0000000437cb410] [0000000000000901] 0x901 (unreliable)
[ 4076.676992] [c0000000437cb460] [c0000000086d4644] mpih_sqr_
[ 4076.736992] [c0000000437cb4e0] [c0000000086d4890] mpih_sqr_
[ 4076.756995] [c0000000437cb560] [c0000000086d4844] mpih_sqr_
[ 4076.816995] [c0000000437cb5e0] [c0000000086d5778] mpi_powm+
[ 4076.846996] [c0000000437cb720] [c000000008619d40] _rsa_dec.
[ 4076.896992] [c0000000437cb760] [c00000000861a094] rsa_verify+
[ 4076.946997] [c0000000437cb7c0] [c00000000861af44] pkcs1pad_
[ 4076.976996] [c0000000437cb800] [c000000008631510] public_
[ 4077.016993] [c0000000437cb9a0] [c0000000086311d4] verify_
[ 4077.046995] [c0000000437cb9c0] [c000000008634690] pkcs7_validate_
[ 4077.086997] [c0000000437cba20] [c0000000082b2e30] verify_
[ 4077.136995] [c0000000437cbad0] [c0000000081c8414] mod_verify_
[ 4077.196996] [c0000000437cbb40] [c0000000081c5054] load_module+
[ 4077.236996] [c0000000437cbd30] [c0000000081c70b4] SyS_finit_
[ 4077.286997] [c0000000437cbe30] [c00000000800b184] system_
[ 4077.337015] Instruction dump:
[ 4077.366995] 7ca507b4 78c60020 7ca928f8 78bf1f24 79290020 7ffdfb78 39290001 38e00000
[ 4077.426994] 7d2903a6 7b9c83e4 60000000 60000000 <60420000> 7d9df850 38a50001 7ca507b4
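The two failure signatures in the logs above (the "Oops:" entries followed by a kernel panic, and the "soft lockup" reports) can be counted in a saved console log with a few `grep` calls. This is a rough sketch only: the function name `summarise_console_log` and the idea of saving the console output to a file are assumptions, not part of the original report.

```shell
#!/bin/sh
# Hypothetical helper: summarise failure signatures in a saved console log.
# Usage: summarise_console_log /path/to/console.log
summarise_console_log() {
    log=$1
    # grep -c prints the number of matching lines (0 if none);
    # "|| true" keeps the zero-match exit status from aborting under "set -e".
    oops=$(grep -c 'Oops: ' "$log" || true)
    lockups=$(grep -c 'soft lockup' "$log" || true)
    # The leading "] " skips the matching "---[ end Kernel panic" line.
    panics=$(grep -Fc '] Kernel panic - not syncing' "$log" || true)
    printf 'oops=%s soft_lockups=%s panics=%s\n' "$oops" "$lockups" "$panics"
}
```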
Contact Information = <email address hidden>
---uname output---
Linux ltcalpine-lp9 4.13.0-12-generic #13-Ubuntu SMP Fri Sep 22 20:52:52 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux
Machine Type/Model = Power 8 pVM/8408-E8E
----Additional Info-----
# cat /proc/cmdline
BOOT_IMAGE=
---Steps to Reproduce---
1. Install linux-crashdump and the debug kernel.
2. Edit the crashkernel command line in kdump-tools.cfg to the value above.
3. Run update-grub.
4. Reboot once.
5. Make sure kdump is enabled.
6. Set SMT to 2: ppc64_cpu --smt=2
7. Log in to the HMC and trigger dumpstart:
chsysstate -r lpar -m <Server-name> -n <lpar-name> -o dumprestart
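Before triggering the dump from the HMC, steps 5 and 6 can be double-checked on the LPAR. The sketch below is one possible check, not something taken from the report: `crash_kernel_loaded` is a hypothetical helper name, and its optional argument exists only so the check can be pointed at a different file. It relies on `/sys/kernel/kexec_crash_loaded`, which reads `1` when a crash kernel has been loaded.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a crash (kdump) kernel is loaded.
# The optional argument overrides the sysfs path (useful for dry runs).
crash_kernel_loaded() {
    state_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$state_file" ] && [ "$(cat "$state_file")" = "1" ]
}

# Usage on the LPAR before running chsysstate on the HMC:
#   crash_kernel_loaded && echo "kdump armed" || echo "kdump NOT armed"
#   ppc64_cpu --smt        # prints the current SMT setting
```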
A soft lockup is observed when maxcpus=1 is used for the kdump kernel instead of nr_cpus=1. The dump is not taken and the kernel boot stops.
The full log is attached.
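The two options differ: `maxcpus=` limits how many processors the kernel brings up during boot, while `nr_cpus=` caps the number of processors the kernel supports at all. To see which one a given kernel command line carries, something like the following could be used. It is a minimal sketch; `cpu_limit_option` is a hypothetical name, and if both options are present it reports `nr_cpus`.

```shell
#!/bin/sh
# Hypothetical helper: classify the CPU-limiting option on a kernel
# command line passed as a single string.
# Prints "nr_cpus", "maxcpus", or "none".
cpu_limit_option() {
    # Pad with spaces so an option at the start or end still matches.
    case " $1 " in
        *" nr_cpus="*)  echo nr_cpus ;;
        *" maxcpus="*)  echo maxcpus ;;
        *)              echo none ;;
    esac
}

# Usage inside the running (for example, capture) kernel:
#   cpu_limit_option "$(cat /proc/cmdline)"
```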
Expected:
The dump is taken and the system boots back into the host kernel.
== Comment: #4 - Hari Krishna Bathini <email address hidden> - 2018-06-11 06:22:57 ==
The below upstream patches should resolve this issue:
https:/
("powerpc/crash: Remove the test for cpu_online in the IPI callback")
https:/
("powerpc: Do not send system reset request through the oops path")
https:/
("powerpc: System reset avoid interleaving oops using die synchronisation")
Thanks
Hari
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: ppc64el-kdump triage-g
affects: linux (Ubuntu) → makedumpfile (Ubuntu)
Changed in makedumpfile (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
Changed in linux (Ubuntu Artful):
status: New → In Progress
Changed in makedumpfile (Ubuntu Artful):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in makedumpfile (Ubuntu Artful):
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in ubuntu-power-systems:
status: New → In Progress
Changed in linux (Ubuntu):
status: New → Incomplete
status: Incomplete → Invalid
Changed in makedumpfile (Ubuntu):
status: New → Incomplete
Changed in makedumpfile (Ubuntu Artful):
status: New → Incomplete
tags: added: targetmilestone-inin1804 removed: targetmilestone-inin---
Changed in linux (Ubuntu Artful):
status: In Progress → Invalid
Changed in makedumpfile (Ubuntu):
status: Incomplete → Invalid
Changed in makedumpfile (Ubuntu Artful):
status: Incomplete → Invalid
Changed in ubuntu-power-systems:
status: In Progress → Invalid
tags: added: cscc