Ubuntu 18.04 [ WSP DD2.2 with stop4 and stop5 enabled ]: kdump fails to capture dump when smt=2 or off.

Bug #1758206 reported by bugproxy on 2018-03-23
Affects                            Importance  Assigned to
The Ubuntu-power-systems project   High        Canonical Kernel Team
linux (Ubuntu)                     High        Joseph Salisbury
Bionic                             High        Joseph Salisbury

Bug Description

---Problem Description---

Ubuntu 18.04 [ WSP DD2.2 with stop4 and stop5 enabled ]: kdump fails to capture dump when smt=2 or off.

---Environment--
Kernel Build: 4.15.0-13-generic
System Name : ltc-wspoon4
Model/Type : P9
Platform : BML

---Steps to reproduce--

1. Configure kdump.
2. Set smt=off
# ppc64_cpu --smt=off
3. Trigger a crash.
echo 1 > /proc/sys/kernel/sysrq
echo "c" > /proc/sysrq-trigger
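
The steps above can be sketched as a small script. This is only an illustration, not part of the original report: the `run` and `reproduce` helpers and the `DRY_RUN` variable are hypothetical names, and the script defaults to printing the commands because running them for real (as root) crashes the machine on purpose.

```shell
#!/bin/sh
# Sketch of the reproduction steps from this report.
# With DRY_RUN=1 (the default) each command is only printed.
# With DRY_RUN=0 the commands run for real and WILL crash the system.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        sh -c "$*"
    fi
}

reproduce() {
    run "ppc64_cpu --smt=off"             # step 2: switch SMT off
    run "echo 1 > /proc/sys/kernel/sysrq" # enable all sysrq functions
    run "echo c > /proc/sysrq-trigger"    # step 3: trigger the crash
}
```

Calling `reproduce` with the default settings lists the three commands; setting `DRY_RUN=0` first would execute them, assuming kdump is already configured as in step 1.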

---Logs----

root@ltc-wspoon4:~# dpkg -l|grep kexec
ii kexec-tools 1:2.0.16-1ubuntu1 ppc64el tools to support fast kexec reboots
root@ltc-wspoon4:~# makedumpfile -v
makedumpfile: version 1.6.3 (released on 29 Jan 2018)
lzo enabled
snappy disabled

[ 285.519832] [c000001fe2d83de0] [c0000000003d1898] SyS_write+0x68/0x110
[ 285.519926] [c000001fe2d83e30] [c00000000000b184] system_call+0x58/0x6c
[ 285.520007] Instruction dump:
[ 285.520053] 4bfff9f1 4bfffe50 3c4c00f0 3842e800 7c0802a6 60000000 39200001 3d42001c
[ 285.520158] 394a6db0 912a0000 7c0004ac 39400000 <992a0000> 4e800020 3c4c00f0 3842e7d0
[ 285.520261] ---[ end trace 90a666dc7ca6f0ec ]---
[ 286.525787]
[ 286.525883] Sending IPI to other CPUs
[ 286.851284] IPI complete
[ 401.296284048,5] OPAL: Switch to big-endian OS
[ 402.297026662,3] OPAL: CPU 0x1 not in OPAL !
[ 403.455520784,3] OPAL: CPU 0x1 not in OPAL !
[ 403.455569636,5] OPAL: Switch to little-endian OS
[ 404.455711332,3] OPAL: CPU 0x1 not in OPAL !
[ 404.470276386,3] PHB#0000[0:0]: CRESET: Unexpected slot state 00000102, resetting...
[ 413.140065625,3] PHB#0003[0:3]: CRESET: Unexpected slot state 00000102, resetting...
[ 421.393193605,3] PHB#0030[8:0]: CRESET: Unexpected slot state 00000102, resetting...
[ 423.353977316,3] PHB#0033[8:3]: CRESET: Unexpected slot state 00000102, resetting...
[ 425.314547966,3] PHB#0034[8:4]: CRESET: Unexpected slot state 00000102, resetting...

[ 5.004718] Processor 1 is stuck.
[ 10.007584] Processor 2 is stuck.
[ 15.010425] Processor 3 is stuck.
[ 16.135550] integrity: Unable to open file: /etc/keys/x509_ima.der (-2)
[ 16.135554] integrity: Unable to open file: /etc/keys/x509_evm.der (-2)
[ 16.250952] vio vio: uevent: failed to send synthetic uevent

--== Welcome to Hostboot hostboot-5fc3b52/hbicore.bin ==--

  4.52180|secure|SecureROM valid - enabling functionality
  4.53193|secure|Booting in non-secure mode.
  6.00924|Booting from SBE side 0 on master proc=00050000

There may be a firmware issue here as well, but the kernel patches below are
still needed to ensure that the kdump kernel captures the dump successfully
when SMT is set to 2 or off.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=04b9c96eae72d862726f2f4bfcec2078240c33c5
("powerpc/crash: Remove the test for cpu_online in the IPI callback")

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4145f358644b970fcff293c09fdcc7939e8527d2
("powernv/kdump: Fix cases where the kdump kernel can get HMI's")

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=910961754572a2f4c83ad7e610d180
("powerpc/kdump: Fix powernv build break when KEXEC_CORE=n")

Thanks
Hari
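
One way to check whether a given kernel package already carries the three patches listed above is to search its changelog for the patch subjects. Distribution backports are usually cherry-picks whose commit IDs differ from the upstream ones, so matching on the subject line is a reasonable approach. The sketch below is an illustration only: `missing_subjects` is a hypothetical helper, and it assumes the changelog lists each backported patch by its upstream subject.

```shell
#!/bin/sh
# Print each patch subject from the bug description that is NOT found
# in the changelog text supplied on stdin.
missing_subjects() {
    # Collapse all whitespace so subjects wrapped across lines still match.
    changelog="$(tr -s '[:space:]' ' ')"
    while IFS= read -r subject; do
        printf '%s\n' "$changelog" | grep -qF -- "$subject" \
            || printf '%s\n' "$subject"
    done <<'EOF'
powerpc/crash: Remove the test for cpu_online in the IPI callback
powernv/kdump: Fix cases where the kdump kernel can get HMI's
powerpc/kdump: Fix powernv build break when KEXEC_CORE=n
EOF
}
```

Feeding it a changelog (for example through `zcat` on the package's compressed changelog file) prints nothing when all three subjects are present.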

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-165948 severity-high targetmilestone-inin1804
bugproxy (bugproxy) wrote : sosreport

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g
Changed in ubuntu-power-systems:
status: New → Triaged

------- Comment From <email address hidden> 2018-03-25 21:18 EDT-------
Can we get a patched test kernel to try this out?

Changed in linux (Ubuntu):
status: New → In Progress
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
importance: Undecided → High
Joseph Salisbury (jsalisbury) wrote :

I built a Bionic test kernel with the three commits mentioned in the bug description. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1758206

Can you test this kernel and see if it resolves this bug?

Note, to test this kernel, you need to install both the linux-image and linux-image-extra .deb packages.

Thanks in advance!
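
Since both packages must be installed together, a quick check before running `dpkg -i` can catch an incomplete download. The following is a sketch only: `has_both_debs` is a hypothetical helper, and the `linux-image-*.deb` and `linux-image-extra-*.deb` file name patterns are assumptions based on the package names mentioned above.

```shell
#!/bin/sh
# Return 0 if the directory holds both a linux-image and a
# linux-image-extra .deb file (file name patterns are assumptions).
has_both_debs() {
    dir="$1"
    image=0
    extra=0
    for f in "$dir"/linux-image-*.deb; do
        [ -e "$f" ] || continue   # glob did not match anything
        case "${f##*/}" in
            linux-image-extra-*) extra=1 ;;
            *) image=1 ;;
        esac
    done
    [ "$image" -eq 1 ] && [ "$extra" -eq 1 ]
}
```

A possible use is `has_both_debs . && sudo dpkg -i ./linux-image-*.deb`, run from the directory holding the downloaded files.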

Changed in ubuntu-power-systems:
status: Triaged → In Progress
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-03-29 11:31 EDT-------
(In reply to comment #10)
> I built a Bionic test kernel with the three commits mentioned in the bug
> description. The test kernel can be downloaded from:
> http://kernel.ubuntu.com/~jsalisbury/lp1758206
>
> Can you test this kernel and see if it resolves this bug?
>
> Note, to test this kernel, you need to install both the linux-image and
> linux-image-extra .deb packages.
>
> Thanks in advance!

Tried with the given kernel; kexec still failed. Please find the logs below.

root@ltc-wspoon4:~# ppc64_cpu --smt
SMT is off
root@ltc-wspoon4:~# kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
/var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.15.0-12-generic
kdump initrd:
/var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.15.0-12-generic
current state: ready to kdump

kexec command:
/sbin/kexec -p --command-line="root=UUID=0266024d-8ea3-4132-ad62-b49befd6f8d9 ro quiet splash nr_cpus=1 systemd.unit=kdump-tools.service irqpoll noirqdistrib nousb" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
root@ltc-wspoon4:~# echo "c" > /proc/sysrq-trigger
[ 951.567597] sysrq: SysRq : This sysrq operation is disabled.
root@ltc-wspoon4:~# echo 1 > /proc/sys/kernel/sysrq
root@ltc-wspoon4:~# echo "c" > /proc/sysrq-trigger
[ 968.396522] sysrq: SysRq : Trigger a crash
[ 968.396558] Unable to handle kernel paging request for data at address 0x00000000
[ 968.396602] Faulting instruction address: 0xc0000000007ec768
[ 968.396640] Oops: Kernel access of bad area, sig: 11 [#1]
[ 968.396670] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 968.396703] Modules linked in: idt_89hpesx(E) at24 uio_pdrv_genirq ofpart cmdlinepart powernv_flash mtd uio ibmpowernv ipmi_powernv vmx_crypto ipmi_devintf ipmi_msghandler opal_prd crct10dif_vpmsum sch_fq_codel ip_tables x_tables autofs4 ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_vpmsum drm tg3 libahci
[ 968.396893] CPU: 28 PID: 3086 Comm: bash Tainted: G E 4.15.0-12-generic #13~lp1758206
[ 968.396944] NIP: c0000000007ec768 LR: c0000000007ed6a8 CTR: c0000000007ec740
[ 968.396989] REGS: c0000000054fb9f0 TRAP: 0300 Tainted: G E (4.15.0-12-generic)
[ 968.397040] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28222222 XER: 20040000
[ 968.397090] CFAR: c0000000007ed6a4 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 1
[ 968.397090] GPR00: c0000000007ed6a8 c0000000054fbc70 c0000000016eaf00 0000000000000063
[ 968.397090] GPR04: c000001ff76bce18 c000001ff76d4368 9000000000009033 000000000000000a
[ 968.397090] GPR08: 0000000000000007 0000000000000001 0000000000000000 9000000000001003
[ 968.397090] GPR12: c0000000007ec740 c000000007a33400 00000a463c88ae48 0000000000000000
[ 968.397090] GPR16: 00000a462439e9f0 00000a4624431998 00000a46244319d0 00000a4624468204
[ 968.397090] GPR20: 0000000000000000 0000000000000001 0000000000000000 00007ffff9ecd164
[ 968.397090] GPR24: 00007ffff9ecd160 00000a462446afc4 c0000000015e9968 0000000000000002
[ 968.397090] GPR28: 0...

Joseph Salisbury (jsalisbury) wrote :

Can you confirm the appropriate test kernel was booted with 'uname -a'? You should see the text 'lp1758206' in the kernel name.
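
The check can be scripted as below. This is a minimal sketch; `is_test_kernel` is a hypothetical helper name, and it simply looks for the `lp1758206` tag in whatever string it is given.

```shell
#!/bin/sh
# Return 0 if the given kernel identification string carries the
# 'lp1758206' tag that marks the test build.
is_test_kernel() {
    case "$1" in
        *lp1758206*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_test_kernel "$(uname -a)" && echo "test kernel booted"` prints the message only when the running kernel is the test build.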

Also, could more patches be required beyond the three that were posted in the description?

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=04b9c96eae72d862726f2f4bfcec2078240c33c5
("powerpc/crash: Remove the test for cpu_online in the IPI callback")

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4145f358644b970fcff293c09fdcc7939e8527d2
("powernv/kdump: Fix cases where the kdump kernel can get HMI's")

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=910961754572a2f4c83ad7e610d180
("powerpc/kdump: Fix powernv build break when KEXEC_CORE=n")

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-03-30 01:14 EDT-------
Tested again with the given kernel; dump capture is successful with smt=2 and smt=off.

Sorry for the wrong update in the previous comment; I am not sure what I missed yesterday.

root@ltc-wspoon4:~# uname -a
Linux ltc-wspoon4 4.15.0-12-generic #13~lp1758206 SMP Tue Mar 27 15:20:59 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@ltc-wspoon4:~# ppc64_cpu --smt=off
root@ltc-wspoon4:~#
root@ltc-wspoon4:~# echo 1 > /proc/sys/kernel/sysrq
root@ltc-wspoon4:~# echo "c" > /proc/sysrq-trigger
[ 1424.806117] sysrq: SysRq : Trigger a crash
[ 1424.806163] Unable to handle kernel paging request for data at address 0x00000000
[ 1424.806267] Faulting instruction address: 0xc0000000007ec768
[ 1424.806352] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1424.806424] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 1424.806483] Modules linked in: idt_89hpesx(E) at24 ofpart uio_pdrv_genirq cmdlinepart powernv_flash uio mtd opal_prd ipmi_powernv ipmi_devintf ibmpowernv vmx_crypto ipmi_msghandler crct10dif_vpmsum sch_fq_codel ip_tables x_tables autofs4 ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_vpmsum drm tg3 libahci
[ 1424.806828] CPU: 0 PID: 3110 Comm: bash Tainted: G E 4.15.0-12-generic #13~lp1758206
[ 1424.806963] NIP: c0000000007ec768 LR: c0000000007ed6a8 CTR: c0000000007ec740
[ 1424.807075] REGS: c000001fce3d39f0 TRAP: 0300 Tainted: G E (4.15.0-12-generic)
[ 1424.807211] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28222222 XER: 20040000
[ 1424.807325] CFAR: c0000000007ed6a4 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 1
[ 1424.807325] GPR00: c0000000007ed6a8 c000001fce3d3c70 c0000000016eaf00 0000000000000063
[ 1424.807325] GPR04: c000001ff6fbce18 c000001ff6fd4368 9000000000009033 000000000000000a
[ 1424.807325] GPR08: 0000000000000007 0000000000000001 0000000000000000 9000000000001003
[ 1424.807325] GPR12: c0000000007ec740 c000000007a20000 000006127f00ae48 0000000000000000
[ 1424.807325] GPR16: 000006124f78e9f0 000006124f821998 000006124f8219d0 000006124f858204
[ 1424.807325] GPR20: 0000000000000000 0000000000000001 0000000000000000 00007fffd6e57524
[ 1424.807325] GPR24: 00007fffd6e57520 000006124f85afc4 c0000000015e9968 0000000000000002
[ 1424.807325] GPR28: 0000000000000063 0000000000000004 c000000001572a9c c0000000015e9d08
[ 1424.808272] NIP [c0000000007ec768] sysrq_handle_crash+0x28/0x30
[ 1424.808364] LR [c0000000007ed6a8] __handle_sysrq+0xf8/0x2c0
[ 1424.808417] Call Trace:
[ 1424.808468] [c000001fce3d3c70] [c0000000007ed688] __handle_sysrq+0xd8/0x2c0 (unreliable)
[ 1424.808582] [c000001fce3d3d10] [c0000000007edeb4] write_sysrq_trigger+0x64/0x90
[ 1424.808690] [c000001fce3d3d40] [c00000000047dfe8] proc_reg_write+0x88/0xd0
[ 1424.808782] [c000001fce3d3d70] [c0000000003d131c] __vfs_write+0x3c/0x70
[ 1424.808875] [c000001fce3d3d90] [c0000000003d1578] vfs_write+0xd8/0x220
[ 1424.808957] [c000001fce3d3de0] [c0000000003d1898] SyS_write+0x68/0x110
[ 1424.809038] [c000001fce3d3e30] [c00000000000b184] system_call+0x58/0x6c
[ 1424.809139] Instruction dump:
[ 1424.809191] 4bfff9f1 4bfffe50 3c4c00f0 3842e7c0 7c0802a6 60000000 ...

Seth Forshee (sforshee) on 2018-03-30
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-05 07:34 EDT-------
The issue is resolved in the 4.15.0-15-generic kernel.

root@ltc-wspoon4:~# ppc64_cpu --smt
SMT is off

Starting Kernel crash dump capture service...
[ 11.747657] kdump-tools[952]: Starting kdump-tools: * running makedumpfile -c -d 31 /proc/vmcore /var/crash/201804050626/dump-incomplete
Copying data : [100.0 %] \ eta: 0s
[ 27.390223] kdump-tools[952]: The kernel version is not supported.
[ 27.390438] kdump-tools[952]: The makedumpfile operation may be incomplete.
[ 27.390563] kdump-tools[952]: The dumpfile is saved to /var/crash/201804050626/dump-incomplete.
[ 27.390726] kdump-tools[952]: makedumpfile Completed.
[ 27.405543] kdump-tools[952]: * kdump-tools: saved vmcore in /var/crash/201804050626
[ 30.762418] kdump-tools[952]: * running makedumpfile --dump-dmesg /proc/vmcore /var/crash/201804050626/dmesg.201804050626
[ 30.802776] kdump-tools[952]: The kernel version is not supported.
[ 30.802923] kdump-tools[952]: The makedumpfile operation may be incomplete.
[ 30.803025] kdump-tools[952]: The dmesg log is saved to /var/crash/201804050626/dmesg.201804050626.
[ 30.803145] kdump-tools[952]: makedumpfile Completed.
[ 30.803263] kdump-tools[952]: * kdump-tools: saved dmesg content in /var/crash/201804050626
[ 30.888353] kdump-tools[952]: Thu, 05 Apr 2018 06:26:24 -0500
[ 31.035631] kdump-tools[952]: Rebooting.
[ 31.126613] reboot: Restarting system
[ 1577.265030518,5] OPAL: Reboot request...

root@ltc-wspoon4:~# ppc64_cpu --smt
SMT=2

Starting Kernel crash dump capture service...
[ 13.378626] kdump-tools[952]: Starting kdump-tools: * running makedumpfile -c -d 31 /proc/vmcore /var/crash/201804050631/dump-incomplete
Copying data : [100.0 %] | eta: 0s
[ 27.102530] kdump-tools[952]: The kernel version is not supported.
[ 27.102659] kdump-tools[952]: The makedumpfile operation may be incomplete.
[ 27.102787] kdump-tools[952]: The dumpfile is saved to /var/crash/201804050631/dump-incomplete.
[ 27.102910] kdump-tools[952]: makedumpfile Completed.
[ 27.112064] kdump-tools[952]: * kdump-tools: saved vmcore in /var/crash/201804050631
[ 29.632162] kdump-tools[952]: * running makedumpfile --dump-dmesg /proc/vmcore /var/crash/201804050631/dmesg.201804050631
[ 29.672730] kdump-tools[952]: The kernel version is not supported.
[ 29.672890] kdump-tools[952]: The makedumpfile operation may be incomplete.
[ 29.672997] kdump-tools[952]: The dmesg log is saved to /var/crash/201804050631/dmesg.201804050631.
[ 29.673111] kdump-tools[952]: makedumpfile Completed.
[ 29.673249] kdump-tools[952]: * kdump-tools: saved dmesg content in /var/crash/201804050631
[ 29.774672] kdump-tools[952]: Thu, 05 Apr 2018 06:31:40 -0500
[ 29.913780] kdump-tools[952]: Rebooting.
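
When checking several runs like the two above, the directories that received a dump can be pulled out of a captured console log. The sketch below is an illustration only: `saved_vmcore_dirs` is a hypothetical helper, and it relies on the "saved vmcore in" message wording seen in the logs in this report.

```shell
#!/bin/sh
# Read console or journal text on stdin and print every directory that
# kdump-tools reported saving a vmcore into.
saved_vmcore_dirs() {
    sed -n 's/.*kdump-tools: saved vmcore in \(.*\)$/\1/p'
}
```

Piping a saved console log through `saved_vmcore_dirs` prints one directory per successful capture and nothing if no dump was saved.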

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-15.16

---------------
linux (4.15.0-15.16) bionic; urgency=medium

  * linux: 4.15.0-15.16 -proposed tracker (LP: #1761177)

  * FFe: Enable configuring resume offset via sysfs (LP: #1760106)
    - PM / hibernate: Make passing hibernate offsets more friendly

  * /dev/bcache/by-uuid links not created after reboot (LP: #1729145)
    - SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent

  * Ubuntu18.04:POWER9:DD2.2 - Unable to start a KVM guest with default machine
    type(pseries-bionic) complaining "KVM implementation does not support
    Transactional Memory, try cap-htm=off" (kvm) (LP: #1752026)
    - powerpc: Use feature bit for RTC presence rather than timebase presence
    - powerpc: Book E: Remove unused CPU_FTR_L2CSR bit
    - powerpc: Free up CPU feature bits on 64-bit machines
    - powerpc: Add CPU feature bits for TM bug workarounds on POWER9 v2.2
    - powerpc/powernv: Provide a way to force a core into SMT4 mode
    - KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
    - KVM: PPC: Book3S HV: Work around XER[SO] bug in fake suspend mode
    - KVM: PPC: Book3S HV: Work around TEXASR bug in fake suspend state

  * Important Kernel fixes to be backported for Power9 (kvm) (LP: #1758910)
    - powerpc/mm: Fixup tlbie vs store ordering issue on POWER9

  * Ubuntu 18.04 - IO Hang on some namespaces when running HTX with 16
    namespaces (Bolt / NVMe) (LP: #1757497)
    - powerpc/64s: Fix lost pending interrupt due to race causing lost update to
      irq_happened

  * fwts-efi-runtime-dkms 18.03.00-0ubuntu1: fwts-efi-runtime-dkms kernel module
    failed to build (LP: #1760876)
    - [Packaging] include the retpoline extractor in the headers

linux (4.15.0-14.15) bionic; urgency=medium

  * linux: 4.15.0-14.15 -proposed tracker (LP: #1760678)

  * [Bionic] mlx4 ETH - mlnx_qos failed when set some TC to vendor
    (LP: #1758662)
    - net/mlx4_en: Change default QoS settings

  * AT_BASE_PLATFORM in AUXV is absent on kernels available on Ubuntu 17.10
    (LP: #1759312)
    - powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features

  * Bionic update to 4.15.15 stable release (LP: #1760585)
    - net: dsa: Fix dsa_is_user_port() test inversion
    - openvswitch: meter: fix the incorrect calculation of max delta_t
    - qed: Fix MPA unalign flow in case header is split across two packets.
    - tcp: purge write queue upon aborting the connection
    - qed: Fix non TCP packets should be dropped on iWARP ll2 connection
    - sysfs: symlink: export sysfs_create_link_nowarn()
    - net: phy: relax error checking when creating sysfs link netdev->phydev
    - devlink: Remove redundant free on error path
    - macvlan: filter out unsupported feature flags
    - net: ipv6: keep sk status consistent after datagram connect failure
    - ipv6: old_dport should be a __be16 in __ip6_datagram_connect()
    - ipv6: sr: fix NULL pointer dereference when setting encap source address
    - ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state
    - mlxsw: spectrum_buffers: Set a minimum quota for CPU port traffic
    - net: phy: Tell caller result ...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Manoj Iyer (manjo) on 2018-04-23
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released