Ubuntu 18.04 [ WSP DD2.2 with stop4 and stop5 enabled ]: kdump fails to capture dump when smt=2 or off.

Bug #1758206 reported by bugproxy
This bug affects 1 person
Affects                            Status        Importance  Assigned to            Milestone
The Ubuntu-power-systems project   Fix Released  High        Canonical Kernel Team
linux (Ubuntu)                     Fix Released  High        Joseph Salisbury
Bionic                             Fix Released  High        Joseph Salisbury

Bug Description

---Problem Description---

Ubuntu 18.04 [ WSP DD2.2 with stop4 and stop5 enabled ]: kdump fails to capture dump when smt=2 or off.

---Environment---
Kernel Build: 4.15.0-13-generic
System Name : ltc-wspoon4
Model/Type : P9
Platform : BML

---Steps to reproduce---

1. Configure kdump.
2. Set SMT to off:
# ppc64_cpu --smt=off
3. Trigger a crash:
# echo 1 > /proc/sys/kernel/sysrq
# echo "c" > /proc/sysrq-trigger
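
The steps above can be sketched as a small script. This is only an illustration, assuming kdump is already configured and the commands run as root on a disposable test machine. The names `run`, `reproduce_kdump_smt` and `DRY_RUN` are hypothetical and not part of any tool. By default the sketch prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the reproduction steps. DRY_RUN=1 (the default) only prints the
# commands; DRY_RUN=0 really executes them and will crash the machine.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        sh -c "$*"
    fi
}

# Hypothetical helper: $1 is the SMT setting, "off" or "2" in this report.
reproduce_kdump_smt() {
    smt_mode="${1:-off}"
    run "ppc64_cpu --smt=$smt_mode"          # step 2: change the SMT mode
    run "echo 1 > /proc/sys/kernel/sysrq"    # step 3: enable all SysRq functions
    run "echo c > /proc/sysrq-trigger"       # step 3: trigger the crash
}
```

Calling `reproduce_kdump_smt 2` in dry-run mode lists the three commands for the smt=2 case without touching the system.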

---Logs---

root@ltc-wspoon4:~# dpkg -l|grep kexec
ii kexec-tools 1:2.0.16-1ubuntu1 ppc64el tools to support fast kexec reboots
root@ltc-wspoon4:~# makedumpfile -v
makedumpfile: version 1.6.3 (released on 29 Jun 2018)
lzo enabled
snappy disabled

[ 285.519832] [c000001fe2d83de0] [c0000000003d1898] SyS_write+0x68/0x110
[ 285.519926] [c000001fe2d83e30] [c00000000000b184] system_call+0x58/0x6c
[ 285.520007] Instruction dump:
[ 285.520053] 4bfff9f1 4bfffe50 3c4c00f0 3842e800 7c0802a6 60000000 39200001 3d42001c
[ 285.520158] 394a6db0 912a0000 7c0004ac 39400000 <992a0000> 4e800020 3c4c00f0 3842e7d0
[ 285.520261] ---[ end trace 90a666dc7ca6f0ec ]---
[ 286.525787]
[ 286.525883] Sending IPI to other CPUs
[ 286.851284] IPI complete
[ 401.296284048,5] OPAL: Switch to big-endian OS
[ 402.297026662,3] OPAL: CPU 0x1 not in OPAL !
[ 403.455520784,3] OPAL: CPU 0x1 not in OPAL !nce.
[ 403.455569636,5] OPAL: Switch to little-endian OS
[ 404.455711332,3] OPAL: CPU 0x1 not in OPAL !
[ 404.470276386,3] PHB#0000[0:0]: CRESET: Unexpected slot state 00000102, resetting...
[ 413.140065625,3] PHB#0003[0:3]: CRESET: Unexpected slot state 00000102, resetting...
[ 421.393193605,3] PHB#0030[8:0]: CRESET: Unexpected slot state 00000102, resetting...
[ 423.353977316,3] PHB#0033[8:3]: CRESET: Unexpected slot state 00000102, resetting...
[ 425.314547966,3] PHB#0034[8:4]: CRESET: Unexpected slot state 00000102, resetting...

[ 5.004718] Processor 1 is stuck.
[ 10.007584] Processor 2 is stuck.
[ 15.010425] Processor 3 is stuck.
[ 16.135550] integrity: Unable to open file: /etc/keys/x509_ima.der (-2)
[ 16.135554] integrity: Unable to open file: /etc/keys/x509_evm.der (-2)
[ 16.250952] vio vio: uevent: failed to send synthetic uevent

--== Welcome to Hostboot hostboot-5fc3b52/hbicore.bin ==--

  4.52180|secure|SecureROM valid - enabling functionality
  4.53193|secure|Booting in non-secure mode.
  6.00924|Booting from SBE side 0 on master proc=00050000

There may also be a firmware issue here, but the kernel patches below are
still needed to ensure that the kdump kernel captures the dump successfully
when SMT is set to 2 or off:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=04b9c96eae72d862726f2f4bfcec2078240c33c5
("powerpc/crash: Remove the test for cpu_online in the IPI callback")

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4145f358644b970fcff293c09fdcc7939e8527d2
("powernv/kdump: Fix cases where the kdump kernel can get HMI's")

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=910961754572a2f4c83ad7e610d180
("powerpc/kdump: Fix powernv build break when KEXEC_CORE=n")

Thanks
Hari

bugproxy (bugproxy) wrote : console log

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-165948 severity-high targetmilestone-inin1804
bugproxy (bugproxy) wrote :

Default Comment by Bridge

bugproxy (bugproxy) wrote : sosreport

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g
Changed in ubuntu-power-systems:
status: New → Triaged
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-03-25 21:18 EDT-------
Can we get a patched kernel to try this out?

Changed in linux (Ubuntu):
status: New → In Progress
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
importance: Undecided → High
Joseph Salisbury (jsalisbury) wrote :

I built a Bionic test kernel with the three commits mentioned in the bug description. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1758206

Can you test this kernel and see if it resolves this bug?

Note, to test this kernel, you need to install both the linux-image and linux-image-extra .deb packages.

Thanks in advance!

Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Triaged → In Progress
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-03-29 11:31 EDT-------
(In reply to comment #10)
> I built a Bionic test kernel with the three commits mentioned in the bug
> description. The test kernel can be downloaded from:
> http://kernel.ubuntu.com/~jsalisbury/lp1758206
>
> Can you test this kernel and see if it resolves this bug?
>
> Note, to test this kernel, you need to install both the linux-image and
> linux-image-extra .deb packages.
>
> Thanks in advance!

Tried with the given kernel; kexec still failed. Please find the logs below.

root@ltc-wspoon4:~# ppc64_cpu --smt
SMT is off
root@ltc-wspoon4:~# kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
/var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.15.0-12-generic
kdump initrd:
/var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.15.0-12-generic
current state: ready to kdump

kexec command:
/sbin/kexec -p --command-line="root=UUID=0266024d-8ea3-4132-ad62-b49befd6f8d9 ro quiet splash nr_cpus=1 systemd.unit=kdump-tools.service irqpoll noirqdistrib nousb" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
root@ltc-wspoon4:~# echo "c" > /proc/sysrq-trigger
[ 951.567597] sysrq: SysRq : This sysrq operation is disabled.
root@ltc-wspoon4:~# echo 1 > /proc/sys/kernel/sysrq
root@ltc-wspoon4:~# echo "c" > /proc/sysrq-trigger
[ 968.396522] sysrq: SysRq : Trigger a crash
[ 968.396558] Unable to handle kernel paging request for data at address 0x00000000
[ 968.396602] Faulting instruction address: 0xc0000000007ec768
[ 968.396640] Oops: Kernel access of bad area, sig: 11 [#1]
[ 968.396670] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 968.396703] Modules linked in: idt_89hpesx(E) at24 uio_pdrv_genirq ofpart cmdlinepart powernv_flash mtd uio ibmpowernv ipmi_powernv vmx_crypto ipmi_devintf ipmi_msghandler opal_prd crct10dif_vpmsum sch_fq_codel ip_tables x_tables autofs4 ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_vpmsum drm tg3 libahci
[ 968.396893] CPU: 28 PID: 3086 Comm: bash Tainted: G E 4.15.0-12-generic #13~lp1758206
[ 968.396944] NIP: c0000000007ec768 LR: c0000000007ed6a8 CTR: c0000000007ec740
[ 968.396989] REGS: c0000000054fb9f0 TRAP: 0300 Tainted: G E (4.15.0-12-generic)
[ 968.397040] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28222222 XER: 20040000
[ 968.397090] CFAR: c0000000007ed6a4 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 1
[ 968.397090] GPR00: c0000000007ed6a8 c0000000054fbc70 c0000000016eaf00 0000000000000063
[ 968.397090] GPR04: c000001ff76bce18 c000001ff76d4368 9000000000009033 000000000000000a
[ 968.397090] GPR08: 0000000000000007 0000000000000001 0000000000000000 9000000000001003
[ 968.397090] GPR12: c0000000007ec740 c000000007a33400 00000a463c88ae48 0000000000000000
[ 968.397090] GPR16: 00000a462439e9f0 00000a4624431998 00000a46244319d0 00000a4624468204
[ 968.397090] GPR20: 0000000000000000 0000000000000001 0000000000000000 00007ffff9ecd164
[ 968.397090] GPR24: 00007ffff9ecd160 00000a462446afc4 c0000000015e9968 0000000000000002
[ 968.397090] GPR28: 0...

Joseph Salisbury (jsalisbury) wrote :

Can you confirm the appropriate test kernel was booted with 'uname -a'? You should see the text 'lp1758206' in the kernel name.
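
A minimal sketch of that check, assuming the test build keeps the 'lp1758206' tag in its `uname -a` output as described. The function name `is_test_kernel` is hypothetical.

```shell
# Succeeds when the given `uname -a` text carries the test build tag.
# With no argument it inspects the running kernel.
is_test_kernel() {
    info="${1:-$(uname -a)}"
    case "$info" in
        *lp1758206*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

For example, `is_test_kernel && echo "test kernel booted"` prints the message only when the tagged build is running.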

Also, could more patches be needed beyond the three posted in the description?

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=04b9c96eae72d862726f2f4bfcec2078240c33c5
("powerpc/crash: Remove the test for cpu_online in the IPI callback")

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4145f358644b970fcff293c09fdcc7939e8527d2
("powernv/kdump: Fix cases where the kdump kernel can get HMI's")

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=910961754572a2f4c83ad7e610d180
("powerpc/kdump: Fix powernv build break when KEXEC_CORE=n")

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-03-30 01:14 EDT-------
Tested again with the given kernel; dump capture is successful with smt=2 and smt=off.

Sorry for the wrong update in the previous comment; I am not sure what I missed yesterday.

root@ltc-wspoon4:~# uname -a
Linux ltc-wspoon4 4.15.0-12-generic #13~lp1758206 SMP Tue Mar 27 15:20:59 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@ltc-wspoon4:~# ppc64_cpu --smt=off
root@ltc-wspoon4:~#
root@ltc-wspoon4:~# echo 1 > /proc/sys/kernel/sysrq
root@ltc-wspoon4:~# echo "c" > /proc/sysrq-trigger
[ 1424.806117] sysrq: SysRq : Trigger a crash
[ 1424.806163] Unable to handle kernel paging request for data at address 0x00000000
[ 1424.806267] Faulting instruction address: 0xc0000000007ec768
[ 1424.806352] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1424.806424] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 1424.806483] Modules linked in: idt_89hpesx(E) at24 ofpart uio_pdrv_genirq cmdlinepart powernv_flash uio mtd opal_prd ipmi_powernv ipmi_devintf ibmpowernv vmx_crypto ipmi_msghandler crct10dif_vpmsum sch_fq_codel ip_tables x_tables autofs4 ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_vpmsum drm tg3 libahci
[ 1424.806828] CPU: 0 PID: 3110 Comm: bash Tainted: G E 4.15.0-12-generic #13~lp1758206
[ 1424.806963] NIP: c0000000007ec768 LR: c0000000007ed6a8 CTR: c0000000007ec740
[ 1424.807075] REGS: c000001fce3d39f0 TRAP: 0300 Tainted: G E (4.15.0-12-generic)
[ 1424.807211] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28222222 XER: 20040000
[ 1424.807325] CFAR: c0000000007ed6a4 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 1
[ 1424.807325] GPR00: c0000000007ed6a8 c000001fce3d3c70 c0000000016eaf00 0000000000000063
[ 1424.807325] GPR04: c000001ff6fbce18 c000001ff6fd4368 9000000000009033 000000000000000a
[ 1424.807325] GPR08: 0000000000000007 0000000000000001 0000000000000000 9000000000001003
[ 1424.807325] GPR12: c0000000007ec740 c000000007a20000 000006127f00ae48 0000000000000000
[ 1424.807325] GPR16: 000006124f78e9f0 000006124f821998 000006124f8219d0 000006124f858204
[ 1424.807325] GPR20: 0000000000000000 0000000000000001 0000000000000000 00007fffd6e57524
[ 1424.807325] GPR24: 00007fffd6e57520 000006124f85afc4 c0000000015e9968 0000000000000002
[ 1424.807325] GPR28: 0000000000000063 0000000000000004 c000000001572a9c c0000000015e9d08
[ 1424.808272] NIP [c0000000007ec768] sysrq_handle_crash+0x28/0x30
[ 1424.808364] LR [c0000000007ed6a8] __handle_sysrq+0xf8/0x2c0
[ 1424.808417] Call Trace:
[ 1424.808468] [c000001fce3d3c70] [c0000000007ed688] __handle_sysrq+0xd8/0x2c0 (unreliable)
[ 1424.808582] [c000001fce3d3d10] [c0000000007edeb4] write_sysrq_trigger+0x64/0x90
[ 1424.808690] [c000001fce3d3d40] [c00000000047dfe8] proc_reg_write+0x88/0xd0
[ 1424.808782] [c000001fce3d3d70] [c0000000003d131c] __vfs_write+0x3c/0x70
[ 1424.808875] [c000001fce3d3d90] [c0000000003d1578] vfs_write+0xd8/0x220
[ 1424.808957] [c000001fce3d3de0] [c0000000003d1898] SyS_write+0x68/0x110
[ 1424.809038] [c000001fce3d3e30] [c00000000000b184] system_call+0x58/0x6c
[ 1424.809139] Instruction dump:
[ 1424.809191] 4bfff9f1 4bfffe50 3c4c00f0 3842e7c0 7c0802a6 60000000 ...

Joseph Salisbury (jsalisbury) wrote :
Seth Forshee (sforshee)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-05 07:34 EDT-------
The issue is resolved in the 4.15.0-15-generic kernel.
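
A rough way to check whether a Bionic kernel is at or above that ABI is sketched below. It assumes release strings of the form `4.15.0-<abi>-<flavour>` as printed by `uname -r`. The function name `kernel_has_fix` is hypothetical. A custom test build such as the earlier 4.15.0-12 lp1758206 kernel carries the patches but would not pass this check.

```shell
# Succeeds if the release string is a 4.15.0 kernel with ABI number >= 15.
# Returns 2 for release strings outside the 4.15.0 series.
kernel_has_fix() {
    release="${1:-$(uname -r)}"
    case "$release" in
        4.15.0-*) ;;
        *) return 2 ;;
    esac
    abi="${release#4.15.0-}"    # e.g. "15-generic"
    abi="${abi%%-*}"            # e.g. "15"
    [ "$abi" -ge 15 ]
}
```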

root@ltc-wspoon4:~# ppc64_cpu --smt
SMT is off

Starting Kernel crash dump capture service...
[ 11.747657] kdump-tools[952]: Starting kdump-tools: * running makedumpfile -c -d 31 /proc/vmcore /var/crash/201804050626/dump-incomplete
Copying data : [100.0 %] \ eta: 0s
[ 27.390223] kdump-tools[952]: The kernel version is not supported.
[ 27.390438] kdump-tools[952]: The makedumpfile operation may be incomplete.
[ 27.390563] kdump-tools[952]: The dumpfile is saved to /var/crash/201804050626/dump-incomplete.
[ 27.390726] kdump-tools[952]: makedumpfile Completed.
[ 27.405543] kdump-tools[952]: * kdump-tools: saved vmcore in /var/crash/201804050626
[ 30.762418] kdump-tools[952]: * running makedumpfile --dump-dmesg /proc/vmcore /var/crash/201804050626/dmesg.201804050626
[ 30.802776] kdump-tools[952]: The kernel version is not supported.
[ 30.802923] kdump-tools[952]: The makedumpfile operation may be incomplete.
[ 30.803025] kdump-tools[952]: The dmesg log is saved to /var/crash/201804050626/dmesg.201804050626.
[ 30.803145] kdump-tools[952]: makedumpfile Completed.
[ 30.803263] kdump-tools[952]: * kdump-tools: saved dmesg content in /var/crash/201804050626
[ 30.888353] kdump-tools[952]: Thu, 05 Apr 2018 06:26:24 -0500
[ 31.035631] kdump-tools[952]: Rebooting.
[ 31.126613] reboot: Restarting system
[ 1577.265030518,5] OPAL: Reboot request...

root@ltc-wspoon4:~# ppc64_cpu --smt
SMT=2

Starting Kernel crash dump capture service...
[ 13.378626] kdump-tools[952]: Starting kdump-tools: * running makedumpfile -c -d 31 /proc/vmcore /var/crash/201804050631/dump-incomplete
Copying data : [100.0 %] | eta: 0s
[ 27.102530] kdump-tools[952]: The kernel version is not supported.
[ 27.102659] kdump-tools[952]: The makedumpfile operation may be incomplete.
[ 27.102787] kdump-tools[952]: The dumpfile is saved to /var/crash/201804050631/dump-incomplete.
[ 27.102910] kdump-tools[952]: makedumpfile Completed.
[ 27.112064] kdump-tools[952]: * kdump-tools: saved vmcore in /var/crash/201804050631
[ 29.632162] kdump-tools[952]: * running makedumpfile --dump-dmesg /proc/vmcore /var/crash/201804050631/dmesg.201804050631
[ 29.672730] kdump-tools[952]: The kernel version is not supported.
[ 29.672890] kdump-tools[952]: The makedumpfile operation may be incomplete.
[ 29.672997] kdump-tools[952]: The dmesg log is saved to /var/crash/201804050631/dmesg.201804050631.
[ 29.673111] kdump-tools[952]: makedumpfile Completed.
[ 29.673249] kdump-tools[952]: * kdump-tools: saved dmesg content in /var/crash/201804050631
[ 29.774672] kdump-tools[952]: Thu, 05 Apr 2018 06:31:40 -0500
[ 29.913780] kdump-tools[952]: Rebooting.
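
The two runs above report the SMT state in different formats ("SMT is off" and "SMT=2"). A small sketch that normalises them, useful when scripting the check for both modes, is shown here. The function name `smt_mode` is hypothetical, and only the two formats seen in this report are handled.

```shell
# Read `ppc64_cpu --smt` output on stdin and print "off" or the thread count.
# Returns 1 if no recognised line is found.
smt_mode() {
    while IFS= read -r line; do
        case "$line" in
            "SMT is off") echo off; return 0 ;;
            SMT=*)        echo "${line#SMT=}"; return 0 ;;
        esac
    done
    return 1
}
```

Typical use would be `ppc64_cpu --smt | smt_mode`.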

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-15.16

---------------
linux (4.15.0-15.16) bionic; urgency=medium

  * linux: 4.15.0-15.16 -proposed tracker (LP: #1761177)

  * FFe: Enable configuring resume offset via sysfs (LP: #1760106)
    - PM / hibernate: Make passing hibernate offsets more friendly

  * /dev/bcache/by-uuid links not created after reboot (LP: #1729145)
    - SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent

  * Ubuntu18.04:POWER9:DD2.2 - Unable to start a KVM guest with default machine
    type(pseries-bionic) complaining "KVM implementation does not support
    Transactional Memory, try cap-htm=off" (kvm) (LP: #1752026)
    - powerpc: Use feature bit for RTC presence rather than timebase presence
    - powerpc: Book E: Remove unused CPU_FTR_L2CSR bit
    - powerpc: Free up CPU feature bits on 64-bit machines
    - powerpc: Add CPU feature bits for TM bug workarounds on POWER9 v2.2
    - powerpc/powernv: Provide a way to force a core into SMT4 mode
    - KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
    - KVM: PPC: Book3S HV: Work around XER[SO] bug in fake suspend mode
    - KVM: PPC: Book3S HV: Work around TEXASR bug in fake suspend state

  * Important Kernel fixes to be backported for Power9 (kvm) (LP: #1758910)
    - powerpc/mm: Fixup tlbie vs store ordering issue on POWER9

  * Ubuntu 18.04 - IO Hang on some namespaces when running HTX with 16
    namespaces (Bolt / NVMe) (LP: #1757497)
    - powerpc/64s: Fix lost pending interrupt due to race causing lost update to
      irq_happened

  * fwts-efi-runtime-dkms 18.03.00-0ubuntu1: fwts-efi-runtime-dkms kernel module
    failed to build (LP: #1760876)
    - [Packaging] include the retpoline extractor in the headers

linux (4.15.0-14.15) bionic; urgency=medium

  * linux: 4.15.0-14.15 -proposed tracker (LP: #1760678)

  * [Bionic] mlx4 ETH - mlnx_qos failed when set some TC to vendor
    (LP: #1758662)
    - net/mlx4_en: Change default QoS settings

  * AT_BASE_PLATFORM in AUXV is absent on kernels available on Ubuntu 17.10
    (LP: #1759312)
    - powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features

  * Bionic update to 4.15.15 stable release (LP: #1760585)
    - net: dsa: Fix dsa_is_user_port() test inversion
    - openvswitch: meter: fix the incorrect calculation of max delta_t
    - qed: Fix MPA unalign flow in case header is split across two packets.
    - tcp: purge write queue upon aborting the connection
    - qed: Fix non TCP packets should be dropped on iWARP ll2 connection
    - sysfs: symlink: export sysfs_create_link_nowarn()
    - net: phy: relax error checking when creating sysfs link netdev->phydev
    - devlink: Remove redundant free on error path
    - macvlan: filter out unsupported feature flags
    - net: ipv6: keep sk status consistent after datagram connect failure
    - ipv6: old_dport should be a __be16 in __ip6_datagram_connect()
    - ipv6: sr: fix NULL pointer dereference when setting encap source address
    - ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state
    - mlxsw: spectrum_buffers: Set a minimum quota for CPU port traffic
    - net: phy: Tell caller result ...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released