Ubuntu 16.10: kdump stuck in boot for a long time, requiring a force reboot via HMC, on a 32TB Brazos system
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
kexec-tools (Ubuntu) | Fix Released | High | dann frazier | |
Yakkety | New | High | Unassigned | |
Bug Description
Problem Description
=======
On Ubuntu 16.10, kdump was tested on a Brazos system (32TB of memory and 192 cores). When a panic is triggered, the kdump kernel gets stuck during boot and the system has to be force-rebooted. After the reboot, only a vmcore-incomplete file had been captured.
Steps to Reproduce:
=======
1. Install Ubuntu 16.10.
2. Boot the system with 31TB of memory and 192 cores.
3. Configure kdump on the system.
4. Verify that kdump is running.
5. Trigger a panic.
Actual Result
-------
The kdump process gets stuck during boot and a forced reboot is needed.
Expected Result
-------
Kdump proceeds and the vmcore is captured successfully.
LOG:
root@ltc-brazos1:~# cat /proc/cmdline
BOOT_IMAGE=
root@ltc-brazos1:~# kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.
KDUMP_COREDIR: /var/crash
crashkernel addr:
/var/
kdump initrd:
/var/
current state: ready to kdump
kexec command:
/sbin/kexec -p --command-
root@ltc-brazos1:~#
root@ltc-brazos1:~# dpkg -l | grep kdump
ii kdump-tools 1:1.6.0-2 all scripts and tools for automating kdump (Linux crash dumps)
root@ltc-brazos1:~#
root@ltc-brazos1:~# echo c > /proc/sysrq-trigger
ltc-brazos1 login: [ 416.229464] sysrq: SysRq : Trigger a crash
[ 416.229496] Unable to handle kernel paging request for data at address 0x00000000
[ 416.229502] Faulting instruction address: 0xc000000000670014
[ 416.229508] Oops: Kernel access of bad area, sig: 11 [#1]
[ 416.229511] SMP NR_CPUS=2048 NUMA pSeries
[ 416.229517] Modules linked in: pseries_rng btrfs xor raid6_pq rtc_generic sunrpc autofs4 ses enclosure ipr
[ 416.229532] CPU: 65 PID: 404785 Comm: bash Not tainted 4.4.0-30-generic #49-Ubuntu
[ 416.229537] task: c00001f9d583c8e0 ti: c00001fa13cd8000 task.ti: c00001fa13cd8000
[ 416.229543] NIP: c000000000670014 LR: c0000000006710c8 CTR: c00000000066ffe0
[ 416.229548] REGS: c00001fa13cdb990 TRAP: 0300 Not tainted (4.4.0-30-generic)
[ 416.229552] MSR: 8000000000009033 <SF,EE,
[ 416.229565] CFAR: c000000000008468 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 1
GPR00: c0000000006710c8 c00001fa13cdbc10 c0000000015b5d00 0000000000000063
GPR04: c00001fab9049c50 c00001fab905b4e0 c0001f3fff3d0000 0000000000000313
GPR08: 0000000000000007 0000000000000001 0000000000000000 c0001f3fff3dec68
GPR12: c00000000066ffe0 c000000007546980 ffffffffffffffff 0000000022000000
GPR16: 0000000010170dc8 00000100174901d8 0000000010140f58 00000000100c7570
GPR20: 0000000000000000 000000001017dd58 0000000010153618 000000001017b608
GPR24: 00003ffff8966c94 0000000000000001 c0000000014f8e58 0000000000000004
GPR28: c0000000014f9218 0000000000000063 c0000000014b11dc 0000000000000000
[ 416.229631] NIP [c000000000670014] sysrq_handle_
[ 416.229636] LR [c0000000006710c8] __handle_
[ 416.229640] Call Trace:
[ 416.229645] [c00001fa13cdbc10] [c000000000e08f28] _fw_tigon_
[ 416.229652] [c00001fa13cdbc30] [c0000000006710c8] __handle_
[ 416.229658] [c00001fa13cdbcd0] [c000000000671868] write_sysrq_
[ 416.229666] [c00001fa13cdbd00] [c00000000037ae30] proc_reg_
[ 416.229673] [c00001fa13cdbd50] [c0000000002e186c] __vfs_write+
[ 416.229678] [c00001fa13cdbd90] [c0000000002e25a0] vfs_write+
[ 416.229684] [c00001fa13cdbde0] [c0000000002e35dc] SyS_write+
[ 416.229690] [c00001fa13cdbe30] [c000000000009204] system_
[ 416.229695] Instruction dump:
[ 416.229698] 38425d20 7c0802a6 f8010010 f821ffe1 60000000 60000000 3d220019 394931e4
[ 416.229707] 39200001 912a0000 7c0004ac 39400000 <992a0000> 38210020 e8010010 7c0803a6
[ 416.229717] ---[ end trace 16e5fbbf7faa7340 ]---
[ 416.232059]
[ 416.232086] Sending IPI to other CPUs
[ 416.242558] IPI complete
I'm in purgatory
-> smp_release_cpus()
spinning_
<- smp_release_cpus()
<- setup_system()
[ 1.146155] sd 0:2:1:0: [sdb] Assuming drive cache: write through
[ 1.154176] sd 0:2:0:0: [sda] Assuming drive cache: write through
/dev/sdb2: recovering journal
/dev/sdb2: clean, 69482/136331264 files, 9047821/545318400 blocks
-------
(console keeps repeating the "Ubuntu 16.10 . . . ." boot splash line)
-------
After the forced reboot:
root@ltc-
201607161510 kexec_cmd
root@ltc-
root@ltc-
vmcore-incomplete
root@ltc-brazos1:
Note: waited more than 2 hours for the kdump process.
Regards
Praveen
== Comment: #12 - Vaishnavi Bhat <email address hidden> - 2016-09-16 02:40:20 ==
root@ltc-brazos1:~# kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.
KDUMP_COREDIR: /var/crash
crashkernel addr:
/var/
kdump initrd:
/var/
current state: ready to kdump
kexec command:
/sbin/kexec -p --command-
root@ltc-brazos1:~# cat /proc/cmdline
BOOT_IMAGE=
root@ltc-brazos1:~# dmesg | grep -i crash
[ 0.000000] Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 31744000MB)
[ 0.000000] Kernel command line: BOOT_IMAGE=
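The reservation message above can be checked mechanically. The following is only a minimal sketch: the regular expression is derived from the single dmesg line shown in this comment, and the helper name `parse_crashkernel_reservation` is hypothetical, not part of kdump-tools.

```python
import re
from typing import Optional, Tuple

# Pattern modeled on the dmesg line quoted in this report (an assumption:
# other kernels or architectures may word the message differently).
_RESERVE_RE = re.compile(
    r"Reserving (\d+)MB of memory at (\d+)MB for crashkernel "
    r"\(System RAM: (\d+)MB\)"
)


def parse_crashkernel_reservation(line: str) -> Optional[Tuple[int, int, int]]:
    """Return (reserved_mb, offset_mb, system_ram_mb), or None if no match."""
    match = _RESERVE_RE.search(line)
    if match is None:
        return None
    reserved_mb, offset_mb, ram_mb = (int(g) for g in match.groups())
    return reserved_mb, offset_mb, ram_mb


if __name__ == "__main__":
    sample = ("[ 0.000000] Reserving 4096MB of memory at 128MB "
              "for crashkernel (System RAM: 31744000MB)")
    print(parse_crashkernel_reservation(sample))
```

A `None` result would suggest that no crashkernel memory was reserved at boot, or that the message format differs from the one assumed here.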
== Comment: #26 - Hari Krishna Bathini <email address hidden> - 2017-02-01 02:02:36 ==
The following kexec-tools commit is needed to fix this issue:
commit f63d8530b9b6a2d
Author: Petr Tesarik <email address hidden>
Date: Thu Jan 19 18:37:09 2017 +0100
ppc64: Reduce number of ELF LOAD segments
The number of program header table entries (e_phnum) is an Elf64_Half,
which is a 16-bit entity, i.e. the limit is 65534 entries (one entry is
reserved for NOTE). This is a hard limit, defined by the ELF standard.
It is possible that more LMBs (Logical Memory Blocks) are needed to
represent all RAM on some machines, and this field overflows, causing
an incomplete /proc/vmcore file.
This has actually happened on a machine with 31TB of RAM and an LMB size
of 256MB.
However, since there is usually no memory hole between adjacent LMBs, the
map can be "compressed", combining multiple adjacent into a single LOAD
segment.
Signed-off-by: Petr Tesarik <email address hidden>
Signed-off-by: Simon Horman <email address hidden>
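To illustrate the arithmetic in the commit message, here is a rough sketch in Python. It is not the kexec-tools code (which is C); the function names `lmb_count` and `merge_adjacent` are hypothetical, the 65534 limit is taken from the commit message, and "TB"/"MB" are assumed to be binary units.

```python
from typing import List, Tuple

# Limit on LOAD entries as stated in the commit message
# (16-bit e_phnum, one entry reserved for NOTE).
MAX_LOAD_SEGMENTS = 65534

TIB = 1 << 40
MIB = 1 << 20


def lmb_count(ram_bytes: int, lmb_bytes: int) -> int:
    """Number of LMBs needed to cover ram_bytes (ceiling division)."""
    return -(-ram_bytes // lmb_bytes)


def merge_adjacent(ranges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge (start, end) ranges that touch or overlap; end is exclusive."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # One LOAD segment per LMB: 31TB of RAM with 256MB LMBs.
    n = lmb_count(31 * TIB, 256 * MIB)
    print(n, n > MAX_LOAD_SEGMENTS)

    # Toy layout: eight 256MB LMBs with a hole where LMB 3 would be.
    lmb = 256 * MIB
    lmbs = [(i * lmb, (i + 1) * lmb) for i in range(8) if i != 3]
    print(len(lmbs), "->", len(merge_adjacent(lmbs)))
```

With one segment per LMB the count comes to 126976, well above the limit, whereas merging contiguous LMBs leaves one segment per contiguous run of memory.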
tags: | added: architecture-ppc64le bugnameltc-143828 severity-high targetmilestone-inin--- |
Changed in ubuntu: | |
assignee: | nobody → Taco Screen team (taco-screen-team) |
affects: | ubuntu → kexec-tools (Ubuntu) |
Changed in kexec-tools (Ubuntu): | |
assignee: | Taco Screen team (taco-screen-team) → dann frazier (dannf) |
status: | New → In Progress |
Changed in kexec-tools (Ubuntu): | |
importance: | Undecided → Medium |
Changed in kexec-tools (Ubuntu Yakkety): | |
importance: | Undecided → Medium |
Changed in kexec-tools (Ubuntu): | |
importance: | Medium → High |
Changed in kexec-tools (Ubuntu Yakkety): | |
importance: | Medium → High |
tags: |
added: targetmilestone-inin1704 removed: targetmilestone-inin--- |
Louis,
While we can't test this without access to a machine with large amounts
of memory, is it possible to apply this patch and provide an image to
IBM for testing?
/boot/vmlinux-4.4.0-30-generic root=UUID=516c4b1b-6700-4b55-bd37-d61c4c5af6af ro quiet splash crashkernel=4096M panic_on_oops=1
kdump/vmlinuz: symbolic link to /boot/vmlinux-4.4.0-30-generic
kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.4.0-30-generic
line="BOOT_IMAGE=/boot/vmlinux-4.4.0-30-generic root=UUID=516c4b1b-6700-4b55-bd37-d61c4c5af6af ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
On 02/01/2017 11:09 PM, bugproxy wrote:
> Public bug reported:
>
> Problem Description
> =======
> On Ubuntu 16.10, kdump was tested on a Brazos system (32TB of memory and 192 cores). When a panic is triggered, the kdump kernel gets stuck during boot and the system has to be force-rebooted. After the reboot, only a vmcore-incomplete file had been captured.
>
> Steps to Reproduce:
> =======
> 1. Install Ubuntu 16.10.
> 2. Boot the system with 31TB of memory and 192 cores.
> 3. Configure kdump on the system.
> 4. Verify that kdump is running.
> 5. Trigger a panic.
>
> Actual Result
> -------
> The kdump process gets stuck during boot and a forced reboot is needed.
>
> Expected Result
> -------
> Kdump proceeds and the vmcore is captured successfully.
>
> LOG:
>
> root@ltc-brazos1:~# cat /proc/cmdline
> BOOT_IMAGE=
> root@ltc-brazos1:~# kdump-config show
> DUMP_MODE: kdump
> USE_KDUMP: 1
> KDUMP_SYSCTL: kernel.
> KDUMP_COREDIR: /var/crash
> crashkernel addr:
> /var/lib/
> kdump initrd:
> /var/lib/
> current state: ready to kdump
>
> kexec command:
> /sbin/kexec -p --command-
> root@ltc-brazos1:~#
> root@ltc-brazos1:~# dpkg -l | grep kdump
> ii kdump-tools 1:1.6.0-2 all scripts and tools for automating kdump (Linux crash dumps)
> root@ltc-brazos1:~#
> root@ltc-brazos1:~# echo c > /proc/sysrq-trigger
>
>
> ltc-brazos1 login: [ 416.229464] sysrq: SysRq : Trigger a crash
> [ 416.229496] Unable to handle kernel paging request for data at address 0x00000000
> [ 416.229502] Faulting instruction address: 0xc000000000670014
> [ 416.229508] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 416.229511] SMP NR_CPUS=2048 NUMA pSeries
> [ 416.229517] Modules linked in: pseries_rng btrfs xor raid6_pq rtc_generic sunrpc autofs4 ses enclosure ipr
> [ 416.229532] CPU: 65 ...