In Ubuntu16.10: Kdump stuck in boot for longer time need to force reboot via HMC in 32TB Brazos System

Bug #1661168 reported by bugproxy
This bug affects 1 person
Affects               Status        Importance  Assigned to   Milestone
kexec-tools (Ubuntu)  Fix Released  High        dann frazier
Yakkety               New           High        Unassigned

Bug Description

Problem Description
===========================
  On Ubuntu 16.10, kdump was tested on a Brazos system (32TB memory, 192 cores). When a panic is triggered, the kdump process gets stuck during boot and a forced reboot is required. After the reboot, the system had captured only a vmcore-incomplete file.

Steps to Reproduce:
======================
1- Install Ubuntu 16.10.
2- Boot the system with 31TB of memory and 192 cores.
3- Configure kdump on the system.
4- Verify that kdump is running on the system.
5- Trigger a panic.

Actual Result
--------------------------
The kdump process gets stuck during boot and a forced reboot is required.

Expected Result
-----------------------------
Kdump proceeds and the vmcore is captured successfully.

LOG:

root@ltc-brazos1:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinux-4.4.0-30-generic root=UUID=516c4b1b-6700-4b55-bd37-d61c4c5af6af ro quiet splash crashkernel=4096M
root@ltc-brazos1:~# kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.4.0-30-generic
kdump initrd:
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.4.0-30-generic
current state: ready to kdump

kexec command:
  /sbin/kexec -p --command-line="BOOT_IMAGE=/boot/vmlinux-4.4.0-30-generic root=UUID=516c4b1b-6700-4b55-bd37-d61c4c5af6af ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
root@ltc-brazos1:~#
root@ltc-brazos1:~# dpkg -l | grep kdump
ii kdump-tools 1:1.6.0-2 all scripts and tools for automating kdump (Linux crash dumps)
root@ltc-brazos1:~#
root@ltc-brazos1:~# echo c > /proc/sysrq-trigger

ltc-brazos1 login: [ 416.229464] sysrq: SysRq : Trigger a crash
[ 416.229496] Unable to handle kernel paging request for data at address 0x00000000
[ 416.229502] Faulting instruction address: 0xc000000000670014
[ 416.229508] Oops: Kernel access of bad area, sig: 11 [#1]
[ 416.229511] SMP NR_CPUS=2048 NUMA pSeries
[ 416.229517] Modules linked in: pseries_rng btrfs xor raid6_pq rtc_generic sunrpc autofs4 ses enclosure ipr
[ 416.229532] CPU: 65 PID: 404785 Comm: bash Not tainted 4.4.0-30-generic #49-Ubuntu
[ 416.229537] task: c00001f9d583c8e0 ti: c00001fa13cd8000 task.ti: c00001fa13cd8000
[ 416.229543] NIP: c000000000670014 LR: c0000000006710c8 CTR: c00000000066ffe0
[ 416.229548] REGS: c00001fa13cdb990 TRAP: 0300 Not tainted (4.4.0-30-generic)
[ 416.229552] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 28242222 XER: 00000001
[ 416.229565] CFAR: c000000000008468 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 1
GPR00: c0000000006710c8 c00001fa13cdbc10 c0000000015b5d00 0000000000000063
GPR04: c00001fab9049c50 c00001fab905b4e0 c0001f3fff3d0000 0000000000000313
GPR08: 0000000000000007 0000000000000001 0000000000000000 c0001f3fff3dec68
GPR12: c00000000066ffe0 c000000007546980 ffffffffffffffff 0000000022000000
GPR16: 0000000010170dc8 00000100174901d8 0000000010140f58 00000000100c7570
GPR20: 0000000000000000 000000001017dd58 0000000010153618 000000001017b608
GPR24: 00003ffff8966c94 0000000000000001 c0000000014f8e58 0000000000000004
GPR28: c0000000014f9218 0000000000000063 c0000000014b11dc 0000000000000000
[ 416.229631] NIP [c000000000670014] sysrq_handle_crash+0x34/0x50
[ 416.229636] LR [c0000000006710c8] __handle_sysrq+0xe8/0x270
[ 416.229640] Call Trace:
[ 416.229645] [c00001fa13cdbc10] [c000000000e08f28] _fw_tigon_tg3_bin_name+0x2ce58/0x342b0 (unreliable)
[ 416.229652] [c00001fa13cdbc30] [c0000000006710c8] __handle_sysrq+0xe8/0x270
[ 416.229658] [c00001fa13cdbcd0] [c000000000671868] write_sysrq_trigger+0x78/0xa0
[ 416.229666] [c00001fa13cdbd00] [c00000000037ae30] proc_reg_write+0xb0/0x110
[ 416.229673] [c00001fa13cdbd50] [c0000000002e186c] __vfs_write+0x6c/0xe0
[ 416.229678] [c00001fa13cdbd90] [c0000000002e25a0] vfs_write+0xc0/0x230
[ 416.229684] [c00001fa13cdbde0] [c0000000002e35dc] SyS_write+0x6c/0x110
[ 416.229690] [c00001fa13cdbe30] [c000000000009204] system_call+0x38/0xb4
[ 416.229695] Instruction dump:
[ 416.229698] 38425d20 7c0802a6 f8010010 f821ffe1 60000000 60000000 3d220019 394931e4
[ 416.229707] 39200001 912a0000 7c0004ac 39400000 <992a0000> 38210020 e8010010 7c0803a6
[ 416.229717] ---[ end trace 16e5fbbf7faa7340 ]---
[ 416.232059]
[ 416.232086] Sending IPI to other CPUs
[ 416.242558] IPI complete
I'm in purgatory
 -> smp_release_cpus()
spinning_secondaries = 1528
 <- smp_release_cpus()
 <- setup_system()
[ 1.146155] sd 0:2:1:0: [sdb] Assuming drive cache: write through
[ 1.154176] sd 0:2:0:0: [sda] Assuming drive cache: write through
/dev/sdb2: recovering journal
/dev/sdb2: clean, 69482/136331264 files, 9047821/545318400 blocks

[Boot splash text "Ubuntu 16.10 . . . ." repeated many times on the console]

After the forced reboot:

root@ltc-brazos1:/var/crash# ls
201607161510 kexec_cmd
root@ltc-brazos1:/var/crash# cd 201607161510/
root@ltc-brazos1:/var/crash/201607161510# ls
vmcore-incomplete
root@ltc-brazos1:

Note: waited more than 2 hours for the kdump process to complete.

Regards
Praveen

== Comment: #12 - Vaishnavi Bhat <email address hidden> - 2016-09-16 02:40:20 ==
root@ltc-brazos1:~# kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.4.0-9136-generic
kdump initrd:
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.4.0-9136-generic
current state: ready to kdump

kexec command:
  /sbin/kexec -p --command-line="BOOT_IMAGE=/boot/vmlinux-4.4.0-9136-generic root=UUID=bfdd4041-1b2f-42b1-b202-2c09f781bbcc ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

root@ltc-brazos1:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinux-4.4.0-9136-generic root=UUID=bfdd4041-1b2f-42b1-b202-2c09f781bbcc ro crashkernel=4096M quiet splash crashkernel=4096M

root@ltc-brazos1:~# dmesg | grep -i crash
[ 0.000000] Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 31744000MB)
[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinux-4.4.0-9136-generic root=UUID=bfdd4041-1b2f-42b1-b202-2c09f781bbcc ro crashkernel=4096M quiet splash crashkernel=4096M

== Comment: #26 - Hari Krishna Bathini <email address hidden> - 2017-02-01 02:02:36 ==
The following kexec-tools commit is needed to fix this issue:

  commit f63d8530b9b6a2d7e79b946e326e5a2197eb8f87
  Author: Petr Tesarik <email address hidden>
  Date: Thu Jan 19 18:37:09 2017 +0100

    ppc64: Reduce number of ELF LOAD segments

    The number of program header table entries (e_phnum) is an Elf64_Half,
    which is a 16-bit entity, i.e. the limit is 65534 entries (one entry is
    reserved for NOTE). This is a hard limit, defined by the ELF standard.
    It is possible that more LMBs (Logical Memory Blocks) are needed to
    represent all RAM on some machines, and this field overflows, causing
    an incomplete /proc/vmcore file.

    This has actually happened on a machine with 31TB of RAM and an LMB size
    of 256MB.

    However, since there is usually no memory hole between adjacent LMBs, the
    map can be "compressed", combining multiple adjacent into a single LOAD
    segment.

    Signed-off-by: Petr Tesarik <email address hidden>
    Signed-off-by: Simon Horman <email address hidden>
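
The arithmetic behind the commit message can be checked with a short sketch. This is an illustration in Python, not the kexec-tools C code: the function names `lmb_count` and `merge_adjacent` are hypothetical, the memory layout with a single hole is invented for the example, and the limit of 65534 LOAD segments is taken from the commit message as quoted above.

```python
from typing import List, Tuple

MiB = 1024 * 1024
TiB = 1024 * 1024 * MiB

# e_phnum is an Elf64_Half (unsigned 16-bit). Per the commit message,
# one program header is reserved for NOTE, leaving 65534 for LOAD.
E_PHNUM_MAX = 0xFFFF
MAX_LOAD_SEGMENTS = E_PHNUM_MAX - 1


def lmb_count(ram_bytes: int, lmb_bytes: int) -> int:
    """Number of LMBs needed to cover ram_bytes (ceiling division)."""
    return -(-ram_bytes // lmb_bytes)


def merge_adjacent(ranges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge half-open [start, end) ranges that touch with no hole between."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and merged[-1][1] == start:
            # No hole between the previous range and this one: extend it.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    lmb_size = 256 * MiB
    n = lmb_count(31 * TiB, lmb_size)
    print("LMBs:", n, "exceeds limit:", n > MAX_LOAD_SEGMENTS)

    # Hypothetical layout: contiguous LMBs with one missing block (a hole).
    hole = 1000
    ranges = [(i * lmb_size, (i + 1) * lmb_size) for i in range(n) if i != hole]
    print("LOAD segments after merging:", len(merge_adjacent(ranges)))
```

With one LOAD segment per LMB, 31TB of RAM at 256MB per LMB needs far more program headers than a 16-bit e_phnum can describe. Once adjacent blocks are merged, the segment count depends only on the number of holes in the memory map.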

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-143828 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → kexec-tools (Ubuntu)
Michael Hohnbaum (hohnbaum) wrote : Re: [Bug 1661168] [NEW] In Ubuntu16.10: Kdump stuck in boot for longer time need to force reboot via HMC in 32TB Brazos System

Louis,

While we can't test this without access to a machine with large amounts
of memory, is it possible to apply this patch and provide an image to
IBM for testing?

                      Michael

On 02/01/2017 11:09 PM, bugproxy wrote:
> Public bug reported:
> [...]

dann frazier (dannf)
Changed in kexec-tools (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → dann frazier (dannf)
status: New → In Progress
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package kexec-tools - 1:2.0.14-1ubuntu2

---------------
kexec-tools (1:2.0.14-1ubuntu2) zesty; urgency=medium

  [ Manoj Iyer ]
  * Enable compressed kernel support for ARM64 (LP: #1661363).

  [ dann frazier ]
  * ppc64-Reduce-number-of-ELF-LOAD-segments.patch: Cherry-pick
    from upstream, fixing kexec on some large memory configurations
    (LP: #1661168).

 -- dann frazier <email address hidden> Fri, 03 Feb 2017 14:49:31 -0700

Changed in kexec-tools (Ubuntu):
status: In Progress → Fix Released
Mathew Hodson (mhodson)
Changed in kexec-tools (Ubuntu):
importance: Undecided → Medium
Changed in kexec-tools (Ubuntu Yakkety):
importance: Undecided → Medium
Changed in kexec-tools (Ubuntu):
importance: Medium → High
Changed in kexec-tools (Ubuntu Yakkety):
importance: Medium → High
bugproxy (bugproxy)
tags: added: targetmilestone-inin1704
removed: targetmilestone-inin---