Ubuntu 16.10: fadump fails with a kernel panic reported while dumping, and the console hangs, on a 32TB Brazos system (kdump)

Bug #1627036 reported by bugproxy
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Triaged
Importance: High
Assigned to: Canonical Kernel Team

Bug Description

== Comment: #0 - Praveen K. Pandey <email address hidden> - 2016-07-17 02:37:31 ==
Hi,

On Ubuntu 16.10 I tried fadump on a Brazos system (32TB of memory and 192 cores). When I triggered a crash, a kernel panic was reported while dumping and the console hung.

Steps to reproduce:

1- Install Ubuntu 16.10
2- Boot the system with 31TB and 192 cores
3- Configure fadump on the system
4- Verify that fadump is running on the system
5- Trigger a panic on the system
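
Steps 4 and 5 can be sketched in Python as follows. This is a minimal sketch, not the reporter's actual procedure: it assumes the sysfs flags `/sys/kernel/fadump_enabled` and `/sys/kernel/fadump_registered` described in the kernel's firmware-assisted dump documentation for kernels of this era, and the helper names `fadump_state` and `trigger_crash` are hypothetical.

```python
from pathlib import Path


def fadump_state(sysfs_root="/sys/kernel"):
    """Return (enabled, registered) read from the fadump sysfs flags.

    Assumption: the flags live at <sysfs_root>/fadump_enabled and
    <sysfs_root>/fadump_registered and contain "1" when set.
    A missing or unreadable file is treated as "not set".
    """
    def flag(name):
        try:
            return (Path(sysfs_root) / name).read_text().strip() == "1"
        except OSError:
            return False

    return flag("fadump_enabled"), flag("fadump_registered")


def trigger_crash(trigger_path="/proc/sysrq-trigger"):
    """Write "c" to the SysRq trigger file, which crashes the kernel.

    Run this as root and only on a test system: it takes the machine down.
    """
    Path(trigger_path).write_text("c")


# Typical use on a test system (commented out on purpose):
# enabled, registered = fadump_state()
# if enabled and registered:
#     trigger_crash()
```

The check is kept separate from the trigger so that the state can be inspected safely; only `trigger_crash()` is destructive.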

Actual Result

Unable to capture the fadump; the kernel panics and the console hangs.

Expected Result

The fadump is captured.

Log:

root@ltc-brazos1:~# kdump-config show
DUMP_MODE: fadump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.4.0-30-generic
kdump initrd:
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.4.0-30-generic
current state: ready to fadump
root@ltc-brazos1:~#
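
The `kdump-config show` output above can be checked programmatically. The following is a minimal sketch that assumes the output keeps the `key: value` layout shown in this log; the function names are hypothetical and not part of kdump-tools.

```python
def parse_kdump_config_show(text):
    """Split `kdump-config show` output into a dict of key -> value.

    Lines without a value after the first colon (such as "kdump initrd:")
    are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def ready_for_fadump(fields):
    """True when the parsed output reports fadump mode and a ready state."""
    return (
        fields.get("DUMP_MODE") == "fadump"
        and fields.get("current state") == "ready to fadump"
    )


sample = """DUMP_MODE: fadump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
kdump initrd:
current state: ready to fadump
"""
fields = parse_kdump_config_show(sample)
print(ready_for_fadump(fields), fields["KDUMP_COREDIR"])
```

The exact state string is taken from this report; other kdump-tools versions may word it differently.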

root@ltc-brazos1:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinux-4.4.0-30-generic root=UUID=516c4b1b-6700-4b55-bd37-d61c4c5af6af ro quiet splash fadump=on fadump_reserve_mem=4096M crashkernel=4096M
root@ltc-brazos1:~#
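
To read the reservation settings out of the command line above, a small parser is enough. This is a sketch under stated assumptions: the helper names are hypothetical, and `parse_size` only handles plain sizes with a K/M/G/T suffix, not range syntax such as `384M-:128M`.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_cmdline(cmdline):
    """Map each kernel parameter to its value (True for bare flags).

    If a parameter appears more than once, the last occurrence wins.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def parse_size(text):
    """Convert a size such as "4096M" to bytes (binary units)."""
    suffix = text[-1:].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


cmdline = (
    "BOOT_IMAGE=/boot/vmlinux-4.4.0-30-generic "
    "root=UUID=516c4b1b-6700-4b55-bd37-d61c4c5af6af ro quiet splash "
    "fadump=on fadump_reserve_mem=4096M crashkernel=4096M"
)
params = parse_cmdline(cmdline)
print(params["fadump"], parse_size(params["fadump_reserve_mem"]))
# prints: on 4294967296
```

On this system the reservation is therefore 4 GiB, against 32TB of installed memory.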

ltc-brazos1 login: [ 442.749993] sysrq: SysRq : Trigger a crash
[ 442.750031] Unable to handle kernel paging request for data at address 0x00000000
[ 442.750037] Faulting instruction address: 0xc000000000670014
[ 442.750043] Oops: Kernel access of bad area, sig: 11 [#1]
[ 442.750047] SMP NR_CPUS=2048 NUMA pSeries
[ 442.750053] Modules linked in: pseries_rng btrfs xor raid6_pq rtc_generic sunrpc autofs4 ses enclosure ipr
[ 442.750068] CPU: 157 PID: 403890 Comm: bash Not tainted 4.4.0-30-generic #49-Ubuntu
[ 442.750074] task: c00003f97b0af640 ti: c00003f97b104000 task.ti: c00003f97b104000
[ 442.750079] NIP: c000000000670014 LR: c0000000006710c8 CTR: c00000000066ffe0
[ 442.750083] REGS: c00003f97b107990 TRAP: 0300 Not tainted (4.4.0-30-generic)
[ 442.750088] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 28242222 XER: 00000001
[ 442.750100] CFAR: c000000000008468 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 1
GPR00: c0000000006710c8 c00003f97b107c10 c0000000015b5d00 0000000000000063
GPR04: c00001faba749c50 c00001faba75b4e0 c0001f3efe7c0000 0000000000000313
GPR08: 0000000000000007 0000000000000001 0000000000000000 c0001f3efe7cecb8
GPR12: c00000000066ffe0 c00000000bc9d380 ffffffffffffffff 0000000022000000
GPR16: 0000000010170dc8 000001001ef401d8 0000000010140f58 00000000100c7570
GPR20: 0000000000000000 000000001017dd58 0000000010153618 000000001017b608
GPR24: 00003ffff7c9e7b4 0000000000000001 c0000000014f8e58 0000000000000004
GPR28: c0000000014f9218 0000000000000063 c0000000014b11dc 0000000000000000
[ 442.750165] NIP [c000000000670014] sysrq_handle_crash+0x34/0x50
[ 442.750170] LR [c0000000006710c8] __handle_sysrq+0xe8/0x270
[ 442.750174] Call Trace:
[ 442.750179] [c00003f97b107c10] [c000000000e08f28] _fw_tigon_tg3_bin_name+0x2ce58/0x342b0 (unreliable)
[ 442.750186] [c00003f97b107c30] [c0000000006710c8] __handle_sysrq+0xe8/0x270
[ 442.750192] [c00003f97b107cd0] [c000000000671868] write_sysrq_trigger+0x78/0xa0
[ 442.750199] [c00003f97b107d00] [c00000000037ae30] proc_reg_write+0xb0/0x110
[ 442.750205] [c00003f97b107d50] [c0000000002e186c] __vfs_write+0x6c/0xe0
[ 442.750210] [c00003f97b107d90] [c0000000002e25a0] vfs_write+0xc0/0x230
[ 442.750216] [c00003f97b107de0] [c0000000002e35dc] SyS_write+0x6c/0x110
[ 442.750222] [c00003f97b107e30] [c000000000009204] system_call+0x38/0xb4
[ 442.750226] Instruction dump:
[ 442.750229] 38425d20 7c0802a6 f8010010 f821ffe1 60000000 60000000 3d220019 394931e4
[ 442.750238] 39200001 912a0000 7c0004ac 39400000 <992a0000> 38210020 e8010010 7c0803a6
[ 442.750248] ---[ end trace ff61e1bc4dd59a42 ]---
[ 442.752585]

Loading Linux 4.4.0-30-generic ...
Loading initial ramdisk ...
OF stdout device is: /vdevice/vty@30000000
Preparing to boot Linux version 4.4.0-30-generic (buildd@bos01-ppc64el-023) (gcc version 5.3.1 20160413 (Ubuntu/IBM 5.3.1-14ubuntu2.1) ) #49-Ubuntu SMP Fri Jul 1 10:00:36 UTC 2016 (Ubuntu 4.4.0-30.49-generic 4.4.13)
Detected machine type: 0000000000000101
Max number of cores passed to firmware: 256 (NR_CPUS = 2048)
Calling ibm,client-architecture-support... done
command line: BOOT_IMAGE=/boot/vmlinux-4.4.0-30-generic root=UUID=516c4b1b-6700-4b55-bd37-d61c4c5af6af ro quiet splash fadump=on fadump_reserve_mem=4096M crashkernel=4096M
Ignoring mem=0000000100000000 >= ram_top.
memory layout at init:
  memory_limit : 0000000000000000 (16 MB aligned)
  alloc_bottom : 000000000e020000
  alloc_top : 0000000010000000
  alloc_top_hi : 0000000010000000
  rmo_top : 0000000010000000
  ram_top : 0000000010000000
instantiating rtas at 0x000000000e9e0000... done
prom_hold_cpus: skipped
copying OF device tree...
Building dt strings...
Building dt structure...
Device tree strings 0x000000000e030000 -> 0x000000000e0319a4
Device tree struct 0x000000000e040000 -> 0x000000000e640000
Quiescing Open Firmware ...
Booting Linux via __start() ...
 -> smp_release_cpus()
spinning_secondaries = 1535
 <- smp_release_cpus()
 <- setup_system()
[ 0.000000] Kernel panic - not syncing: memblock_virt_alloc_try_nid: Failed to allocate 16777216 bytes align=0x1000000 nid=1 from=0xfffffffffffffff max_addr=0x0
[ 0.000000]
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.4.0-30-generic #49-Ubuntu
[ 0.000000] Call Trace:
[ 0.000000] [c0000000015b39d0] [c000000000af955c] dump_stack+0xb0/0xf0 (unreliable)
[ 0.000000] [c0000000015b3a10] [c000000000af5790] panic+0x100/0x2c0
[ 0.000000] [c0000000015b3aa0] [c000000000ed238c] memblock_virt_alloc_try_nid+0xc0/0xe8
[ 0.000000] [c0000000015b3b30] [c0000000002db69c] __earlyonly_bootmem_alloc.constprop.2+0x50/0x74
[ 0.000000] [c0000000015b3b70] [c000000000afc5fc] vmemmap_populate+0xf8/0x250
[ 0.000000] [c0000000015b3c40] [c000000000afdfa8] sparse_mem_map_populate+0x38/0x64
[ 0.000000] [c0000000015b3c70] [c000000000ed4234] sparse_init+0x1d4/0x298
[ 0.000000] [c0000000015b3d30] [c000000000eb3604] initmem_init+0xabc/0xd68
[ 0.000000] [c0000000015b3e50] [c000000000eab418] setup_arch+0x270/0x300
[ 0.000000] [c0000000015b3f00] [c000000000ea3ae4] start_kernel+0xc4/0x558
[ 0.000000] [c0000000015b3f90] [c000000000008c6c] start_here_common+0x20/0xa8
[ 0.000000] ---[ end Kernel panic - not syncing: memblock_virt_alloc_try_nid: Failed to allocate 16777216 bytes align=0x1000000 nid=1 from=0xfffffffffffffff max_addr=0x0
[ 0.000000]
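
The raw numbers in the boot log above are easier to compare once converted to MiB. The sketch below only decodes values that appear in the log; it does not identify the root cause, and the function names are hypothetical.

```python
import re

MIB = 1 << 20

# Matches the size, alignment and node id in the memblock panic message.
ALLOC_FAILURE = re.compile(
    r"Failed to allocate (\d+) bytes align=(0x[0-9a-fA-F]+) nid=(-?\d+)"
)


def decode_alloc_failure(line):
    """Return size and alignment in MiB plus the NUMA node, or None."""
    match = ALLOC_FAILURE.search(line)
    if match is None:
        return None
    size, align, nid = match.groups()
    return {
        "size_mib": int(size) // MIB,
        "align_mib": int(align, 16) // MIB,
        "nid": int(nid),
    }


def hex_to_mib(value):
    """Convert a hex address string from the prom_init layout to MiB."""
    return int(value, 16) // MIB


panic = (
    "Kernel panic - not syncing: memblock_virt_alloc_try_nid: "
    "Failed to allocate 16777216 bytes align=0x1000000 nid=1"
)
print(decode_alloc_failure(panic))  # {'size_mib': 16, 'align_mib': 16, 'nid': 1}
print(hex_to_mib("0000000010000000"))  # ram_top from the layout: 256
```

So the failing request is a single 16 MiB, 16 MiB-aligned block for node 1, requested from `vmemmap_populate()` during `sparse_init()`, and the `ram_top` printed during early boot corresponds to 256 MiB.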

Regards
Praveen

== Comment: #1 - Praveen K. Pandey <email address hidden> - 2016-07-17 02:40:23 ==

== Comment: #14 - SRIKAR DRONAMRAJU <email address hidden> - 2016-08-31 11:02:28 ==
V3 was posted upstream at http://<email address hidden>.

That should at least solve the problem (at least it wouldn't panic or hang when fadump is triggered).

The patches posted were on top of 4.8-rc3 and apply cleanly on v4.4.
I am not sure which kernel is targeted for 16.10; I hear it is going to be based on v4.8.
Once we know which kernel version Ubuntu is targeting, we can backport the patchset accordingly.

== Comment: #18 - Gary M. Gaydos <email address hidden> - 2016-09-14 16:56:11 ==
Hi Canonical: per comment #14 with the patch set link (quoted here), this bug appears to be fixed using the 4.4.0-34 kernel. Of course, the 16.10 release will use a newer kernel.

> V3 was posted upstream at http://<email address hidden>.
>
> That should atleast solve the problem (atleast it wouldnt panic/hang on triggering fadump)
>
> The patches posted were on top of 4.8-rc3 and apply cleanly on v4.4
> I am not sure what is the kernel targeted for 16.10. I hear its going to be based on v4.8
> Once we know which kernel version ubuntu is targeting we can backport the patchset accordingly.

Exposing a comment from test that was previously private:
(In reply to comment #16)
> Hi Praveen,
>
> I have applied the patches to the Yakkety kernel source and built the *.deb
> files. I have kept them on powerdev.in.ibm.com. Have sent you the access
> details over email

Hi Latha,

Thanks. I tried the patched kernel and the issue seems to be fixed; I was able to capture the fadump.

Log:

root@ltc-brazos1:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinux-4.4.0-34-generic root=UUID=bfdd4041-1b2f-42b1-b202-2c09f781bbcc ro fadump=on quiet splash fadump=on crashkernel=384M-:128M
root@ltc-brazos1:~#

 root@ltc-brazos1:/var/crash# ls
201609140950 kexec_cmd linux-image-4.4.0-34-generic-201609140950.crash
root@ltc-brazos1:/var/crash# cd 201609140950
root@ltc-brazos1:/var/crash/201609140950# ls
dmesg.201609140950 dump.201609140950
root@ltc-brazos1:/var/crash/201609140950#
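
Whether a capture succeeded can be judged from the file names in the crash directory. This is a minimal sketch based only on the names seen in this report (`dump.<timestamp>` here, and `vmcore-incomplete` in a later comment); the function name is hypothetical and other kdump-tools versions may use additional names.

```python
def classify_crash_dir(names):
    """Classify a crash directory from its file names.

    "incomplete": a file ending in "-incomplete" is present
    "complete":   a "dump.<timestamp>" file is present
    "no dump":    neither was found
    """
    names = list(names)
    if any(name.endswith("-incomplete") for name in names):
        return "incomplete"
    if any(name.startswith("dump.") for name in names):
        return "complete"
    return "no dump"


print(classify_crash_dir(["dmesg.201609140950", "dump.201609140950"]))
# prints: complete
```

On a live system the names could come from `os.listdir("/var/crash/201609140950")`.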

Regards
Praveen

== Comment: #20 - Hari Krishna Bathini <email address hidden> - 2016-09-23 03:49:36 ==
Mirroring the bug so Canonical can pick up the fix patches.
Srikar, can you please provide the upstream commit IDs of the fix patches?

Thanks
Hari

== Comment: #21 - Hari Krishna Bathini <email address hidden> - 2016-09-23 03:59:17 ==
(In reply to comment #14)
> V3 was posted upstream at
> http://<email address hidden>.
> ibm.com.
>
> That should atleast solve the problem (atleast it wouldnt panic/hang on
> triggering fadump)
>
> The patches posted were on top of 4.8-rc3 and apply cleanly on v4.4
> I am not sure what is the kernel targeted for 16.10. I hear its going to be
> based on v4.8

Yeah, 16.10 -proposed now has a v4.8-based kernel.

Thanks
Hari

bugproxy (bugproxy) wrote : hmc error screen when fadump triggered

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-143827 severity-critical targetmilestone-inin1610
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-10-04 04:17 EDT-------
Hi

I tried to verify this bug on Ubuntu 16.10 (4.8.0-17). The dump is now captured, but makedumpfile failed, so the dump was left incomplete.

LOG:

[ 92.085437] kdump-tools[10063]: Starting kdump-tools: * running makedumpfile -c -d 31 /proc/vmcore /
p-incomplete
[ 92.095715] kdump-tools[10063]: get_mem_map: Can't distinguish the memory type.
[ 92.096864] kdump-tools[10063]: The kernel version is not supported.
[ 92.097605] kdump-tools[10063]: The makedumpfile operation may be incomplete.
[ 92.098438] kdump-tools[10063]: makedumpfile Failed.
[ 92.099249] kdump-tools[10063]: * kdump-tools: makedumpfile failed, falling back to 'cp'

Ubuntu Yakkety Yak (development branch) ltc-brazos1 hvc0

root@ltc-brazos1:/var/crash/201610040259# ls
vmcore-incomplete
root@ltc-brazos1:/var/crash/201610040259#
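
The `-d 31` in the makedumpfile invocation above is a bit mask of page types to leave out of the dump, and `-c` compresses the pages that remain. The sketch below decodes the mask using the dump-level bits listed in the makedumpfile man page; the function name is hypothetical.

```python
# Dump-level bits as documented for makedumpfile's -d option.
DUMP_LEVEL_BITS = {
    1: "zero pages",
    2: "non-private cache pages",
    4: "private cache pages",
    8: "user process data pages",
    16: "free pages",
}


def excluded_page_types(level):
    """List the page types that a given -d level excludes from the dump."""
    if not 0 <= level <= 31:
        raise ValueError("dump level must be between 0 and 31")
    return [name for bit, name in DUMP_LEVEL_BITS.items() if level & bit]


print(excluded_page_types(31))
```

Level 31 sets all five bits, so every excludable page type is filtered. When makedumpfile fails, as in this log, kdump-tools falls back to `cp` and no filtering is applied.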

Regards
Praveen

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-10-04 05:08 EDT-------
(In reply to comment #27)
> Hi
>
> Trying to Verify this bug on Ubuntu16.10 ( 4.8.0-17) now dump captured
> but makedumpfile failed so dump was in incomplete.
>
> LOG:
>
> 1;-1f[ 92.085437] kdump-tools[10063]:
> Starting kdump-tools: * running makedumpfile -c -d 31 /proc/vmcore /
> p-incomplete
>
> [ 92.095715] kdump-tools[10063]: get_mem_map: Can't distinguish the memory
> type.
>

Bug 146571 / LP Bug 1626269 is being used to track this.

Thanks
Hari

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-10-05 10:23 EDT-------
Hello Hari,

It seems that there is another issue now, correct? Should we open a new defect for it or just track it here? I think that Canonical's action is not clear at this moment.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-10-05 14:49 EDT-------
(In reply to comment #29)
> Hello Hari,
>
> It seems that there is another issue now, correct? Should we open a new
> defect for it, or, just track it here? I think that Canonical's action is
> not clear at this moment.

Hi Breno,

There is already a bug open for the makedumpfile issue that Praveen mentioned (Bug 146571 / LP Bug 1626269). This issue can be considered resolved and verified.

Thanks
Hari

------- Comment From <email address hidden> 2016-10-05 14:53 EDT-------
Based on (In reply to comment #30)
> (In reply to comment #29)
> > Hello Hari,
> >
> > It seems that there is another issue now, correct? Should we open a new
> > defect for it, or, just track it here? I think that Canonical's action is
> > not clear at this moment.
>
> Hi Breno,
>
> There is a bug opened already for the makedumpfile issue (Bug 146571 / LP
> Bug 1626269)
> Praveen mentioned. This issue can be considered resolved/verified.

As the fix patches are in 4.8-rc3 and 16.10 is now based on 4.8, the fixes
must already be in.

>
> Thanks
> Hari
