linux: kdump on Ubuntu 14.04 is not generating a dump.

Bug #1352056 reported by bugproxy
This bug affects 1 person
Affects           Status         Importance   Assigned to     Milestone
linux (Ubuntu)    Fix Released   Undecided    Unassigned
Trusty            Fix Released   High         Chris J Arges
Utopic            Fix Released   Undecided    Unassigned

Bug Description

SRU Justification:

[Impact]
Users of ppc64el hardware need the ability to use crashdumps to do kernel debugging.

[Fix]
Commit upstream and already in utopic:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=429d2e8342954d337abe370d957e78291032d867

[Test Case]
Taken from:
https://wiki.ubuntu.com/Kernel/CrashdumpRecipe
https://help.ubuntu.com/14.04/serverguide/kernel-crash-dump.html

1) apt-get install linux-crashdump
2) increase crashdump size:
sudo vim /etc/default/grub.d/kexec-tools.cfg
set crashkernel=1024M
sudo update-grub
3) reboot the machine
4) sudo sed -i 's/USE_KDUMP=0/USE_KDUMP=1/g' /etc/default/kdump-tools
5) kdump-config show # should return no errors
6) echo 'c' | sudo tee /proc/sysrq-trigger
7) This should crash the machine and kexec into another kernel to dump the core. After the next reboot there should be a crash dump in /var/crash/*
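
Step 4 can be wrapped in a small helper that also confirms the setting changed. This is only a sketch: the function name enable_kdump is hypothetical, and the config path is a parameter so it can be tried on a copy of the file first.

```shell
# Hypothetical helper (not part of kdump-tools): apply the step 4 substitution
# to a kdump-tools defaults file and confirm that USE_KDUMP=1 is now set.
enable_kdump() {
    conf="${1:-/etc/default/kdump-tools}"
    [ -f "$conf" ] || { echo "missing: $conf" >&2; return 1; }
    # Same substitution as step 4 (GNU sed in-place edit).
    sed -i 's/USE_KDUMP=0/USE_KDUMP=1/g' "$conf"
    # Succeed only if an uncommented USE_KDUMP=1 line is present afterwards.
    grep -q '^USE_KDUMP=1' "$conf"
}
```

On a real system this would be run as root with no argument, followed by "kdump-config show" as in step 5.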

--

---Problem Description---
kdump is not producing a dump on powerKVM LE P8 Ubuntu 14.04

---uname output---
3.13.0-30-generic

---Additional Hardware Info---
Power8 LE configuration.

---Patches Installed---
1324544 - kdump-config load fails with vmlinux kernel (vs. vmlinuz)

Machine Type = 8247-22L

---Steps to Reproduce---
Installed kdump-tools 1.5.5-2ubuntu1 and crash 7.0.3-3ubuntu3.
Updated /etc/default/kdump-tools; at first I changed only USE_KDUMP=1. After rebooting the node I see:
root=UUID=87986483-5fec-4b4d-b22e-bf2a72096df8 ro quiet splash crashkernel=384M-:128M
root@c656f2n02:~# cat /proc/sys/kernel/sysrq
1
root@c656f2n02:~# cat /proc/sys/kernel/sysrq
1
root@c656f2n02:~# ^Cnd /proc | grep sysrq
root@c656f2n02:~# kdump-config status
current state : ready to kdump
root@c656f2n02:~# kdump-config show
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
current state: ready to kdump

kexec command:
  /sbin/kexec -p --args-linux --command-line="root=UUID=87986483-5fec-4b4d-b22e-bf2a72096df8 ro quiet splash irqpoll maxcpus=1 nousb" --initrd=/boot/initrd.img-3.13.0-30-generic /boot/vmlinux-3.13.0-30-generic

root@c656f2n02:/boot/grub# cat /sys/kernel/kexec_crash_loaded
1
root@c656f2n02:/boot/grub# cat /sys/kernel/kexec_loaded
0
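
The manual checks above (a crashkernel= reservation on the command line, and kexec_crash_loaded reading 1) could be scripted roughly as follows. The function names are hypothetical, and the file paths are parameters that default to the real kernel files so the logic can be tried against sample files.

```shell
# Print the value of the crashkernel= parameter from a kernel command line
# string, or return non-zero if the parameter is absent.
crashkernel_param() {
    for word in $1; do
        case "$word" in
            crashkernel=*) echo "${word#crashkernel=}"; return 0 ;;
        esac
    done
    return 1
}

# Report whether memory is reserved for a crash kernel and one is loaded.
# $1: command line file, $2: kexec_crash_loaded file.
kdump_ready() {
    cmdline_file="${1:-/proc/cmdline}"
    loaded_file="${2:-/sys/kernel/kexec_crash_loaded}"
    crashkernel_param "$(cat "$cmdline_file")" >/dev/null || {
        echo "no crashkernel= on the command line"; return 1; }
    [ "$(cat "$loaded_file")" = "1" ] || {
        echo "crash kernel not loaded"; return 1; }
    echo "ready to kdump"
}
```

Note that on the affected system both checks pass and the dump still fails, so they are a precondition rather than proof that kdump works.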

echo c > /proc/sysrq-trigger

root@c656f2n02:/var/log# echo c > /proc/sysrq-trigger
[ 1956.014243] SysRq : Trigger a crash
[ 1956.014328] Unable to handle kernel paging request for data at address 0x00000000
[ 1956.014404] Faulting instruction address: 0xc000000000586c2c
[ 1956.014468] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1956.014518] SMP NR_CPUS=2048 NUMA PowerNV
[ 1956.014570] Modules linked in: ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables autofs4 rdma_ucm(OF) ib_ucm(OF) rdma_cm(OF) iw_cm(OF) ib_ipoib(OF) ib_cm(OF) ib_uverbs(OF) ib_umad(OF) mlx5_ib(OF) mlx5_core(OF) mlx4_ib(OF) ib_sa(OF) ib_mad(OF) ib_core(OF) ib_addr(OF) mlx4_en(OF) mlx4_core(OF) compat(OF) nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache rtc_generic powernv_rng ses enclosure ipr
[ 1956.015306] CPU: 146 PID: 2522 Comm: bash Tainted: GF O 3.13.0-30-generic #54-Ubuntu
[ 1956.015394] task: c000003fcabda120 ti: c000003fcac58000 task.ti: c000003fcac58000
[ 1956.015469] NIP: c000000000586c2c LR: c000000000587b8c CTR: c000000000586c00
[ 1956.015543] REGS: c000003fcac5b820 TRAP: 0300 Tainted: GF O (3.13.0-30-generic)
[ 1956.015617] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 42422822 XER: 20000000
[ 1956.015804] CFAR: c000000000009318 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 0
GPR00: c000000000587b8c c000003fcac5baa0 c00000000162e840 0000000000000063
GPR04: c000000002f45bd0 c000000002f564c8 0000000000015ad0 c000000001827480
GPR08: c000000000dfe840 0000000000000000 0000000000000001 0000000000015ad0
GPR12: 0000000042422822 c000000007e5ff00 000001002fe90648 000000001016e008
GPR16: 000000001013ad70 000001002fe94648 000000001016fed0 000000001016e008
GPR20: 00000000100c31e0 0000000000000000 0000000010171fc8 000000001016f840
GPR24: 0000000000000004 0000000000000000 0000000000000001 c0000000014b7dc8
GPR28: c000000001974c90 0000000000000063 c00000000148d9c0 c0000000014b8188
[ 1956.016794] NIP [c000000000586c2c] .sysrq_handle_crash+0x2c/0x40
[ 1956.016858] LR [c000000000587b8c] .__handle_sysrq+0xfc/0x260
[ 1956.016920] Call Trace:
[ 1956.016948] [c000003fcac5baa0] [0000000010172a34] 0x10172a34 (unreliable)
[ 1956.017025] [c000003fcac5bb10] [c000000000587b8c] .__handle_sysrq+0xfc/0x260
[ 1956.017101] [c000003fcac5bbd0] [c000000000588324] .write_sysrq_trigger+0x74/0x90
[ 1956.017190] [c000003fcac5bc50] [c0000000002dff1c] .proc_reg_write+0xac/0x110
[ 1956.017266] [c000003fcac5bcf0] [c000000000254c00] .vfs_write+0xe0/0x260
[ 1956.017342] [c000003fcac5bd90] [c0000000002558f4] .SyS_write+0x64/0xe0
[ 1956.017418] [c000003fcac5be30] [c00000000000a158] syscall_exit+0x0/0x98
[ 1956.017492] Instruction dump:
[ 1956.017530] 4bffffac 7c0802a6 f8010010 f821ff91 60000000 60000000 3d42001f 392a8ca8
[ 1956.017658] 39400001 91490000 7c0004ac 39200000 <99490000> 38210070 e8010010 7c0803a6
[ 1956.017894] ---[ end trace d163ff42366bde72 ]---
[ 1956.017986]
[ 1956.018042] Sending IPI to other CPUs
[ 1956.019188] IPI complete
 -> smp_release_cpus()
spinning_secondaries = 159
 <- smp_release_cpus()
 <- setup_system()
The console remains at this message until I power cycle the CEC. There is no /proc/vmcore on reboot.

I recreated the hang on my victim node.
Some CPUs are hitting the interrupt vector at 0x4400. I think this is due to the missing commit 429d2e834295 "powerpc: Fix kdump hang issue on p8 with relocation on exception enabled." from Mahesh, but I need to double check, since it may not be the only patch missing.

Confirmed: the patch I mentioned fixes the hang.
Here are the commit details :

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=429d2e8342954d337abe370d957e78291032d867

powerpc: Fix kdump hang issue on p8 with relocation on exception enabled.

On p8 systems, with relocation on exception feature enabled we are seeing
kdump kernel hang at interrupt vector 0xc*4400. The reason is, with this
feature enabled, exception are raised with MMU (IR=DR=1) ON with the
default offset of 0xc*4000. Since exception is raised in virtual mode it
requires the vector region to be executable without which it fails to
fetch and execute instruction at 0xc*4xxx. For default kernel since kernel
is loaded at real 0, the htab mappings sets the entire kernel text region
executable. But for relocatable kernel (e.g. kdump case) we only copy
interrupt vectors down to real 0 and never marked that region as
executable because in p7 and below we always get exception in real mode.

This patch fixes this issue by marking htab mapping range as executable
that overlaps with the interrupt vector region for relocatable kernel.

Thanks to Ben who helped me to debug this issue and find the root cause.

Signed-off-by: Mahesh Salgaonkar <email address hidden>
Signed-off-by: Benjamin Herrenschmidt <email address hidden>

I think this bug should be mirrored to Ubuntu so they can include this patch in the 14.04 kernel, and maybe in the 14.10 kernel too.

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-112931 severity-high targetmilestone-inin1404
Luciano Chavez (lnx1138)
affects: ubuntu → linux (Ubuntu)
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1352056

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
Luciano Chavez (lnx1138)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: ppc64el
Chris J Arges (arges)
Changed in linux (Ubuntu):
assignee: nobody → Chris J Arges (arges)
importance: Medium → High
status: Confirmed → In Progress
Changed in linux (Ubuntu Trusty):
assignee: nobody → Chris J Arges (arges)
importance: Undecided → High
status: New → In Progress
summary: - kdump on Ubuntu 14.04 is not generating a dump.
+ linux: kdump on Ubuntu 14.04 is not generating a dump.
Chris J Arges (arges)
Changed in linux (Ubuntu Utopic):
status: In Progress → Fix Released
assignee: Chris J Arges (arges) → nobody
importance: High → Medium
importance: Medium → Undecided
Chris J Arges (arges)
description: updated
Chris J Arges (arges)
description: updated
Chris J Arges (arges) wrote :

SRU sent to kernel ML.

Chris J Arges (arges) wrote :

FYI: tested with the patched kernel plus fixes for kexec-tools and makedumpfile; I was able to crash the kernel and run crash on the dump on 3.13.

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Mauricio Faria de Oliveira (mfo) wrote :

Hi Chris,

Thanks for fixing this.

One question:

> [Test Case]
> Taken from:
> https://wiki.ubuntu.com/Kernel/CrashdumpRecipe
> [...]
> 2) increase crashdump size:
> sudo vim /etc/default/grub.d/kexec-tools.cfg
> set crashkernel=1024M

Is there any possibility of making this value per-arch?
That way we can set more for ppc64el, and not mess with other arches.

Crashdump would then work by default (at least on ppc64el), without users having to manually adjust the crashkernel size.

I'd be happy to provide a patch/ideas if you think it's OK to go. (I understand it might have to be later, since the feature freeze took effect some time ago.)
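
To make the per-arch idea concrete, here is a minimal sketch. It is purely illustrative: crashkernel_default is a hypothetical name, and the values are just the ones mentioned in this report (1024M from the test case, 384M-:128M from the reporter's command line), not anything taken from a shipped package.

```shell
# Hypothetical per-arch default for the crashkernel= reservation.
# $1: machine name as printed by `uname -m` (defaults to the running system).
crashkernel_default() {
    case "${1:-$(uname -m)}" in
        ppc64le) echo "1024M" ;;        # ppc64el reports itself as ppc64le
        *)       echo "384M-:128M" ;;   # leave other arches unchanged
    esac
}
```

Something like "crashkernel=$(crashkernel_default)" could then be used when generating the grub configuration, assuming the packaging allows a computed value there.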

Chris J Arges (arges) wrote :

@mauricfo

There is an incoming patch to run kdump at a lower runlevel, which may help reduce memory usage after the kexec; I'd like to see if that fixes the issue without having to bump up memory consumption to 1G.

Either way feel free to track this as a separate issue. First and foremost we should get the entire stack working in utopic/trusty with this small modification; then fix this as well.

Thanks!

Mauricio Faria de Oliveira (mfo) wrote :

@arges

Ok! Thanks for letting us know.

Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Chris J Arges (arges) wrote :

Just verified this in a trusty power8 KVM.

tags: added: verification-done-trusty
removed: verification-needed-trusty
Breno Leitão (breno-leitao) wrote :

Hi Chris,

Regarding the fix for trusty, are you waiting for 14.04.2 release, which is going to contain the 3.16 kernel?

Chris J Arges (arges) wrote :

Hi Breno,
This fix was also applied to the 3.13 series Trusty kernel. So Trusty w/ 3.13, Trusty w/ 3.16 and any newer releases should work once all the SRUs complete and the fixes land in their respective packages.
--chris

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-39.66

---------------
linux (3.13.0-39.66) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1386629

  [ Upstream Kernel Changes ]

  * KVM: x86: Check non-canonical addresses upon WRMSR
    - LP: #1384539
    - CVE-2014-3610
  * KVM: x86: Prevent host from panicking on shared MSR writes.
    - LP: #1384539
    - CVE-2014-3610
  * KVM: x86: Improve thread safety in pit
    - LP: #1384540
    - CVE-2014-3611
  * KVM: x86: Fix wrong masking on relative jump/call
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: Warn if guest virtual address space is not 48-bits
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: Emulator fixes for eip canonical checks on near branches
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: emulating descriptor load misses long-mode case
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: Handle errors when RIP is set during far jumps
    - LP: #1384545
    - CVE-2014-3647
  * kvm: vmx: handle invvpid vm exit gracefully
    - LP: #1384544
    - CVE-2014-3646
  * Input: synaptics - gate forcepad support by DMI check
    - LP: #1381815

linux (3.13.0-38.65) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1379244

  [ Andy Whitcroft ]

  * Revert "SAUCE: scsi: hyper-v storsvc switch up to SPC-3"
    - LP: #1354397
  * [Config] linux-image-extra is additive to linux-image
    - LP: #1375310
  * [Config] linux-image-extra postrm is not needed on purge
    - LP: #1375310

  [ Upstream Kernel Changes ]

  * Revert "KVM: x86: Increase the number of fixed MTRR regs to 10"
    - LP: #1377564
  * Revert "USB: option,zte_ev: move most ZTE CDMA devices to zte_ev"
    - LP: #1377564
  * aufs: bugfix, stop calling security_mmap_file() again
    - LP: #1371316
  * ipvs: fix ipv6 hook registration for local replies
    - LP: #1349768
  * Drivers: add blist flags
    - LP: #1354397
  * sd: fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout
    - LP: #1354397
  * drm/i915/bdw: Add 42ms delay for IPS disable
    - LP: #1374389
  * drm/i915: add null render states for gen6, gen7 and gen8
    - LP: #1374389
  * drm/i915/bdw: 3D_CHICKEN3 has write mask bits
    - LP: #1374389
  * drm/i915/bdw: Disable idle DOP clock gating
    - LP: #1374389
  * drm/i915: call lpt_init_clock_gating on BDW too
    - LP: #1374389
  * drm/i915: shuffle panel code
    - LP: #1374389
  * drm/i915: extract backlight minimum brightness from VBT
    - LP: #1374389
  * drm/i915: respect the VBT minimum backlight brightness
    - LP: #1374389
  * drm/i915/bdw: Apply workarounds in render ring init function
    - LP: #1374389
  * drm/i915/bdw: Cleanup pre prod workarounds
    - LP: #1374389
  * drm/i915: Replace hardcoded cacheline size with macro
    - LP: #1374389
  * drm/i915: Refactor Broadwell PIPE_CONTROL emission into a helper.
    - LP: #1374389
  * drm/i915: Add the WaCsStallBeforeStateCacheInvalidate:bdw workaround.
    - LP: #1374389
  * drm/i915/bdw: Remove BDW preproduction W/As until C stepping.
    - LP: #1374389
  * mptfusion: enable no_write_same for vmware scsi disks
    - LP: #1371591
  * iommu/amd: Fix cleanup_domai...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
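
For anyone checking whether a Trusty 3.13 kernel already carries the fix, a rough version comparison could look like the sketch below. It assumes 3.13.0-39 is the first fixed ABI, based only on the Launchpad Janitor message above, and it applies to the 3.13 series only; has_kdump_fix is a hypothetical name.

```shell
# Return success if kernel release $1 (default: the running kernel) is at or
# above ABI 3.13.0-39. Assumption: that is the first 3.13-series ABI carrying
# the fix; other kernel series would need their own threshold.
has_kdump_fix() {
    release="${1:-$(uname -r)}"
    fixed="3.13.0-39"
    # Drop the flavour suffix, e.g. 3.13.0-30-generic -> 3.13.0-30.
    abi=$(echo "$release" | sed 's/^\([0-9.]*-[0-9]*\).*/\1/')
    # GNU sort -V orders version strings; the smaller of the two comes first.
    lowest=$(printf '%s\n%s\n' "$fixed" "$abi" | sort -V | head -n 1)
    [ "$lowest" = "$fixed" ]
}
```

With the kernel from the original report, "has_kdump_fix 3.13.0-30-generic" returns non-zero, which matches the hang seen there.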