Ubuntu 18.04 kernel crashed while in degraded mode

Bug #1770849 reported by bugproxy on 2018-05-12
This bug affects 1 person
Affects:
The Ubuntu-power-systems project: importance Critical, assigned to Canonical Kernel Team
linux (Ubuntu): importance Critical, assigned to Ubuntu on IBM Power Systems Bug Triage
linux (Ubuntu Bionic): importance Critical, assigned to Joseph Salisbury

Bug Description

== SRU Justification ==
IBM reports a kernel crash on Bionic while running in degraded mode
(degraded cores).

IBM created a patch to resolve this bug and has submitted it upstream:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-May/172835.html

The patch has not yet landed in mainline, so it is being submitted
as a SAUCE patch.

== Fix ==
UBUNTU: SAUCE: powerpc/perf: Fix memory allocation for core-imc based on num_possible_cpus()
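
The patch title suggests a sizing problem: a per-core allocation sized from a CPU count that can be smaller than the highest CPU number in use. The following is a minimal Python sketch of that arithmetic only, not the kernel code. The helper names, the 4 threads per core, the 16-CPU layout, and the choice of which core is deconfigured are all assumptions made for illustration, as is the idea that slots are indexed by CPU number divided by threads per core.

```python
import math


def cores_needed(cpu_count, threads_per_core):
    """Number of per-core slots obtained when sizing from a CPU count."""
    return math.ceil(cpu_count / threads_per_core)


def core_index(cpu, threads_per_core):
    """Slot a given CPU number would use (assumed indexing scheme)."""
    return cpu // threads_per_core


# Hypothetical machine: 16 possible CPUs, 4 threads per core,
# with core 1 (CPUs 4-7) deconfigured, so only 12 CPUs are present.
THREADS_PER_CORE = 4
possible_cpus = list(range(16))
present_cpus = [c for c in possible_cpus if not 4 <= c <= 7]

slots_from_present = cores_needed(len(present_cpus), THREADS_PER_CORE)
slots_from_possible = cores_needed(len(possible_cpus), THREADS_PER_CORE)
highest_index = max(core_index(c, THREADS_PER_CORE) for c in present_cpus)

# Sizing from the present count leaves the last core without a slot;
# sizing from the possible count covers every index.
overflows_when_sized_by_present = highest_index >= slots_from_present
overflows_when_sized_by_possible = highest_index >= slots_from_possible
```

In this sketch the hole left by the deconfigured core is what makes the present count too small: CPUs 12-15 still map to slot 3, but only slots 0-2 exist.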

== Regression Potential ==
Low. Limited to powerpc.

== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

kernel crash

The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
[ 64.713154] kexec_core: Starting new kernel
[ 156.281504630,5] OPAL: Switch to big-endian OS
[ 158.440263459,5] OPAL: Switch to little-endian OS
[ 1.889211] Unable to handle kernel paging request for data at address 0x678e549df9e2878c
[ 1.889289] Faulting instruction address: 0xc00000000038aa30
[ 1.889344] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1.889386] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 1.889432] Modules linked in:
[ 1.889468] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 4.15.0-20-generic #21-Ubuntu
[ 1.889545] NIP: c00000000038aa30 LR: c00000000038aa1c CTR: 0000000000000000
[ 1.889608] REGS: c000003fed193840 TRAP: 0380 Not tainted (4.15.0-20-generic)
[ 1.889670] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28000884 XER: 20040000
[ 1.889742] CFAR: c000000000016e1c SOFTE: 1
[ 1.889742] GPR00: c00000000038a914 c000003fed193ac0 c0000000016eae00 0000000000000001
[ 1.889742] GPR04: c000003fd754c7f8 000000000000002c 0000000000000001 000000000000002b
[ 1.889742] GPR08: 678e549df9e28874 0000000000000000 0000000000000000 fffffffffffffffe
[ 1.889742] GPR12: 0000000028000888 c00000000fa82100 c00000000000d3b8 0000000000000000
[ 1.889742] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1.889742] GPR20: 0000000000000000 0000000000000000 0000000000000000 a78e54a22eb64f8c
[ 1.889742] GPR24: c000003fd754c800 678e549df9e2878c 0000000000000300 c0000000002bd05c
[ 1.889742] GPR28: c000003fed01ea00 00000000014080c0 c000003fd754c800 c000003fed01ea00
[ 1.890286] NIP [c00000000038aa30] kmem_cache_alloc_trace+0x2d0/0x330
[ 1.890340] LR [c00000000038aa1c] kmem_cache_alloc_trace+0x2bc/0x330
[ 1.890391] Call Trace:
[ 1.890416] [c000003fed193ac0] [c00000000038a914] kmem_cache_alloc_trace+0x1b4/0x330 (unreliable)
[ 1.890491] [c000003fed193b30] [c0000000002bd05c] pmu_dev_alloc+0x3c/0x170
[ 1.890547] [c000003fed193bb0] [c0000000010e3210] perf_event_sysfs_init+0x8c/0xf0
[ 1.890611] [c000003fed193c40] [c00000000000d144] do_one_initcall+0x64/0x1d0
[ 1.890676] [c000003fed193d00] [c0000000010b4400] kernel_init_freeable+0x280/0x374
[ 1.890740] [c000003fed193dc0] [c00000000000d3d4] kernel_init+0x24/0x160
[ 1.890795] [c000003fed193e30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4
[ 1.890857] Instruction dump:
[ 1.890909] 7c97ba78 fb210038 38a50001 7f19ba78 fb290000 f8aa0000 4bc8c3f1 60000000
[ 1.890978] 7fb8b840 419e0028 e93f0022 e91f0140 <7d59482a> 7d394a14 7d4a4278 7fa95040
[ 1.891050] ---[ end trace 41b3fe7a827f3888 ]---
[ 2.900027]
[ 3.900175] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 3.900175]
[ 4.718682] Rebooting in 10 seconds..
[ 175.340944355,5] OPAL: Reboot request...
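
As a reading aid for the oops above: the faulting instruction is the word in angle brackets in the instruction dump. The Python sketch below splits that word into Power ISA X-form fields; the function name is hypothetical and the register values are copied from the GPR dump. The fields come out as primary opcode 31 with extended opcode 21, which is `ldx`, here loading into r10 from the address r25 + r9.

```python
def decode_xform(word):
    """Split a 32-bit Power ISA X-form instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # primary opcode, top 6 bits
        "rt": (word >> 21) & 0x1F,      # target register
        "ra": (word >> 16) & 0x1F,      # base register
        "rb": (word >> 11) & 0x1F,      # index register
        "xo": (word >> 1) & 0x3FF,      # extended opcode
    }


# Word marked <7d59482a> in the instruction dump.
fields = decode_xform(0x7D59482A)

# Register values taken from the GPR dump in the oops.
r25 = 0x678E549DF9E2878C
r9 = 0x0000000000000000

# Effective address of an indexed load: (RA) + (RB), truncated to 64 bits.
effective_address = (r25 + r9) & 0xFFFFFFFFFFFFFFFF
```

The computed effective address equals the address in the "Unable to handle kernel paging request" line, so the load inside kmem_cache_alloc_trace was following a pointer that does not look like a valid kernel address.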

This fix is needed to resolve the crash:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-May/172835.html


bugproxy (bugproxy) on 2018-05-12
tags: added: architecture-ppc64le bugnameltc-167482 severity-critical targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g

------- Comment From <email address hidden> 2018-05-14 13:11 EDT-------
Tested this issue by applying the above patch on top of current bionic source, and it worked fine.

tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Undecided → Critical
status: New → Triaged
Changed in linux (Ubuntu Bionic):
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Joseph Salisbury (jsalisbury)
status: Triaged → In Progress
Changed in linux (Ubuntu):
status: Triaged → In Progress
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with the requested patch. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1770849

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-image-unsigned, linux-modules and linux-modules-extra .deb packages.
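
The two rules above can be sketched as a small Python helper. This is only an illustration of the rule as stated in this comment; the function name is hypothetical, and the returned strings are the package name prefixes mentioned above, not full .deb file names.

```python
def packages_for_test_kernel(version):
    """Return the package name prefixes to install for a test kernel version."""
    # Compare only the major and minor parts, e.g. "4.15.0-23" -> (4, 15).
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) < (4, 15):
        return ["linux-image", "linux-image-extra"]
    return ["linux-image-unsigned", "linux-modules", "linux-modules-extra"]


bionic_packages = packages_for_test_kernel("4.15.0-23")
older_packages = packages_for_test_kernel("4.13.0-45")
```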

Thanks in advance!

Changed in ubuntu-power-systems:
status: Triaged → In Progress
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-15 02:41 EDT-------
Tested the above kernel and it worked fine with degraded cores. No kernel crashes were seen. Thanks.

Joseph Salisbury (jsalisbury) wrote :
description: updated
Stefan Bader (smb) on 2018-05-23
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!
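
The tag change requested above can be sketched in Python as follows. This is a hypothetical helper that models only the tag swap described in this comment; it does not use or represent any Launchpad API.

```python
def record_verification(tags, series, passed):
    """Swap the verification-needed tag for a done or failed tag."""
    needed = f"verification-needed-{series}"
    outcome = f"verification-{'done' if passed else 'failed'}-{series}"
    updated = set(tags)
    # Only swap when the bug is actually awaiting verification.
    if needed in updated:
        updated.remove(needed)
        updated.add(outcome)
    return sorted(updated)


tags_after_pass = record_verification(
    ["architecture-ppc64le", "verification-needed-bionic"], "bionic", passed=True
)
tags_after_fail = record_verification(
    ["architecture-ppc64le", "verification-needed-bionic"], "bionic", passed=False
)
```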

tags: added: verification-needed-bionic
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-25 02:02 EDT-------
Tested the proposed kernel and did not see the issue; the tests ran fine.

linux-generic/bionic-proposed 4.15.0.23.24 ppc64el [upgradable from: 4.15.0.20.23]
linux-headers-generic/bionic-proposed 4.15.0.23.24 ppc64el [upgradable from: 4.15.0.20.23]
linux-image-generic/bionic-proposed 4.15.0.23.24 ppc64el [upgradable from: 4.15.0.20.23]

uname -a
Linux ltc-boston125 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 17:59:00 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

tags: added: verification-done-bionic
removed: verification-needed-bionic
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-23.25

---------------
linux (4.15.0-23.25) bionic; urgency=medium

  * linux: 4.15.0-23.25 -proposed tracker (LP: #1772927)

  * arm64 SDEI support needs trampoline code for KPTI (LP: #1768630)
    - arm64: mmu: add the entry trampolines start/end section markers into
      sections.h
    - arm64: sdei: Add trampoline code for remapping the kernel

  * Some PCIe errors not surfaced through rasdaemon (LP: #1769730)
    - ACPI: APEI: handle PCIe AER errors in separate function
    - ACPI: APEI: call into AER handling regardless of severity

  * qla2xxx: Fix page fault at kmem_cache_alloc_node() (LP: #1770003)
    - scsi: qla2xxx: Fix session cleanup for N2N
    - scsi: qla2xxx: Remove unused argument from qlt_schedule_sess_for_deletion()
    - scsi: qla2xxx: Serialize session deletion by using work_lock
    - scsi: qla2xxx: Serialize session free in qlt_free_session_done
    - scsi: qla2xxx: Don't call dma_free_coherent with IRQ disabled.
    - scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
    - scsi: qla2xxx: Prevent relogin trigger from sending too many commands
    - scsi: qla2xxx: Fix double free bug after firmware timeout
    - scsi: qla2xxx: Fixup locking for session deletion

  * Several hisi_sas bug fixes (LP: #1768974)
    - scsi: hisi_sas: dt-bindings: add an property of signal attenuation
    - scsi: hisi_sas: support the property of signal attenuation for v2 hw
    - scsi: hisi_sas: fix the issue of link rate inconsistency
    - scsi: hisi_sas: fix the issue of setting linkrate register
    - scsi: hisi_sas: increase timer expire of internal abort task
    - scsi: hisi_sas: remove unused variable hisi_sas_devices.running_req
    - scsi: hisi_sas: fix return value of hisi_sas_task_prep()
    - scsi: hisi_sas: Code cleanup and minor bug fixes

  * [bionic] machine stuck and bonding not working well when nvmet_rdma module
    is loaded (LP: #1764982)
    - nvmet-rdma: Don't flush system_wq by default during remove_one
    - nvme-rdma: Don't flush delete_wq by default during remove_one

  * Warnings/hang during error handling of SATA disks on SAS controller
    (LP: #1768971)
    - scsi: libsas: defer ata device eh commands to libata

  * Hotplugging a SATA disk into a SAS controller may cause crash (LP: #1768948)
    - ata: do not schedule hot plug if it is a sas host

  * ISST-LTE:pKVM:Ubuntu1804: rcu_sched self-detected stall on CPU follow by CPU
    ATTEMPT TO RE-ENTER FIRMWARE! (LP: #1767927)
    - powerpc/powernv: Handle unknown OPAL errors in opal_nvram_write()
    - powerpc/64s: return more carefully from sreset NMI
    - powerpc/64s: sreset panic if there is no debugger or crash dump handlers

  * fsnotify: Fix fsnotify_mark_connector race (LP: #1765564)
    - fsnotify: Fix fsnotify_mark_connector race

  * Hang on network interface removal in Xen virtual machine (LP: #1771620)
    - xen-netfront: Fix hang on device removal

  * HiSilicon HNS NIC names are truncated in /proc/interrupts (LP: #1765977)
    - net: hns: Avoid action name truncation

  * Ubuntu 18.04 kernel crashed while in degraded mode (LP: #1770849)
    - SAUCE: powerpc/perf: Fix memory allocation for...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-06-20 17:20 EDT-------
@pridhiviraj Hello Pridhiviraj, can this issue be closed now?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-06-21 01:37 EDT-------
(In reply to comment #21)
>
> @pridhiviraj Hello Pridhiviraj, can this issue be closed now?

Yes, this was tested with the 4.15.0-24-generic kernel, so we can close the issue.
