Ubuntu 14.04.3 LTS Crash in notifier_call_chain after boot

Bug #1487085 reported by bugproxy
This bug affects 1 person
Affects         Status        Importance  Assigned to    Milestone
linux (Ubuntu)  Fix Released  Undecided   Unassigned
Vivid           Fix Released  High        Chris J Arges

Bug Description

SRU Justification:
[Impact]
Users of 3.19 kernel with power8 machines get a kernel crash on boot.

[Test Case]
Boot system.

[Fix]
commit 792f96e9a769b799a2944e9369e4ea1e467135b2 needed to be backported in addition to d7cf83fcaf1b1668201eae4cdd6e6fe7a2448654. Our 3.19 kernel had a partial backport of the first patch.

--

---Problem Description---
Installed Ubuntu 14.04.3 LTS on Palmetto and it crashes after booting to the login prompt.
This happens every time I boot Ubuntu 14.04.3 LTS. I have reinstalled Ubuntu, and I have also replaced the hard disk and reinstalled. It still crashes.

---uname output---
Linux paul40 3.19.0-26-generic #28~14.04.1-Ubuntu SMP Wed Aug 12 14:10:52 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = Palmetto

---System Hang---
 Ubuntu crashes and the host cannot be accessed. The system must be rebooted.

---Steps to Reproduce---
 Boot system

Oops output:
    [ 33.132376] Unable to handle kernel paging request for data at address 0x200000000000000
    [ 33.132565] Faulting instruction address: 0xc0000000000dbc60
    [ 33.133422] Oops: Kernel access of bad area, sig: 11 [#1]
    [ 33.134410] SMP NR_CPUS=2048 NUMA PowerNV
    [ 33.134478] Modules linked in: ast ttm drm_kms_helper joydev mac_hid drm hid_generic usbhid hid syscopyarea sysfillrect sysimgblt i2c_algo_bit ofpart cmdlinepart at24 uio_pdrv_genirq powernv_flash mtd ipmi_powernv powernv_rng opal_prd ipmi_msghandler uio uas usb_storage ahci libahci
    [ 33.139112] CPU: 24 PID: 0 Comm: swapper/24 Not tainted 3.19.0-26-generic #28~14.04.1-Ubuntu
    [ 33.139943] task: c0000000013cccb0 ti: c000000fff700000 task.ti: c000000001448000
    [ 33.141642] NIP: c0000000000dbc60 LR: c0000000000dbd94 CTR: 0000000000000000
    [ 33.142605] REGS: c000000fff703980 TRAP: 0300 Not tainted (3.19.0-26-generic)
    [ 33.143417] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28002888 XER: 00000000
    [ 33.144244] CFAR: c000000000008468 DAR: 0200000000000000 DSISR: 40000000 SOFTE: 0
    GPR00: c0000000000dbd94 c000000fff703c00 c00000000144cc00 c0000000015f03c0
    GPR04: 0000000000000007 c0000000015f03b8 ffffffffffffffff 0000000000000000
    GPR08: 0000000000000000 0200000000000000 c00000000006c394 9000000000001003
    GPR12: 0000000000002200 c00000000fb8d800 0000000000000058 0000000000000000
    GPR16: c000000001448000 c000000001448000 c000000001448080 c000000000e9a880
    GPR20: c000000001448080 0000000000000001 0000000000000002 0000000000000012
    GPR24: c000000f1e432200 0000000000000000 0000000000000000 c0000000015f03b8
    GPR28: 0000000000000007 0000000000000000 c0000000015f03c0 ffffffffffffffff
    [ 33.157013] NIP [c0000000000dbc60] notifier_call_chain+0x70/0x100
    [ 33.157818] LR [c0000000000dbd94] atomic_notifier_call_chain+0x44/0x60
    [ 33.162090] Call Trace:
    [ 33.162845] [c000000fff703c00] [0000000000000008] 0x8 (unreliable)
    [ 33.163644] [c000000fff703c50] [c0000000000dbd94] atomic_notifier_call_chain+0x44/0x60
    [ 33.164647] [c000000fff703c90] [c00000000006f2a8] opal_message_notify+0xa8/0x100
    [ 33.165476] [c000000fff703d00] [c0000000000dbc88] notifier_call_chain+0x98/0x100
    [ 33.167007] [c000000fff703d50] [c0000000000dbd94] atomic_notifier_call_chain+0x44/0x60
    [ 33.167816] [c000000fff703d90] [c00000000006f654] opal_do_notifier.part.5+0x74/0xa0
    [ 33.172166] [c000000fff703dd0] [c00000000006f6d8] opal_interrupt+0x58/0x70
    [ 33.172997] [c000000fff703e10] [c0000000001273d0] handle_irq_event_percpu+0x90/0x2b0
    [ 33.174507] [c000000fff703ed0] [c000000000127658] handle_irq_event+0x68/0xd0
    [ 33.175312] [c000000fff703f00] [c00000000012baf4] handle_fasteoi_irq+0xe4/0x240
    [ 33.176124] [c000000fff703f30] [c0000000001265c8] generic_handle_irq+0x58/0x90
    [ 33.176936] [c000000fff703f60] [c000000000010f10] __do_irq+0x80/0x190
    [ 33.182406] [c000000fff703f90] [c00000000002476c] call_do_irq+0x14/0x24
    [ 33.183258] [c00000000144ba30] [c0000000000110c0] do_IRQ+0xa0/0x120
    [ 33.184072] [c00000000144ba90] [c0000000000025d8] hardware_interrupt_common+0x158/0x180
    [ 33.184907] --- interrupt: 501 at arch_local_irq_restore+0x5c/0x90
    [ 33.184907] LR = arch_local_irq_restore+0x40/0x90
    [ 33.186473] [c00000000144bd80] [c000000f2ae19808] 0xc000000f2ae19808 (unreliable)
    [ 33.188024] [c00000000144bda0] [c00000000085d5d8] cpuidle_enter_state+0xa8/0x260
    [ 33.192695] [c00000000144be00] [c000000000108be8] cpu_startup_entry+0x488/0x4e0
    [ 33.193543] [c00000000144bee0] [c00000000000bdb4] rest_init+0xa4/0xc0
    [ 33.194327] [c00000000144bf00] [c000000000da3e80] start_kernel+0x53c/0x558
    [ 33.195084] [c00000000144bf90] [c000000000008c6c] start_here_common+0x20/0xa8
    [ 33.196569] Instruction dump:
    [ 33.196619] 7cfd3b78 60000000 60000000 e93e0000 2fa90000 419e00a4 2fbf0000 419e009c
    [ 33.197605] 2e3d0000 60000000 60000000 60420000 <e9490000> ebc90008 7d234b78 7f84e378
    [ 33.202763] ---[ end trace 71076895a9f126ba ]---
    [ 33.202836]
    [ 35.203605] Kernel panic - not syncing: Fatal exception in interrupt
    [ 35.203727] drm_kms_helper: panic occurred, switching back to text console
    [ 35.204692] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Ah! This is due to a notifier chain array overflow while handling an OPAL message. The upstream commit 792f96e fixes this issue, but that commit has only been partially applied to the Ubuntu 14.04.3 kernel sources, which is why you are seeing this crash.

commit 792f96e9a769b799a2944e9369e4ea1e467135b2
Author: Neelesh Gupta <email address hidden>
Date: Wed Feb 11 11:57:06 2015 +0530

    powerpc/powernv: Fix the overflow of OPAL message notifiers head array

    Fixes the condition check of incoming message type which can
    otherwise shoot beyond the message notifiers head array.

    Signed-off-by: Neelesh Gupta <email address hidden>
    Reviewed-by: Vasant Hegde <email address hidden>
    Reviewed-by: Anshuman Khandual <email address hidden>
    Signed-off-by: Benjamin Herrenschmidt <email address hidden>

Below is the hunk from the above commit that is missing from Ubuntu 14.04.3:
------------------------------------------------
@@ -354,7 +350,7 @@ static void opal_handle_message(void)
        type = be32_to_cpu(msg.msg_type);

        /* Sanity check */
-       if (type > OPAL_MSG_TYPE_MAX) {
+       if (type >= OPAL_MSG_TYPE_MAX) {
                pr_warning("%s: Unknown message type: %u\n", __func__, type);
                return;
        }
------------------------------------------------

I just checked: the above hunk applies cleanly to the Ubuntu 14.04.3 kernel sources. We should mirror this bug to Ubuntu and ask them to apply it.

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-129216 severity-high targetmilestone-inin14043
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1487085/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Luciano Chavez (lnx1138)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: nobody → Taco Screen team (taco-screen-team)
Chris J Arges (arges)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Chris J Arges (arges)
importance: Undecided → High
status: New → In Progress
Chris J Arges (arges)
Changed in linux (Ubuntu):
assignee: Chris J Arges (arges) → nobody
Changed in linux (Ubuntu Vivid):
assignee: nobody → Chris J Arges (arges)
importance: Undecided → High
status: New → In Progress
Changed in linux (Ubuntu):
status: In Progress → New
importance: High → Undecided
description: updated
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2015-08-24 06:19 EDT-------
*** Bug 129349 has been marked as a duplicate of this bug. ***

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-08-25 01:38 EDT-------
A patched kernel was tested and found to resolve the panic.

bugproxy (bugproxy)
tags: added: severity-critical
removed: severity-high
Brad Figg (brad-figg)
Changed in linux (Ubuntu Vivid):
status: In Progress → Fix Committed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-08-25 20:52 EDT-------
*** Bug 129441 has been marked as a duplicate of this bug. ***

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-08-26 16:32 EDT-------
*** Bug 129458 has been marked as a duplicate of this bug. ***

------- Comment From <email address hidden> 2015-08-26 16:40 EDT-------
Hi Chris,

This problem is starting to become prevalent among the testers. Any outlook as to when this fix may get released? Thanks.

Tim Gardner (timg-tpi) wrote :

'powerpc/powernv: Fix the overflow of OPAL message notifiers head array' has been applied to Vivid and will be released in due course in version 3.19.0-28.30

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-08-27 18:27 EDT-------
*** Bug 129396 has been marked as a duplicate of this bug. ***

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-08-27 18:34 EDT-------
*** Bug 129413 has been marked as a duplicate of this bug. ***

------- Comment From <email address hidden> 2015-08-27 18:40 EDT-------
*** Bug 129413 has been marked as a duplicate of this bug. ***

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-08-27 18:42 EDT-------
*** Bug 129437 has been marked as a duplicate of this bug. ***

Breno Leitão (breno-leitao) wrote :

The fix will probably be released on Sep 15th.

bugproxy (bugproxy)
tags: removed: bot-comment bugnameltc-129216 severity-critical
bugproxy (bugproxy)
tags: added: bugnameltc-129216 severity-critical
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-09-03 16:44 EDT-------
*** Bug 129843 has been marked as a duplicate of this bug. ***

Breno Leitão (breno-leitao) wrote :

I understand that this fix didn't make linux-image-3.19.0-28-generic, right?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-09-10 06:10 EDT-------
*** Bug 128635 has been marked as a duplicate of this bug. ***

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-09-11 16:40 EDT-------
*** Bug 130247 has been marked as a duplicate of this bug. ***

Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-vivid
Breno Leitão (breno-leitao) wrote :

Brad,

In which kernel is the fix available? Is it version 3.19.0-29.31?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-09-15 21:12 EDT-------
*** Bug 130514 has been marked as a duplicate of this bug. ***

Luis Henriques (henrix) wrote :

Breno: yes, 3.19.0-29.31 is the kernel that contains the fix for this bug and that is currently in -proposed.

tags: added: verification-done-vivid
removed: verification-needed-vivid
bugproxy (bugproxy)
tags: added: verification-needed-vivid
removed: verification-done-vivid
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-09-16 22:24 EDT-------
*** Bug 130573 has been marked as a duplicate of this bug. ***

Breno Leitão (breno-leitao) wrote :

Tested internally at IBM. Thanks!

tags: added: verification-vivid
removed: verification-needed-vivid
bugproxy (bugproxy)
tags: added: verification-done-vivid
removed: verification-vivid
Chris J Arges (arges)
Changed in linux (Ubuntu):
status: New → Fix Released
Breno Leitão (breno-leitao) wrote :

 The fix for this bug should be included in the next kernel SRU which is scheduled for 9/26.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-09-25 01:02 EDT-------
*** Bug 131027 has been marked as a duplicate of this bug. ***

Breno Leitão (breno-leitao) wrote :

I understand that this fix didn't make the 3.19 SRU kernel, and might be released in the next SRU cycle (in 3 weeks).

Breno Leitão (breno-leitao) wrote :

I just talked to Tim, and the kernel with this fix should be promoted to the -updates archive in a few days.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-30.33

---------------
linux (3.19.0-30.33) vivid; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1498065
  * Revert "[Config]
    MFD_INTEL_LPSS/MFD_INTEL_LPSS_ACPI/MFD_INTEL_LPSS_PCI=m"
    - LP: #1498137
  * [Config] Disable the MFD_INTEL_LPSS* driver

linux (3.19.0-30.32) vivid; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1498065

  [ Upstream Kernel Changes ]

  * net: Fix skb_set_peeked use-after-free bug
    - LP: #1497184

linux (3.19.0-29.31) vivid; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1493902

  [ Ander Conselvan de Oliveira ]

  * SAUCE: i915_bpo: Set ddi_pll_sel in DP MST path
    - LP: #1483320

  [ Chris J Arges ]

  * [Config] DEFAULT_IOSCHED="deadline" for ppc64el
    - LP: #1469829

  [ Chris Wilson ]

  * SAUCE: i915_bpo: drm/i915: Flag the execlists context object as dirty
    after every use
    - LP: #1489501

  [ Daniel Vetter ]

  * SAUCE: i915_bpo: drm/i915: Only dither on 6bpc panels
    - LP: #1489501

  [ David Henningsson ]

  * SAUCE: drm/i915: Add audio pin sense / ELD callback
    - LP: #1490895
  * SAUCE: drm/i915: Call audio pin/ELD notify function
    - LP: #1490895
  * SAUCE: ubuntu/i915: Call audio pin/ELD notify function
    - LP: #1490895
  * SAUCE: ALSA: hda - Add "hdac_acomp" global variable
    - LP: #1490895
  * SAUCE: ALSA: hda - allow codecs to access the i915 pin/ELD callback
    - LP: #1490895
  * SAUCE: ALSA: hda - Wake the codec up on pin/ELD notify events
    - LP: #1490895

  [ Jani Nikula ]

  * SAUCE: i915_bpo: Revert "drm/i915: Allow parsing of variable size child
    device entries from VBT"
    - LP: #1489501

  [ Maarten Lankhorst ]

  * SAUCE: i915_bpo: drm/i915: calculate primary visibility changes instead
    of calling from set_config
    - LP: #1489501
  * SAUCE: i915_bpo: drm/i915: Commit planes on each crtc separately.
    - LP: #1489501

  [ Thulasimani,Sivakumar ]

  * SAUCE: i915_bpo: Revert "drm/i915: Add eDP intermediate frequencies for
    CHV"
    - LP: #1489501
  * SAUCE: i915_bpo: drm/i915: remove HBR2 from chv supported list
    - LP: #1489501
  * SAUCE: i915_bpo: drm/i915: Avoid TP3 on CHV
    - LP: #1489501

  [ Timo Aaltonen ]

  * Revert "SAUCE: i915_bpo: drm/i915: Allow parsing of variable size child
    device entries from VBT, addendum v2"
    - LP: #1489501
  * SAUCE: Migrate Broadwell to i915_bpo.
    - LP: #1483320

  [ Upstream Kernel Changes ]

  * tcp: fix recv with flags MSG_WAITALL | MSG_PEEK
    - LP: #1486146
  * powerpc/powernv: Fix the overflow of OPAL message notifiers head array
    - LP: #1487085
  * xhci: call BIOS workaround to enable runtime suspend on Intel Braswell
    - LP: #1489292
  * PM / QoS: Make it possible to expose device latency tolerance to
    userspace
    - LP: #1488395
  * ACPI / PM: Attach ACPI power domain only once
    - LP: #1488395
  * Driver core: wakeup the parent device before trying probe
    - LP: #1488395
  * klist: implement klist_prev()
    - LP: #1488395
  * driver core: implement device_for_each_child_reverse()
    - LP: #1488395
  * mfd: make mfd_remove_devices() iterate in reverse order
    ...

Changed in linux (Ubuntu Vivid):
status: Fix Committed → Fix Released