Trigger a checkstop on unrecoverable MCE/HMI errors to inform BMC/OCC about the error.

Bug #1482343 reported by bugproxy on 2015-08-06
This bug affects 1 person
Affects          Status  Importance  Assigned to  Milestone
linux (Ubuntu)           Undecided   Tim Gardner
Vivid                    Undecided   Tim Gardner
Wily                     Undecided   Tim Gardner

Bug Description

The current implementation of the machine check (MCE) handler and the HMI handler in Linux goes down the kernel panic path for unrecoverable errors. On FSP-based systems, the FSP is also notified of these errors and forwards them to PRD (which runs on the FSP) for error analysis and gard record creation.

On OpenPower (BMC-based systems, e.g. Habanero from TYAN), where PRD runs in the Linux host, PRD never gets a chance to do error analysis at the time of the Linux crash, and no gard record is created for such errors. Since the faulty component is never de-configured, the system remains vulnerable to being hit by the same HW error again.

To fix this issue, a new OPAL call, 'opal_cec_reboot2()', has been introduced to trigger a checkstop on BMC-based systems and inform the BMC/OCC about the error, so that the BMC can collect relevant data for error analysis and decide which component to de-configure before rebooting. The Linux kernel should invoke this OPAL call for unrecoverable MCE and HMI errors before calling kernel panic, so that the OCC is informed about the error.
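The intended ordering can be sketched with a toy model. This is plain Python, not the kernel patch: the `Firmware` class, the `KernelPanic` exception, the `handle_unrecoverable_error` helper and the numeric constants are all hypothetical stand-ins, and the fallback to panic when the firmware does not support the call is an assumption made for illustration.

```python
from dataclasses import dataclass, field

# Placeholder values chosen for this sketch only (assumptions).
OPAL_SUCCESS = 0
OPAL_UNSUPPORTED = -1
REBOOT_PLATFORM_ERROR = 1


class KernelPanic(Exception):
    """Stands in for the kernel panic path."""


@dataclass
class Firmware:
    """Toy model of the firmware side; not the real OPAL interface."""
    supports_reboot2: bool
    events: list = field(default_factory=list)

    def opal_cec_reboot2(self, reboot_type, message):
        if not self.supports_reboot2:
            return OPAL_UNSUPPORTED
        # On real hardware the checkstop means the call does not return.
        self.events.append(("checkstop", reboot_type, message))
        return OPAL_SUCCESS


def handle_unrecoverable_error(fw, message):
    """Ask firmware for a checkstop first; panic only if that fails."""
    rc = fw.opal_cec_reboot2(REBOOT_PLATFORM_ERROR, message)
    if rc == OPAL_SUCCESS:
        return "checkstop"
    raise KernelPanic(message)
```

With `Firmware(supports_reboot2=True)` the model records a checkstop event, which is the point at which the BMC/OCC would learn about the error; with `supports_reboot2=False` it behaves like the old code and panics.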

The kernel changes have already been posted upstream and are listed below:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-May/128341.html
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-May/128342.html
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132045.html
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132114.html

The above patches need to be included in Ubuntu 14.04.3+.

We will update this bug with commit ids, once the above patches are accepted upstream.

Contact Information = <email address hidden>

---uname output---
Linux rcx2d403 3.19.0-26-generic #27 SMP Tue Aug 4 01:38:15 CDT 2015 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
Habanero pass2 system

Machine Type = OpenPower, Habanero

---System Hang---
If the system is hung, it can be recovered by sending IPMI power off/on commands:
$ ipmitool -H <BMC> -I lanplus -U <user> -P <passwd> power off
$ ipmitool -H <BMC> -I lanplus -U <user> -P <passwd> power on
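
As a sketch only, the same two commands can be assembled from a script. The helper name `power_cycle_commands` and its parameters are hypothetical; the ipmitool arguments are exactly the ones shown above. The function only builds the argument lists and does not contact a BMC.

```python
def power_cycle_commands(bmc, user, password):
    """Return the 'power off' and 'power on' ipmitool command lines, in order."""
    base = ["ipmitool", "-H", bmc, "-I", "lanplus", "-U", user, "-P", password]
    return [base + ["power", "off"], base + ["power", "on"]]
```

Each returned list could then be passed to `subprocess.run(cmd, check=True)`, with a pause between the two calls if the BMC needs time to complete the power-off.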

bugproxy (bugproxy) on 2015-08-06
tags: added: architecture-ppc64le bugnameltc-128601 severity-high targetmilestone-inin---

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1482343/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Jeremy Kerr (jk-ozlabs) on 2015-08-10
affects: ubuntu → linux (Ubuntu)
Tim Gardner (timg-tpi) on 2015-08-11
Changed in linux (Ubuntu Wily):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Tim Gardner (timg-tpi) wrote :

Patches applied for Wily. Let's wait on an SRU for Vivid/Trusty until they've been merged in 4.3.

Changed in linux (Ubuntu Wily):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.2.0-7.7

---------------
linux (4.2.0-7.7) wily; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1490564
  * rebase to v4.2

  [ Wen Xiong ]

  * SAUCE: ipr: Byte swapping for device_id attribute in sysfs
    - LP: #1453892

  [ Upstream Kernel Changes ]

  * rebase to v4.2
    - LP: #1487345

 -- Tim Gardner <email address hidden> Wed, 26 Aug 2015 07:06:10 -0600

Changed in linux (Ubuntu Wily):
status: Fix Committed → Fix Released
Breno Leitão (breno-leitao) wrote :

Hi Tim,

We would like to have this targeted for the 14.04 SRU as well. Is that possible?

Tim Gardner (timg-tpi) on 2015-10-06
Changed in linux (Ubuntu Vivid):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Breno Leitão (breno-leitao) wrote :

Tim,

I saw that the patches were acked by "Seth Forshee". Were they committed? Is there an expected version in which they will be released?

Tim Gardner (timg-tpi) wrote :

Applied and in the pipeline for UBUNTU: Ubuntu-3.19.0-32.37

Changed in linux (Ubuntu Vivid):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-vivid
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-27 15:41 EDT-------
I just verified that the issue is fixed in the Ubuntu-3.19.0-32.37 kernel version.

------------------------------------------------------------------------------------
Ubuntu 14.04.3 LTS ltc-fire14 hvc0

ltc-fire14 login: root
Password:
Last login: Tue Oct 27 10:11:22 CDT 2015 on hvc0
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.19.0-32-generic ppc64le)

* Documentation: https://help.ubuntu.com/
root@ltc-fire14:~# uname -a
Linux ltc-fire14 3.19.0-32-generic #37-Ubuntu SMP Wed Oct 21 10:22:35 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
root@ltc-fire14:~# cd /home/workload_scripts/
root@ltc-fire14:/home/workload_scripts# ls
find_work.sh run_workload.sh
root@ltc-fire14:/home/workload_scripts# ./run_workload.sh
root@ltc-fire14:/home/workload_scripts# getscom -l
Chip ID | Rev | Chip type
---------|-------|--------
80000085 | DD2.0 | Centaur memory buffer
80000084 | DD2.0 | Centaur memory buffer
80000005 | DD2.0 | Centaur memory buffer
80000004 | DD2.0 | Centaur memory buffer
00000008 | DD2.0 | P8 (Venice) processor
00000000 | DD2.0 | P8 (Venice) processor
root@ltc-fire14:/home/workload_scripts# getscom -c 0x0 11013100
0
root@ltc-fire14:/home/workload_scripts# getscom -c 0x0 11013106
15a20c688a448b01
root@ltc-fire14:/home/workload_scripts# getscom -c 0x0 11013107
ea5c139705980000
root@ltc-fire14:/home/workload_scripts# putscom -c 0x0 11013107 fa5c139705980000
fa5c139705980000
root@ltc-fire14:/home/workload_scripts# getscom -c 0x0 11013107
fa5c139705980000
root@ltc-fire14:/home/workload_scripts# putscom -c 0x0 11013100 1000000000000000
[ 333.045651] Fatal Hypervisor Maintenance interrupt [Not recovered]
[ 333.045916] Error detail: Malfunction Alert
[ 333.046288] HMER: 8040000000000000
[ 333.046543] CPU PIR: 00000000
[ 333.046601] [Unit: IFU] RegFile core check stop
[ 333.046778] [Unit: PC ] Debug Trigger Error inject
1000000000000008[ 333.046883] F
[194049345926,0] OPAL: Reboot requested due to Platform error.at[194049767279,3] OPAL: Reboot requested due to Platform error.al 1.69405|ERRL|Dumping errors reported prior to registration
3.46924|Ignoring boot flags, incorrect version 0x0
3.70396|ISTEP 6. 3
4.14478|ISTEP 6. 4
4.14531|ISTEP 6. 5
10.54385|HWAS|PRESENT> DIMM[03]=00000000AAAAAAAA
10.54386|HWAS|PRESENT> Membuf[04]=0C0C000000000000
10.54387|HWAS|PRESENT> Proc[05]=C000000000000000
23.49515|ISTEP 6. 6
[...]
------------------------------------------------------------------------------------
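
For readers following the scom values in the log above, the arithmetic can be checked with a short sketch. It only compares the hex values that appear in the session; it does not assume what registers 11013100 and 11013107 mean. The helper `msb0_bits` is hypothetical and uses MSB-0 numbering (bit 0 is the most significant bit).

```python
# Values copied from the getscom/putscom session above.
before = 0xEA5C139705980000   # 11013107 as first read
after = 0xFA5C139705980000    # 11013107 after putscom
trigger = 0x1000000000000000  # value written to 11013100


def msb0_bits(value, width=64):
    """List the set bits of value, counting from the most significant bit."""
    return [i for i in range(width) if (value >> (width - 1 - i)) & 1]


changed = before ^ after
print(f"{changed:016x}")   # bits that differ between the two reads
print(msb0_bits(changed))  # positions of those bits, MSB-0
```

The only bit that differs between the two reads of 11013107 is the same single bit that is then written to 11013100, immediately before the "Fatal Hypervisor Maintenance interrupt" message appears.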

tags: added: verification-done-vivid
removed: verification-needed-vivid
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-32.37

---------------
linux (3.19.0-32.37) vivid; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1508381

  [ Joseph Salisbury ]

  * SAUCE: storvsc: use small sg_tablesize on x86
    - LP: #1495983

  [ Phidias Chiang ]

  * SAUCE: dma: dw_dmac: Workaround for stop probing on HP X360 laptop v2
    - LP: #1501580

  [ Tim Gardner ]

  * [Config] Add MMC modules sufficient for net booting
    - LP: #1502772

  [ Upstream Kernel Changes ]

  * USB: whiteheat: fix potential null-deref at probe
    - LP: #1478826
    - CVE-2015-5257
  * dcache: Handle escaped paths in prepend_path
    - LP: #1441108
    - CVE-2015-2925
  * vfs: Test for and handle paths that are unreachable from their mnt_root
    - LP: #1441108
    - CVE-2015-2925
  * hv_netvsc: Add support to set MTU reservation from guest side
    - LP: #1494431
  * hv_netvsc: Add close of RNDIS filter into change mtu call
    - LP: #1494431
  * powerpc/eeh: Fix missed PE#0 on P7IOC
    - LP: #1502982
  * powerpc/powernv: display reason for Malfunction Alert HMI.
    - LP: #1482343
  * powerpc/powernv: Pull all HMI events before panic.
    - LP: #1482343
  * powerpc/powernv: Invoke opal_cec_reboot2() on unrecoverable machine
    check errors.
    - LP: #1482343
  * powerpc/powernv: Invoke opal_cec_reboot2() on unrecoverable HMI.
    - LP: #1482343
  * powerpc/eeh: Fix PE#0 check in eeh_add_to_parent_pe()
    - LP: #1502982
  * HID: i2c-hid: The interrupt should be level sensitive v2
    - LP: #1501187
  * HID: i2c-hid: Add support for ACPI GPIO interrupts v2
    - LP: #1501187

 -- Luis Henriques <email address hidden> Wed, 21 Oct 2015 10:30:13 +0100

Changed in linux (Ubuntu Vivid):
status: Fix Committed → Fix Released
bugproxy (bugproxy) on 2015-11-23
tags: added: targetmilestone-inin14043
removed: targetmilestone-inin---