:Ubuntu NV: Panic timeout=0 means Ubuntu does not reboot and recover from HMI (e.g. Core Unit Checkstop)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Invalid | Medium | Chris J Arges |
Bug Description
Problem Description
=======
I attempted to inject a Core Unit Checkstop error by flipping Core FIR bit 5 on a K80 (Nvidia) 42L server (FSP - gp4fp1.
However, on injecting the error, the host (Ubuntu NV) crashed and never recovered. OPAL, however, seemed to stay up, and the FSP logged a plethora of mailbox errors (B182953C).
Error Inject
--------------
$ putscom pu.ex 10013100 5 1 1 -ib -p3 -c5
s1.ex k0:n0:s0:p03:c5
ecmd_ppc putscom pu.ex 10013100 5 1 1 -ib -p3 -c5
Console message
--------------
[htx@gp4p01] /sys/
[ 1652.976332] Error detail: Malfunction Alert
[ 1652.976402] HMER: 8040000000000000
[ 1652.976450] Kernel panic - not syncing: Unrecoverable HMI exception
[ 1652.976467] CPU: 24 PID: 1261 Comm: kworker/24:1 Tainted: P OE 3.16.0-37-generic #51~14.04.1-Ubuntu
[ 1652.976530] Workqueue: events hmi_event_handler
[ 1652.976561] Call Trace:
[ 1652.976571] [c0000000189bf9e0] [c000000000017330] show_stack+
[ 1652.976647] [c0000000189bfac0] [c0000000009eb8e4] dump_stack+
[ 1652.976674] [c0000000189bfaf0] [c0000000009e2b5c] panic+0x104/0x2a8
[ 1652.976703] [c0000000189bfb80] [c00000000007306c] hmi_event_
[ 1652.976732] [c0000000189bfc50] [c0000000000d62dc] process_
[ 1652.976772] [c0000000189bfce0] [c0000000000d6b80] worker_
[ 1652.976800] [c0000000189bfd80] [c0000000000e0024] kthread+0x114/0x140
[ 1652.976837] [c0000000189bfe30] [c00000000000a468] ret_from_
[ 1652.977085] ---[ end Kernel panic - not syncing: Unrecoverable HMI exception
.
.
.
.
|------
| 0x50227CFC 06/14/2015 22:27:43 System Hypervisor Firmware mbox |
| 0x50227CFC Processed Predictive Error B182953C | --> Unexpected mail box error , needs investigation
|------
| 0x50227CDD 06/14/2015 22:27:21 System Hypervisor Firmware spif |
| 0x50227CDD Processed Predictive Error B182951C | --> Unexpected mail box error , needs investigation
|------
| 0x50227CC0 06/14/2015 22:26:57 System Hypervisor Firmware mbox |
| 0x50227CC0 Processed Predictive Error B182953C | --> Unexpected mail box error , needs investigation
|------
| 0x50227C9E 06/14/2015 22:26:31 System Hypervisor Firmware spif |
| 0x50227C9E Processed Predictive Error B182951C | --> Unexpected mail box error , needs investigation
|------
| 0x50227C88 06/14/2015 22:26:12 System Hypervisor Firmware mbox |
| 0x50227C88 Processed Predictive Error B182953C | --> Unexpected mail box error , needs investigation
|------
| 0x50227C7E 06/14/2015 22:26:06 System Hypervisor Firmware spif |
| 0x50227C7E Processed Predictive Error B182951C | --> Unexpected mail box error , needs investigation
|------
| 0x50227C4F 06/14/2015 22:25:27 System Hypervisor Firmware mbox |
| 0x50227C4F Processed Predictive Error B182953C | --> Unexpected mail box error , needs investigation
|------
| 0x50227C40 06/14/2015 22:25:16 System Hypervisor Firmware spif |
| 0x50227C40 Processed Predictive Error B182951C | --> Unexpected mail box error , needs investigation
|------
| 0x50227C17 06/14/2015 22:24:42 System Hypervisor Firmware mbox |
| 0x50227C17 Processed Predictive Error B182953C | --> Unexpected mail box error , needs investigation
|------
| 0x50227C01 06/14/2015 22:24:25 System Hypervisor Firmware spif |
| 0x50227C01 Processed Predictive Error B182951C | --> Unexpected mail box error , needs investigation
|------
| 0x501CF14F 06/11/2015 15:32:31 Processor Unit (CPU) prdf |
| 0x501CF14F Processed Predictive Error B113E504 | --> Error log corresponding to injected Core Unit Checkstop error ( Core FIR [5] )
|------
== Comment: #3 - MAHESH J. SALGAONKAR <email address hidden> - 2015-06-16 14:06:56 ==
Ah! I see that the panic timeout is set to 0 (zero). That means the kernel will wait forever after a panic. The behaviour reported in this bug is as expected when the panic timeout is set to 0. Hence it is not a bug.
===================
root@gp4p01:
0
root@gp4p01:
root@gp4p01:
10
root@gp4p01:
After setting panic_timeout to 10 seconds, I see that the system rebooted on the unrecoverable HMI:
[salgaonkarm@mars linux-2.6]$ fsp_cmd -i gp4fp1.
Checking if system 'gp4fp1.
spawn telnet gp4fp1.
Trying 9.3.136.91...
Connected to gp4fp1.
Escape character is '^]'.
Linux 2.6.32-
gp4fp1 login: dev
Password:
$ smgr mfgState
runtime
$
$ putscom pu.ex 10013100 5 1 1 -ib -p3 -c5
s1.ex k0:n0:s0:p03:c5
ecmd_ppc putscom pu.ex 10013100 5 1 1 -ib -p3 -c5
$ smgr mfgState
runtime
$ smgr mfgState
ipling <= System rebooting..
===================
For the system to reboot after a panic, please set the panic timeout to a non-zero value and try injecting the core checkstop again. You can do that in two ways:
1. Boot the kernel with the "panic=<secs>" kernel option (see below for valid values)
OR
2. Once the OS is booted, echo a non-zero value to /proc/sys/
$ echo 10 > /proc/sys/
Please refer to Documentation/
-------
panic= [KNL] Kernel behaviour on panic: delay <timeout>
-------
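The documentation excerpt quoted above stops after its first line. To the best of my knowledge, the kernel interprets the value as follows: a positive value is the number of seconds to wait before rebooting, zero means wait forever, and a negative value means reboot immediately. A minimal sketch of that interpretation, assuming those semantics; the helper name `describe_panic_timeout` is hypothetical and not part of any existing tool:

```shell
#!/bin/sh
# Hypothetical helper: explain what a given panic timeout value means.
# Assumed semantics: >0 = seconds before reboot, 0 = wait forever,
# <0 = reboot immediately.
describe_panic_timeout() {
    # Reject anything that is not an optionally negative integer.
    case "$1" in
        ''|*[!0-9-]*|-|*?-*) echo "invalid"; return 1 ;;
    esac
    if [ "$1" -gt 0 ]; then
        echo "reboot after $1 seconds"
    elif [ "$1" -eq 0 ]; then
        echo "wait forever"
    else
        echo "reboot immediately"
    fi
}

# Usage on a running system (kernel.panic is the sysctl key for the timeout):
#   describe_panic_timeout "$(sysctl -n kernel.panic)"
```

On the machine in this report, the value 0 would be described as "wait forever", which matches the hang that was observed.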
== Comment: #8 - Stewart Smith <email address hidden> - 2015-09-17 20:03:51 ==
You should be able to provide panic=180 as a kernel argument as a workaround.
However, this is *completely* an Ubuntu bug. Perhaps we want to modify the panic() handler in Linux, though.
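To confirm whether a kernel argument of this kind actually reached the running kernel, one can inspect the boot command line. A minimal sketch, assuming the documented `panic=<secs>` option and assuming that the last occurrence wins when the option is repeated; the helper name `panic_from_cmdline` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the value of the last panic= option found in a
# kernel command line string, or print nothing if the option is absent.
panic_from_cmdline() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^panic=//p' | tail -n 1
}

# Usage on a running system:
#   panic_from_cmdline "$(cat /proc/cmdline)"
```

An empty result means no timeout was passed at boot, so the kernel falls back to its built-in default unless a sysctl setting overrides it later.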
== Comment: #9 - Luciano Chavez <email address hidden> - 2015-09-18 10:47:31 ==
I was speaking to Feroz and Donna this morning (and sent a note about the same thing last night), and to them the issue is not the panic but that the system does not reboot after it hits it. As explained, there are three methods to override the panic timeout (add kernel.panic= to sysctl.conf, echo value to /proc/sys/
So, I hope Feroz can chime in but they want the hard coded CONFIG_
So, if that is what they want, we will have to send this to Canonical for their take on this.
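For reference, the sysctl.conf method mentioned above could look roughly like the following. This is only a sketch: the helper name `set_persistent_panic` is hypothetical, and the usual location of the file (/etc/sysctl.conf) is an assumption about the installed system.

```shell
#!/bin/sh
# Hypothetical helper: make "kernel.panic = <secs>" the only kernel.panic
# entry in the given sysctl configuration file.
set_persistent_panic() {
    conf=$1
    secs=$2
    tmp=$(mktemp) || return 1
    # Keep every existing line except earlier kernel.panic assignments.
    # Other keys such as kernel.panic_on_oops are left untouched.
    if [ -f "$conf" ]; then
        grep -v '^[[:space:]]*kernel\.panic[[:space:]]*=' "$conf" > "$tmp" || true
    fi
    echo "kernel.panic = $secs" >> "$tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# Usage (as root), followed by reloading the settings:
#   set_persistent_panic /etc/sysctl.conf 10
#   sysctl -p
```

Unlike the echo method, a setting written this way survives a reboot, but it still only takes effect once the sysctl settings are loaded during boot.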
Hi Canonical,
Could you please share your take on this issue?
Thank you.
tags: added: architecture-ppc64le bugnameltc-126343 severity-high targetmilestone-inin14043
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
status: New → Triaged
Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Chris J Arges (arges)
importance: Undecided → Medium
status: Triaged → In Progress
Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs IRC channel on Freenode.
To change the source package that this bug is filed about, visit https://bugs.launchpad.net/ubuntu/+bug/1512593/+editstatus and add the package name in the text box next to the word Package.
[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]