2019-10-16 19:43:39 |
Manoj Iyer |
description |
== Comment: #0 - PAVAMAN SUBRAMANIYAM <pavsubra@in.ibm.com> - 2019-05-07 23:31:20 ==
Install the latest OP930 firmware images, built from the upstream op-build git tree, on P9 OpenPOWER hardware.
root@witherspoon:~# cat /etc/os-release
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="ibm-v2.3"
VERSION_ID="ibm-v2.3-476-g2d622cb-r32-0-g9973ab0"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) ibm-v2.3"
BUILD_ID="ibm-v2.3-476-g2d622cb-r32"
root@witherspoon:~# cat /var/lib/phosphor-software-manager/pnor/ro/VERSION
open-power-witherspoon-v2.3-rc2-58-g59fd0743
buildroot-2019.02.2-17-g93b841d204
skiboot-v6.3-rc2
hostboot-19a436e
occ-58e422d
linux-5.0.9-openpower1-p3a4d5a4
petitboot-v1.10.3
machine-xml-a6f4df3
hostboot-binaries-hw043019a.940
capp-ucode-p9-dd2-v4
sbe-249671d
hcode-hw040319a.940
Then enable the OPAL software checkstop (sw xstop) option manually with the following command:
root@ltc-wspoon11:~# nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
root@ltc-wspoon11:~# nvram -p ibm,skiboot --print-config
"ibm,skiboot" Partition
--------------------------
experimental-fast-reset=1
snarf-mode=noooooo
opal-sw-xstop=enable
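Before injecting the error, it can help to confirm that the setting took effect. The following is a minimal sketch. It assumes the `--print-config` output keeps the one-`key=value`-per-line layout shown above, and the function names are hypothetical:

```python
from __future__ import annotations

# Output of `nvram -p ibm,skiboot --print-config`, copied from the session above.
SAMPLE = """\
"ibm,skiboot" Partition
--------------------------
experimental-fast-reset=1
snarf-mode=noooooo
opal-sw-xstop=enable
"""


def parse_print_config(text: str) -> dict[str, str]:
    """Collect key=value pairs, skipping the partition header and separator."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if "=" not in line:  # header and dashed separator carry no '='
            continue
        key, _, value = line.partition("=")
        config[key] = value
    return config


def sw_xstop_enabled(text: str) -> bool:
    """True if opal-sw-xstop is set to 'enable' in the printed config."""
    return parse_print_config(text).get("opal-sw-xstop") == "enable"


if __name__ == "__main__":
    print(sw_xstop_enabled(SAMPLE))  # True
```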
Then, from the Linux host, inject the MCE UE (machine check exception, uncorrectable error) as follows:
root@ltc-wspoon11:~# ./probe_cpus.sh -L
CHIP ID: 0 CORE ID: 0 THREADS: 4 CPUs: 0 1 2 3
CHIP ID: 0 CORE ID: 1 THREADS: 4 CPUs: 4 5 6 7
CHIP ID: 0 CORE ID: 2 THREADS: 4 CPUs: 8 9 10 11
CHIP ID: 0 CORE ID: 3 THREADS: 4 CPUs: 12 13 14 15
CHIP ID: 0 CORE ID: 6 THREADS: 4 CPUs: 16 17 18 19
CHIP ID: 0 CORE ID: 7 THREADS: 4 CPUs: 20 21 22 23
CHIP ID: 0 CORE ID: 8 THREADS: 4 CPUs: 24 25 26 27
CHIP ID: 0 CORE ID: 9 THREADS: 4 CPUs: 28 29 30 31
CHIP ID: 0 CORE ID: 10 THREADS: 4 CPUs: 32 33 34 35
CHIP ID: 0 CORE ID: 11 THREADS: 4 CPUs: 36 37 38 39
CHIP ID: 0 CORE ID: 12 THREADS: 4 CPUs: 40 41 42 43
CHIP ID: 0 CORE ID: 13 THREADS: 4 CPUs: 44 45 46 47
CHIP ID: 0 CORE ID: 16 THREADS: 4 CPUs: 48 49 50 51
CHIP ID: 0 CORE ID: 17 THREADS: 4 CPUs: 52 53 54 55
CHIP ID: 0 CORE ID: 18 THREADS: 4 CPUs: 56 57 58 59
CHIP ID: 0 CORE ID: 19 THREADS: 4 CPUs: 60 61 62 63
CHIP ID: 0 CORE ID: 20 THREADS: 4 CPUs: 64 65 66 67
CHIP ID: 0 CORE ID: 21 THREADS: 4 CPUs: 68 69 70 71
CHIP ID: 8 CORE ID: 6 THREADS: 4 CPUs: 72 73 74 75
CHIP ID: 8 CORE ID: 7 THREADS: 4 CPUs: 76 77 78 79
CHIP ID: 8 CORE ID: 8 THREADS: 4 CPUs: 80 81 82 83
CHIP ID: 8 CORE ID: 9 THREADS: 4 CPUs: 84 85 86 87
CHIP ID: 8 CORE ID: 10 THREADS: 4 CPUs: 88 89 90 91
CHIP ID: 8 CORE ID: 11 THREADS: 4 CPUs: 92 93 94 95
CHIP ID: 8 CORE ID: 12 THREADS: 4 CPUs: 96 97 98 99
CHIP ID: 8 CORE ID: 13 THREADS: 4 CPUs: 100 101 102 103
CHIP ID: 8 CORE ID: 14 THREADS: 4 CPUs: 104 105 106 107
CHIP ID: 8 CORE ID: 15 THREADS: 4 CPUs: 108 109 110 111
CHIP ID: 8 CORE ID: 16 THREADS: 4 CPUs: 112 113 114 115
CHIP ID: 8 CORE ID: 17 THREADS: 4 CPUs: 116 117 118 119
CHIP ID: 8 CORE ID: 18 THREADS: 4 CPUs: 120 121 122 123
CHIP ID: 8 CORE ID: 19 THREADS: 4 CPUs: 124 125 126 127
CHIP ID: 8 CORE ID: 20 THREADS: 4 CPUs: 128 129 130 131
CHIP ID: 8 CORE ID: 21 THREADS: 4 CPUs: 132 133 134 135
CHIP ID: 8 CORE ID: 22 THREADS: 4 CPUs: 136 137 138 139
CHIP ID: 8 CORE ID: 23 THREADS: 4 CPUs: 140 141 142 143
-----------------------------
p[0]
eq[0,1,2,3,4,5]
ex[0,1,3,4,5,6,8,9,10]
c[0,1,2,3,6,7,8,9,10,11,12,13,16,17,18,19,20,21]
p[8]
eq[1,2,3,4,5]
ex[3,4,5,6,7,8,9,10,11]
c[6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
-----------------------------
----------Processor Layout-------------------
p[0]
+---EQ00----+ +---EQ02----+ +---EQ04----+
|EX-0  C0   | |EX-4  C8   | |EX-8  C16  |
+ - - - - - + + - - - - - + + - - - - - +
|EX-0  C1   | |EX-4  C9   | |EX-8  C17  |
+ - - - - - + + - - - - - + + - - - - - +
|EX-1  C2   | |EX-5  C10  | |EX-9  C18  |
+ - - - - - + + - - - - - + + - - - - - +
|EX-1  C3   | |EX-5  C11  | |EX-9  C19  |
+-----------+ +-----------+ +-----------+
+---EQ01----+ +---EQ03----+ +---EQ05----+
|           | |EX-6  C12  | |EX-10 C20  |
+ - - - - - + + - - - - - + + - - - - - +
|           | |EX-6  C13  | |EX-10 C21  |
+ - - - - - + + - - - - - + + - - - - - +
|EX-3  C6   | |           | |           |
+ - - - - - + + - - - - - + + - - - - - +
|EX-3  C7   | |           | |           |
+-----------+ +-----------+ +-----------+
p[8]
+---EQ00----+ +---EQ02----+ +---EQ04----+
|           | |EX-4  C8   | |EX-8  C16  |
+ - - - - - + + - - - - - + + - - - - - +
|           | |EX-4  C9   | |EX-8  C17  |
+ - - - - - + + - - - - - + + - - - - - +
|           | |EX-5  C10  | |EX-9  C18  |
+ - - - - - + + - - - - - + + - - - - - +
|           | |EX-5  C11  | |EX-9  C19  |
+-----------+ +-----------+ +-----------+
+---EQ01----+ +---EQ03----+ +---EQ05----+
|           | |EX-6  C12  | |EX-10 C20  |
+ - - - - - + + - - - - - + + - - - - - +
|           | |EX-6  C13  | |EX-10 C21  |
+ - - - - - + + - - - - - + + - - - - - +
|EX-3  C6   | |EX-7  C14  | |EX-11 C22  |
+ - - - - - + + - - - - - + + - - - - - +
|EX-3  C7   | |EX-7  C15  | |EX-11 C23  |
+-----------+ +-----------+ +-----------+
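The getscom/putscom commands below address chip 0x8, and `scom_addr_p9.sh` is given core 15. In the `probe_cpus.sh` listing above, chip 8 core 15 corresponds to Linux CPUs 108-111. The report does not show whether `run_workload.sh` pins work to those CPUs. The following is a minimal sketch for pulling that mapping out of the listing. It assumes the line format stays as shown, and the function name is hypothetical:

```python
from __future__ import annotations

import re

# Matches lines like "CHIP ID: 8 CORE ID: 15 THREADS: 4 CPUs: 108 109 110 111".
LINE_RE = re.compile(
    r"CHIP ID:\s*(\d+)\s+CORE ID:\s*(\d+)\s+THREADS:\s*\d+\s+CPUs:\s*([\d ]+)"
)


def cpus_by_core(listing: str) -> dict[tuple[int, int], list[int]]:
    """Map (chip_id, core_id) to the Linux CPU numbers of that core's threads."""
    mapping = {}
    for line in listing.splitlines():
        m = LINE_RE.search(line)
        if m:
            chip, core = int(m.group(1)), int(m.group(2))
            mapping[(chip, core)] = [int(cpu) for cpu in m.group(3).split()]
    return mapping


# A few lines copied from the probe_cpus.sh output above.
SAMPLE = """\
CHIP ID: 0 CORE ID: 0 THREADS: 4 CPUs: 0 1 2 3
CHIP ID: 8 CORE ID: 14 THREADS: 4 CPUs: 104 105 106 107
CHIP ID: 8 CORE ID: 15 THREADS: 4 CPUs: 108 109 110 111
"""

if __name__ == "__main__":
    print(cpus_by_core(SAMPLE)[(8, 15)])  # [108, 109, 110, 111]
```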
root@ltc-wspoon11:~# ./statedisable.sh
./statedisable.sh: line 10: /sys/devices/system/cpu/cpu*/cpuidle/state7/disable: No such file or directory
./statedisable.sh: line 11: /sys/devices/system/cpu/cpu*/cpuidle/state8/disable: No such file or directory
root@ltc-wspoon11:~# cpupower idle-info
CPUidle driver: powernv_idle
CPUidle governor: menu
analyzing CPU 0:
Number of idle states: 7
Available idle states: snooze stop0_lite stop0 stop1 stop2 stop4 stop5
snooze (DISABLED) :
Flags/Description: snooze
Latency: 0
Usage: 81861
Duration: 29748269
stop0_lite (DISABLED) :
Flags/Description: stop0_lite
Latency: 1
Usage: 70
Duration: 1982345
stop0 (DISABLED) :
Flags/Description: stop0
Latency: 2
Usage: 274
Duration: 125896
stop1 (DISABLED) :
Flags/Description: stop1
Latency: 5
Usage: 36
Duration: 4922
stop2 (DISABLED) :
Flags/Description: stop2
Latency: 10
Usage: 3745
Duration: 88300041
stop4 (DISABLED) :
Flags/Description: stop4
Latency: 100
Usage: 65
Duration: 1048951
stop5 (DISABLED) :
Flags/Description: stop5
Latency: 200
Usage: 30377
Duration: 61977191643
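The two errors from `statedisable.sh` show unexpanded `cpu*` globs for `state7` and `state8`. The script's source is not included in the report. Meanwhile, `cpupower idle-info` reports only seven idle states, which the cpuidle sysfs interface numbers `state0` through `state6`. The following is a minimal Python sketch of the same step. It assumes the script's intent is to disable every idle state on every CPU, and it discovers states by globbing instead of hard-coding indices, so states the kernel does not expose are simply skipped. `sysfs_root` and the helper names are hypothetical, added so the sketch can run against a fake tree. Writing to the real `/sys` requires root.

```python
from __future__ import annotations

import glob
import os
import tempfile


def disable_all_idle_states(sysfs_root: str = "/sys/devices/system/cpu") -> int:
    """Write "1" to every cpuN/cpuidle/stateM/disable file that exists.

    Returns the number of files written; nothing is attempted for state
    indices the kernel does not expose.
    """
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpuidle", "state[0-9]*", "disable")
    written = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as f:
            f.write("1\n")
        written += 1
    return written


def make_fake_sysfs(ncpus: int, nstates: int) -> str:
    """Build a throwaway directory tree shaped like the cpuidle sysfs files."""
    root = tempfile.mkdtemp()
    for cpu in range(ncpus):
        for state in range(nstates):
            d = os.path.join(root, f"cpu{cpu}", "cpuidle", f"state{state}")
            os.makedirs(d)
            with open(os.path.join(d, "disable"), "w") as f:
                f.write("0\n")
    return root


if __name__ == "__main__":
    fake = make_fake_sysfs(ncpus=2, nstates=7)
    print(disable_all_idle_states(fake))  # 14
```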
root@ltc-wspoon11:~# ./run_workload.sh
root@ltc-wspoon11:~# ./scom_addr_p9.sh 0x1001080c 15
EQ[ 3]: 0x1301080c
EX[ 7]: 0x13010c0c
C[15]: 0x3f01080c
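`scom_addr_p9.sh` is not included in the report, but its EQ and EX results for core 15 are consistent with a simple rule:

- four cores per quad (EQ), with quad chiplet IDs starting at 0x10;
- two cores per EX;
- the odd EX of a quad is selected by bit 0x400 of the address.

The layout above agrees, since C15 on p[8] sits in EX-7 of EQ03. The following is a minimal sketch of that rule. It is inferred from this one output rather than taken from POWER9 documentation, and it does not attempt the script's core (`C[15]`) translation:

```python
from __future__ import annotations

EQ_CHIPLET_BASE = 0x10  # assumption: EQ0 chiplet ID; EQ1 = 0x11, and so on
ODD_EX_BIT = 0x400      # assumption: set for the second EX within a quad


def eq_ex_addresses(base_addr: int, core: int) -> tuple[int, int]:
    """Rebase an EQ-chiplet SCOM address onto the quad and EX owning `core`."""
    eq = core // 4  # four cores per quad
    ex = core // 2  # two cores per EX
    eq_addr = (base_addr & 0x00FFFFFF) | ((EQ_CHIPLET_BASE + eq) << 24)
    ex_addr = (eq_addr & ~ODD_EX_BIT) | (ODD_EX_BIT if ex % 2 else 0)
    return eq_addr, ex_addr


if __name__ == "__main__":
    eq_addr, ex_addr = eq_ex_addresses(0x1001080C, 15)
    print(f"EQ: {eq_addr:#010x} EX: {ex_addr:#010x}")  # EQ: 0x1301080c EX: 0x13010c0c
```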
root@ltc-wspoon11:~# ./skiboot/external/xscom-utils/getscom -c 0x8 0x13010c0c
0000000000000000
root@ltc-wspoon11:~# ./skiboot/external/xscom-utils/putscom -c 0x8 0x13010c0c 0c00000000000000
0c00000000000000
root@ltc-wspoon11:~# ./skiboot/external/xscom-utils/putscom -c 0x8 0x13010c0c 0c00000000000000
0c00000000000000
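`getscom` reads 0x0 from chip 8's EX 7 copy of the register, and `putscom` then writes 0x0c00000000000000. POWER SCOM registers are documented with IBM (MSB-0) bit numbering, where bit 0 is the most significant bit of the 64-bit value, so this write sets bits 4 and 5. The report does not say what those bits control. The following is a small helper for converting between the two notations; the names are illustrative:

```python
from __future__ import annotations


def ibm_bits_set(value: int, width: int = 64) -> list[int]:
    """Positions of set bits in IBM MSB-0 numbering (bit 0 = most significant)."""
    return [i for i in range(width) if (value >> (width - 1 - i)) & 1]


def ibm_mask(bits: list[int], width: int = 64) -> int:
    """Inverse of ibm_bits_set: build a value from MSB-0 bit positions."""
    value = 0
    for bit in bits:
        value |= 1 << (width - 1 - bit)
    return value


if __name__ == "__main__":
    print(ibm_bits_set(0x0C00000000000000))  # [4, 5]
    print(f"{ibm_mask([4, 5]):016x}")        # 0c00000000000000
```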
After the machine check error is injected, the host Linux stops responding to ping and console access to the machine is lost.
However, the OpenBMC shell and GUI still show the host in the Running state.
== Comment: #1 - PAVAMAN SUBRAMANIYAM <pavsubra@in.ibm.com> - 2019-05-07 23:33:31 ==
The machine runs Ubuntu 18.04.
root@ltc-wspoon11:~# uname -a
Linux ltc-wspoon11 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
root@ltc-wspoon11:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
root@ltc-wspoon11:~# cat /proc/cpuinfo | tail
cpu : POWER9, altivec supported
clock : 2300.000000MHz
revision : 2.3 (pvr 004e 1203)
timebase : 512000000
platform : PowerNV
model : 8335-GTH
machine : PowerNV 8335-GTH
firmware : OPAL
MMU : Radix
root@ltc-wspoon11:~# lsmcode
Version of System Firmware :
Product Name : OpenPOWER Firmware
Product Version : witherspoon-v2.3-rc2-58-g59fd0743
Product Extra : skiboot-v6.3-rc2
Product Extra : bmc-firmware-version-2.03
Product Extra : occ-58e422d
Product Extra : hostboot-19a436e
Product Extra : buildroot-2019.02.2-17-g93b841d204
Product Extra : capp-ucode-p9-dd2-v4
Product Extra : machine-xml-a6f4df3
Product Extra : hostboot-binaries-hw043019a.940
Product Extra : sbe-249671d
Product Extra : hcode-hw040319a.940
Product Extra : petitboot-v1.10.3
Product Extra : linux-5.0.9-openpower1-p3a4d5a4
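The host-side `lsmcode` output lists the same components as the PNOR `VERSION` file read on the BMC in comment #0. There are two differences: the host also reports `bmc-firmware-version-2.03`, and the top-level build appears without its `open-power-` prefix. The following is a minimal sketch for checking this automatically. It assumes both outputs keep the formats shown, and the function names are hypothetical:

```python
from __future__ import annotations

PREFIX = "open-power-"  # the VERSION file's top-level build carries this prefix


def lsmcode_components(text: str) -> set[str]:
    """Values of the 'Product Version' and 'Product Extra' lines of lsmcode."""
    found = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("Product Version", "Product Extra"):
            found.add(value.strip())
    return found


def missing_on_host(version_file: str, lsmcode_text: str) -> list[str]:
    """PNOR VERSION entries that the host's lsmcode output does not report."""
    host = lsmcode_components(lsmcode_text)
    missing = []
    for raw in version_file.splitlines():
        entry = raw.strip()
        if not entry:
            continue
        name = entry[len(PREFIX):] if entry.startswith(PREFIX) else entry
        if name not in host:
            missing.append(entry)
    return missing


# Abbreviated copies of the outputs above.
SAMPLE_VERSION = """\
open-power-witherspoon-v2.3-rc2-58-g59fd0743
skiboot-v6.3-rc2
hostboot-19a436e
"""
SAMPLE_LSMCODE = """\
Version of System Firmware :
 Product Name : OpenPOWER Firmware
 Product Version : witherspoon-v2.3-rc2-58-g59fd0743
 Product Extra : skiboot-v6.3-rc2
 Product Extra : bmc-firmware-version-2.03
 Product Extra : hostboot-19a436e
"""

if __name__ == "__main__":
    print(missing_on_host(SAMPLE_VERSION, SAMPLE_LSMCODE))  # []
```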
== Comment: #3 - PAVAMAN SUBRAMANIYAM <pavsubra@in.ibm.com> - 2019-05-07 23:42:35 ==
I quickly tested MCE on an OP930 build (IBM-witherspoon-ibm-OP9-v2.2-3.5) with kernel 4.15.0-47-generic and saw no hang. On further investigation, the hang appears with kernel 4.15.0-48-generic and later, so it looks like changes that went into 4.15.0-48-generic cause it. Still investigating.
== Comment: #9 - Application Cdeadmin <cdeadmin@us.ibm.com> - 2019-05-22 06:45:07 ==
==== State: Working by: jayeshp on 22 May 2019 06:37:27 ====
Any update?
== Comment: #11 - MAHESH J. SALGAONKAR <mahesh.salgaonkar@in.ibm.com> - 2019-09-19 04:44:01 ==
The hang should go away with the patch below:
commit 99ead78afd1128bfcebe7f88f3b102fb2da09aee
Author: Balbir Singh <bsingharora@gmail.com>
Date: Tue Aug 20 13:43:47 2019 +0530
powerpc/mce: Fix MCE handling for huge pages
The current code would fail on huge pages addresses, since the shift would
be incorrect. Use the correct page shift value returned by
__find_linux_pte() to get the correct physical address. The code is more
generic and can handle both regular and compound pages.
Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[arbab@linux.ibm.com: Fixup pseries_do_memory_failure()]
Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820081352.8641-3-santosh@fossix.org |
[IMPACT]
The MCE test renders the system unresponsive on P9 OpenPOWER hardware (Witherspoon).
[TEST]
A test kernel is available in ppa:ubuntu-power-triage/lp1848127. Please see the [OTHER] section for test details and comment #7 for results with the PPA kernel.
[FIX]
IBM has identified the following patch that fixes this issue:
commit 99ead78afd1128bfcebe7f88f3b102fb2da09aee
Author: Balbir Singh <bsingharora@gmail.com>
Date: Tue Aug 20 13:43:47 2019 +0530
powerpc/mce: Fix MCE handling for huge pages
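The commit message says the old code would fail on huge-page addresses because the shift was incorrect. The fix uses the page shift returned by `__find_linux_pte()` to get the correct physical address. The arithmetic below is a simplified model, not the kernel code. It assumes:

- 64K base pages;
- a 2M huge page;
- without the shift, the lookup yields the frame at the start of the huge page, so the error would be attributed to the wrong 64K page.

All addresses are hypothetical:

```python
PAGE_SHIFT = 16  # assumption: 64K base pages
HUGE_SHIFT = 21  # assumption: a 2M huge-page mapping


def pfn_with_shift(pte_pfn: int, addr: int, shift: int) -> int:
    """Add the index of the base page that `addr` falls in within a 2**shift mapping."""
    if shift <= PAGE_SHIFT:
        return pte_pfn
    offset = addr & ((1 << shift) - 1)
    return pte_pfn + (offset >> PAGE_SHIFT)


def phys_addr(pfn: int, addr: int) -> int:
    """Combine a frame number with the offset of `addr` inside its base page."""
    return (pfn << PAGE_SHIFT) | (addr & ((1 << PAGE_SHIFT) - 1))


if __name__ == "__main__":
    huge_phys = 0x4000_0000              # hypothetical physical start of the huge page
    pte_pfn = huge_phys >> PAGE_SHIFT    # frame number recorded for the mapping
    addr = 0x7FFF_0000_0000 + 0x12_3456  # hypothetical faulting VA in a 2M-aligned mapping

    wrong = phys_addr(pte_pfn, addr)  # shift ignored: first frame of the huge page
    right = phys_addr(pfn_with_shift(pte_pfn, addr, HUGE_SHIFT), addr)
    print(hex(wrong), hex(right))  # 0x40003456 0x40123456
```

Only the shift-aware result equals the huge page's physical start plus the 0x123456 offset of the access.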
[REGRESSION POTENTIAL]
The patch applies only to the powerpc architecture and is limited in scope to MCE handling for huge pages. It does not touch any generic code, so any regression would be limited to powerpc MCE handling.
[OTHER]
== Comment: #0 - PAVAMAN SUBRAMANIYAM <pavsubra@in.ibm.com> - 2019-05-07 23:31:20 ==
Install the latest OP930 firmware images, built from the upstream op-build git tree, on P9 OpenPOWER hardware.
root@witherspoon:~# cat /etc/os-release
ID="openbmc-phosphor"
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="ibm-v2.3"
VERSION_ID="ibm-v2.3-476-g2d622cb-r32-0-g9973ab0"
PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) ibm-v2.3"
BUILD_ID="ibm-v2.3-476-g2d622cb-r32"
root@witherspoon:~# cat /var/lib/phosphor-software-manager/pnor/ro/VERSION
open-power-witherspoon-v2.3-rc2-58-g59fd0743
buildroot-2019.02.2-17-g93b841d204
skiboot-v6.3-rc2
hostboot-19a436e
occ-58e422d
linux-5.0.9-openpower1-p3a4d5a4
petitboot-v1.10.3
machine-xml-a6f4df3
hostboot-binaries-hw043019a.940
capp-ucode-p9-dd2-v4
sbe-249671d
hcode-hw040319a.940
Then enable the OPAL software checkstop (sw xstop) option manually with the following command:
root@ltc-wspoon11:~# nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
root@ltc-wspoon11:~# nvram -p ibm,skiboot --print-config
"ibm,skiboot" Partition
--------------------------
experimental-fast-reset=1
snarf-mode=noooooo
opal-sw-xstop=enable
Then, from the Linux host, inject the MCE UE (machine check exception, uncorrectable error) as follows:
root@ltc-wspoon11:~# ./probe_cpus.sh -L
CHIP ID: 0 CORE ID: 0 THREADS: 4 CPUs: 0 1 2 3
CHIP ID: 0 CORE ID: 1 THREADS: 4 CPUs: 4 5 6 7
CHIP ID: 0 CORE ID: 2 THREADS: 4 CPUs: 8 9 10 11
CHIP ID: 0 CORE ID: 3 THREADS: 4 CPUs: 12 13 14 15
CHIP ID: 0 CORE ID: 6 THREADS: 4 CPUs: 16 17 18 19
CHIP ID: 0 CORE ID: 7 THREADS: 4 CPUs: 20 21 22 23
CHIP ID: 0 CORE ID: 8 THREADS: 4 CPUs: 24 25 26 27
CHIP ID: 0 CORE ID: 9 THREADS: 4 CPUs: 28 29 30 31
CHIP ID: 0 CORE ID: 10 THREADS: 4 CPUs: 32 33 34 35
CHIP ID: 0 CORE ID: 11 THREADS: 4 CPUs: 36 37 38 39
CHIP ID: 0 CORE ID: 12 THREADS: 4 CPUs: 40 41 42 43
CHIP ID: 0 CORE ID: 13 THREADS: 4 CPUs: 44 45 46 47
CHIP ID: 0 CORE ID: 16 THREADS: 4 CPUs: 48 49 50 51
CHIP ID: 0 CORE ID: 17 THREADS: 4 CPUs: 52 53 54 55
CHIP ID: 0 CORE ID: 18 THREADS: 4 CPUs: 56 57 58 59
CHIP ID: 0 CORE ID: 19 THREADS: 4 CPUs: 60 61 62 63
CHIP ID: 0 CORE ID: 20 THREADS: 4 CPUs: 64 65 66 67
CHIP ID: 0 CORE ID: 21 THREADS: 4 CPUs: 68 69 70 71
CHIP ID: 8 CORE ID: 6 THREADS: 4 CPUs: 72 73 74 75
CHIP ID: 8 CORE ID: 7 THREADS: 4 CPUs: 76 77 78 79
CHIP ID: 8 CORE ID: 8 THREADS: 4 CPUs: 80 81 82 83
CHIP ID: 8 CORE ID: 9 THREADS: 4 CPUs: 84 85 86 87
CHIP ID: 8 CORE ID: 10 THREADS: 4 CPUs: 88 89 90 91
CHIP ID: 8 CORE ID: 11 THREADS: 4 CPUs: 92 93 94 95
CHIP ID: 8 CORE ID: 12 THREADS: 4 CPUs: 96 97 98 99
CHIP ID: 8 CORE ID: 13 THREADS: 4 CPUs: 100 101 102 103
CHIP ID: 8 CORE ID: 14 THREADS: 4 CPUs: 104 105 106 107
CHIP ID: 8 CORE ID: 15 THREADS: 4 CPUs: 108 109 110 111
CHIP ID: 8 CORE ID: 16 THREADS: 4 CPUs: 112 113 114 115
CHIP ID: 8 CORE ID: 17 THREADS: 4 CPUs: 116 117 118 119
CHIP ID: 8 CORE ID: 18 THREADS: 4 CPUs: 120 121 122 123
CHIP ID: 8 CORE ID: 19 THREADS: 4 CPUs: 124 125 126 127
CHIP ID: 8 CORE ID: 20 THREADS: 4 CPUs: 128 129 130 131
CHIP ID: 8 CORE ID: 21 THREADS: 4 CPUs: 132 133 134 135
CHIP ID: 8 CORE ID: 22 THREADS: 4 CPUs: 136 137 138 139
CHIP ID: 8 CORE ID: 23 THREADS: 4 CPUs: 140 141 142 143
-----------------------------
p[0]
eq[0,1,2,3,4,5]
ex[0,1,3,4,5,6,8,9,10]
c[0,1,2,3,6,7,8,9,10,11,12,13,16,17,18,19,20,21]
p[8]
eq[1,2,3,4,5]
ex[3,4,5,6,7,8,9,10,11]
c[6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
-----------------------------
----------Processor Layout-------------------
p[0]
+---EQ00----+ +---EQ02----+ +---EQ04----+
|EX-0  C0   | |EX-4  C8   | |EX-8  C16  |
+ - - - - - + + - - - - - + + - - - - - +
|EX-0  C1   | |EX-4  C9   | |EX-8  C17  |
+ - - - - - + + - - - - - + + - - - - - +
|EX-1  C2   | |EX-5  C10  | |EX-9  C18  |
+ - - - - - + + - - - - - + + - - - - - +
|EX-1  C3   | |EX-5  C11  | |EX-9  C19  |
+-----------+ +-----------+ +-----------+
+---EQ01----+ +---EQ03----+ +---EQ05----+
|           | |EX-6  C12  | |EX-10 C20  |
+ - - - - - + + - - - - - + + - - - - - +
|           | |EX-6  C13  | |EX-10 C21  |
+ - - - - - + + - - - - - + + - - - - - +
|EX-3  C6   | |           | |           |
+ - - - - - + + - - - - - + + - - - - - +
|EX-3  C7   | |           | |           |
+-----------+ +-----------+ +-----------+
p[8]
+---EQ00----+ +---EQ02----+ +---EQ04----+
|           | |EX-4  C8   | |EX-8  C16  |
+ - - - - - + + - - - - - + + - - - - - +
|           | |EX-4  C9   | |EX-8  C17  |
+ - - - - - + + - - - - - + + - - - - - +
|           | |EX-5  C10  | |EX-9  C18  |
+ - - - - - + + - - - - - + + - - - - - +
|           | |EX-5  C11  | |EX-9  C19  |
+-----------+ +-----------+ +-----------+
+---EQ01----+ +---EQ03----+ +---EQ05----+
|           | |EX-6  C12  | |EX-10 C20  |
+ - - - - - + + - - - - - + + - - - - - +
|           | |EX-6  C13  | |EX-10 C21  |
+ - - - - - + + - - - - - + + - - - - - +
|EX-3  C6   | |EX-7  C14  | |EX-11 C22  |
+ - - - - - + + - - - - - + + - - - - - +
|EX-3  C7   | |EX-7  C15  | |EX-11 C23  |
+-----------+ +-----------+ +-----------+
root@ltc-wspoon11:~# ./statedisable.sh
./statedisable.sh: line 10: /sys/devices/system/cpu/cpu*/cpuidle/state7/disable: No such file or directory
./statedisable.sh: line 11: /sys/devices/system/cpu/cpu*/cpuidle/state8/disable: No such file or directory
root@ltc-wspoon11:~# cpupower idle-info
CPUidle driver: powernv_idle
CPUidle governor: menu
analyzing CPU 0:
Number of idle states: 7
Available idle states: snooze stop0_lite stop0 stop1 stop2 stop4 stop5
snooze (DISABLED) :
Flags/Description: snooze
Latency: 0
Usage: 81861
Duration: 29748269
stop0_lite (DISABLED) :
Flags/Description: stop0_lite
Latency: 1
Usage: 70
Duration: 1982345
stop0 (DISABLED) :
Flags/Description: stop0
Latency: 2
Usage: 274
Duration: 125896
stop1 (DISABLED) :
Flags/Description: stop1
Latency: 5
Usage: 36
Duration: 4922
stop2 (DISABLED) :
Flags/Description: stop2
Latency: 10
Usage: 3745
Duration: 88300041
stop4 (DISABLED) :
Flags/Description: stop4
Latency: 100
Usage: 65
Duration: 1048951
stop5 (DISABLED) :
Flags/Description: stop5
Latency: 200
Usage: 30377
Duration: 61977191643
root@ltc-wspoon11:~# ./run_workload.sh
root@ltc-wspoon11:~# ./scom_addr_p9.sh 0x1001080c 15
EQ[ 3]: 0x1301080c
EX[ 7]: 0x13010c0c
C[15]: 0x3f01080c
root@ltc-wspoon11:~# ./skiboot/external/xscom-utils/getscom -c 0x8 0x13010c0c
0000000000000000
root@ltc-wspoon11:~# ./skiboot/external/xscom-utils/putscom -c 0x8 0x13010c0c 0c00000000000000
0c00000000000000
root@ltc-wspoon11:~# ./skiboot/external/xscom-utils/putscom -c 0x8 0x13010c0c 0c00000000000000
0c00000000000000
After the machine check error is injected, the host Linux stops responding to ping and console access to the machine is lost.
However, the OpenBMC shell and GUI still show the host in the Running state.
== Comment: #1 - PAVAMAN SUBRAMANIYAM <pavsubra@in.ibm.com> - 2019-05-07 23:33:31 ==
The machine runs Ubuntu 18.04.
root@ltc-wspoon11:~# uname -a
Linux ltc-wspoon11 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
root@ltc-wspoon11:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
root@ltc-wspoon11:~# cat /proc/cpuinfo | tail
cpu : POWER9, altivec supported
clock : 2300.000000MHz
revision : 2.3 (pvr 004e 1203)
timebase : 512000000
platform : PowerNV
model : 8335-GTH
machine : PowerNV 8335-GTH
firmware : OPAL
MMU : Radix
root@ltc-wspoon11:~# lsmcode
Version of System Firmware :
Product Name : OpenPOWER Firmware
Product Version : witherspoon-v2.3-rc2-58-g59fd0743
Product Extra : skiboot-v6.3-rc2
Product Extra : bmc-firmware-version-2.03
Product Extra : occ-58e422d
Product Extra : hostboot-19a436e
Product Extra : buildroot-2019.02.2-17-g93b841d204
Product Extra : capp-ucode-p9-dd2-v4
Product Extra : machine-xml-a6f4df3
Product Extra : hostboot-binaries-hw043019a.940
Product Extra : sbe-249671d
Product Extra : hcode-hw040319a.940
Product Extra : petitboot-v1.10.3
Product Extra : linux-5.0.9-openpower1-p3a4d5a4
== Comment: #3 - PAVAMAN SUBRAMANIYAM <pavsubra@in.ibm.com> - 2019-05-07 23:42:35 ==
I quickly tested MCE on an OP930 build (IBM-witherspoon-ibm-OP9-v2.2-3.5) with kernel 4.15.0-47-generic and saw no hang. On further investigation, the hang appears with kernel 4.15.0-48-generic and later, so it looks like changes that went into 4.15.0-48-generic cause it. Still investigating.
== Comment: #9 - Application Cdeadmin <cdeadmin@us.ibm.com> - 2019-05-22 06:45:07 ==
==== State: Working by: jayeshp on 22 May 2019 06:37:27 ====
Any update?
== Comment: #11 - MAHESH J. SALGAONKAR <mahesh.salgaonkar@in.ibm.com> - 2019-09-19 04:44:01 ==
The hang should go away with the patch below:
commit 99ead78afd1128bfcebe7f88f3b102fb2da09aee
Author: Balbir Singh <bsingharora@gmail.com>
Date: Tue Aug 20 13:43:47 2019 +0530
powerpc/mce: Fix MCE handling for huge pages
The current code would fail on huge pages addresses, since the shift would
be incorrect. Use the correct page shift value returned by
__find_linux_pte() to get the correct physical address. The code is more
generic and can handle both regular and compound pages.
Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[arbab@linux.ibm.com: Fixup pseries_do_memory_failure()]
Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820081352.8641-3-santosh@fossix.org |
|