[LTCTest][OPAL][OP910.20] WARNING: CPU: 97 PID: 11965 at /build/linux-0zaMZw/linux-4.15.0/kernel/sched/core.c:1189 set_task_cpu+0x240/0x250
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
The Ubuntu-power-systems project | Fix Released | High | Canonical Kernel Team | |
linux (Ubuntu) | Fix Released | High | Unassigned | |
Bionic | Fix Released | High | Unassigned | |
Bug Description
== SRU Justification ==
IBM reports seeing the following during their testing:
WARNING: CPU: 97 PID: 11965 at /build/
This is a regression and was introduced by the following two commits in
v4.15-rc1:
01eaac2b0591 ("powerpc/mce: Hookup ierror (instruction) UE errors")
ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
This regression is fixed by commit 75ecfb49516c in v4.17-rc3. The
commit was also cc'd to upstream stable, but it is being SRU'd to get
the fix into Ubuntu without waiting for it to come down via stable
updates.
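The affected window described above (introduced in v4.15-rc1, fixed in v4.17-rc3) can be expressed as a small check. This is only a sketch: the helper names `tag_key` and `mainline_tag_affected` are made up for illustration, and the check applies to unpatched mainline tags only. Stable and distribution kernels, such as the Ubuntu 4.15.0-20 kernel in this report, may carry the fix as a backport and cannot be judged by version number alone.

```python
import re

# Mainline tags taken from the SRU justification.
INTRODUCED = "v4.15-rc1"  # first tag containing the two regressing commits
FIXED = "v4.17-rc3"       # first tag containing commit 75ecfb49516c


def tag_key(tag):
    """Turn a mainline tag such as 'v4.15-rc1' or 'v4.16' into a sortable tuple."""
    m = re.fullmatch(r"v(\d+)\.(\d+)(?:-rc(\d+))?", tag)
    if m is None:
        raise ValueError(f"not a mainline tag: {tag!r}")
    major, minor, rc = m.groups()
    # A final release sorts after all of its release candidates.
    return (int(major), int(minor), int(rc) if rc is not None else float("inf"))


def mainline_tag_affected(tag):
    """True if an unpatched mainline tag lies in the range [INTRODUCED, FIXED)."""
    return tag_key(INTRODUCED) <= tag_key(tag) < tag_key(FIXED)


for tag in ("v4.14", "v4.15", "v4.17-rc2", "v4.17-rc3"):
    print(tag, mainline_tag_affected(tag))
```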
== Fix ==
75ecfb49516c ("powerpc/mce: Fix a bug where mce loops on memory UE.")
== Regression Potential ==
Low. Limited to powerpc. The commit was also cc'd to upstream stable
so it will receive additional upstream stable review.
== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.
== Original Bug Descriptions ==
== Comment: #0 - PAVAMAN SUBRAMANIYAM <> - 2018-04-25 01:59:10 ==
---Problem Description---
WARNING: CPU: 97 PID: 11965 at /build/
---uname output---
Linux ltc-wspoon8 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:14:44 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
Machine Type = P9
---Debugger---
A debugger is not configured
---Steps to Reproduce---
Install a P9 OpenPOWER machine with the latest OP910.20 firmware images.
root@witherspoon:~# cat /etc/os-release
ID="openbmc-
NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
VERSION="ibm-v2.0"
VERSION_
PRETTY_
BUILD_ID=
root@witherspoon:~# cat /var/lib/
open-power-
occ-8c5b727
sbe-7e02c23
We then installed Ubuntu 18.04 on the machine.
root@ltc-wspoon8:~# uname -a
Linux ltc-wspoon8 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:14:44 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@ltc-wspoon8:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04 LTS"
VERSION_ID="18.04"
HOME_URL="https:/
SUPPORT_URL="https:/
BUG_REPORT_URL="https:/
PRIVACY_
VERSION_
UBUNTU_
root@ltc-wspoon8:~# cat /proc/cpuinfo | tail
cpu : POWER9, altivec supported
clock : 2300.000000MHz
revision : 2.1 (pvr 004e 1201)
timebase : 512000000
platform : PowerNV
model : 8335-GTC........
machine : PowerNV 8335-GTC........
firmware : OPAL
MMU : Radix
root@ltc-wspoon8:~# kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.
KDUMP_COREDIR: /var/crash
crashkernel addr:
/var/
kdump initrd:
/var/
current state: ready to kdump
kexec command:
/sbin/kexec -p --command-
root@ltc-wspoon8:~# ps -ef | grep opal
root 880 2 0 01:25 ? 00:00:00 [kopald]
root 3604 1 2 01:25 ? 00:00:03 /usr/sbin/opal-prd
root 4858 4278 0 01:28 pts/0 00:00:00 grep --color=auto opal
root@ltc-wspoon8:~# service opal-prd status
● opal-prd.service - OPAL PRD daemon
Loaded: loaded (/lib/systemd/
Active: active (running) since Wed 2018-04-25 01:25:48 CDT; 2min 43s ago
Docs: man:opal-prd(8)
Main PID: 3604 (opal-prd)
Tasks: 1 (limit: 22118)
CGroup: /system.
└─3604 /usr/sbin/opal-prd
Apr 25 01:25:52 ltc-wspoon8 opal-prd[3604]: IMAGE: hbrt_init complete, version 0290000000000000
Apr 25 01:25:52 ltc-wspoon8 opal-prd[3604]: HBRT: hservices_init done
Apr 25 01:25:52 ltc-wspoon8 opal-prd[3604]: HBRT: calling enable_attns
Apr 25 01:25:52 ltc-wspoon8 opal-prd[3604]: HBRT: ATTN_SLOW:
Apr 25 01:25:52 ltc-wspoon8 opal-prd[3604]: HBRT: ATTN_SLOW:
Apr 25 01:25:52 ltc-wspoon8 opal-prd[3604]: HBRT: ATTN_SLOW:
Apr 25 01:25:52 ltc-wspoon8 opal-prd[3604]: HBRT: ATTN_SLOW:
Apr 25 01:25:52 ltc-wspoon8 opal-prd[3604]: HBRT: calling get_ipoll_events
Apr 25 01:25:52 ltc-wspoon8 opal-prd[3604]: HBRT: enabling IPOLL events 0x5b90000000000000
Apr 25 01:25:52 ltc-wspoon8 opal-prd[3604]: FW: writing init message
We then tried to inject a machine check memory UE (uncorrectable error) using the SCOM utilities.
root@ltc-wspoon8:~# ./probe_cpus.sh -L
CHIP ID: 0 CORE ID: 0 THREADS: 4 CPUs: 0 1 2 3
CHIP ID: 0 CORE ID: 1 THREADS: 4 CPUs: 4 5 6 7
CHIP ID: 0 CORE ID: 2 THREADS: 4 CPUs: 8 9 10 11
CHIP ID: 0 CORE ID: 3 THREADS: 4 CPUs: 12 13 14 15
CHIP ID: 0 CORE ID: 4 THREADS: 4 CPUs: 16 17 18 19
CHIP ID: 0 CORE ID: 5 THREADS: 4 CPUs: 20 21 22 23
CHIP ID: 0 CORE ID: 8 THREADS: 4 CPUs: 24 25 26 27
CHIP ID: 0 CORE ID: 9 THREADS: 4 CPUs: 28 29 30 31
CHIP ID: 0 CORE ID: 10 THREADS: 4 CPUs: 32 33 34 35
CHIP ID: 0 CORE ID: 11 THREADS: 4 CPUs: 36 37 38 39
CHIP ID: 0 CORE ID: 14 THREADS: 4 CPUs: 40 41 42 43
CHIP ID: 0 CORE ID: 15 THREADS: 4 CPUs: 44 45 46 47
CHIP ID: 0 CORE ID: 16 THREADS: 4 CPUs: 48 49 50 51
CHIP ID: 0 CORE ID: 17 THREADS: 4 CPUs: 52 53 54 55
CHIP ID: 0 CORE ID: 18 THREADS: 4 CPUs: 56 57 58 59
CHIP ID: 0 CORE ID: 19 THREADS: 4 CPUs: 60 61 62 63
CHIP ID: 0 CORE ID: 22 THREADS: 4 CPUs: 64 65 66 67
CHIP ID: 0 CORE ID: 23 THREADS: 4 CPUs: 68 69 70 71
CHIP ID: 8 CORE ID: 0 THREADS: 4 CPUs: 72 73 74 75
CHIP ID: 8 CORE ID: 1 THREADS: 4 CPUs: 76 77 78 79
CHIP ID: 8 CORE ID: 2 THREADS: 4 CPUs: 80 81 82 83
CHIP ID: 8 CORE ID: 3 THREADS: 4 CPUs: 84 85 86 87
CHIP ID: 8 CORE ID: 4 THREADS: 4 CPUs: 88 89 90 91
CHIP ID: 8 CORE ID: 5 THREADS: 4 CPUs: 92 93 94 95
CHIP ID: 8 CORE ID: 6 THREADS: 4 CPUs: 96 97 98 99
CHIP ID: 8 CORE ID: 7 THREADS: 4 CPUs: 100 101 102 103
CHIP ID: 8 CORE ID: 10 THREADS: 4 CPUs: 104 105 106 107
CHIP ID: 8 CORE ID: 11 THREADS: 4 CPUs: 108 109 110 111
CHIP ID: 8 CORE ID: 12 THREADS: 4 CPUs: 112 113 114 115
CHIP ID: 8 CORE ID: 13 THREADS: 4 CPUs: 116 117 118 119
CHIP ID: 8 CORE ID: 14 THREADS: 4 CPUs: 120 121 122 123
CHIP ID: 8 CORE ID: 15 THREADS: 4 CPUs: 124 125 126 127
CHIP ID: 8 CORE ID: 16 THREADS: 4 CPUs: 128 129 130 131
CHIP ID: 8 CORE ID: 17 THREADS: 4 CPUs: 132 133 134 135
CHIP ID: 8 CORE ID: 18 THREADS: 4 CPUs: 136 137 138 139
CHIP ID: 8 CORE ID: 19 THREADS: 4 CPUs: 140 141 142 143
-------
p[0]
eq[0,1,2,3,4,5]
ex[0,
c[0,
p[8]
eq[0,1,2,3,4]
ex[0,
c[0,
-------
----------Processor Layout-
p[0]
|EX-0 C0 | |EX-4 C8 | |EX-8 C16|
+ - - - - - + + - - - - - + + - - - - - +
|EX-0 C1 | |EX-4 C9 | |EX-8 C17|
+ - - - - - + + - - - - - + + - - - - - +
|EX-1 C2 | |EX-5 C10| |EX-9 C18|
+ - - - - - + + - - - - - + + - - - - - +
|EX-1 C3 | |EX-5 C11| |EX-9 C19|
|EX-2 C4 | | | | |
+ - - - - - + + - - - - - + + - - - - - +
|EX-2 C5 | | | | |
+ - - - - - + + - - - - - + + - - - - - +
| | |EX-7 C14| |EX-11 C22|
+ - - - - - + + - - - - - + + - - - - - +
| | |EX-7 C15| |EX-11 C23|
p[8]
|EX-0 C0 | | | |EX-8 C16|
+ - - - - - + + - - - - - + + - - - - - +
|EX-0 C1 | | | |EX-8 C17|
+ - - - - - + + - - - - - + + - - - - - +
|EX-1 C2 | |EX-5 C10| |EX-9 C18|
+ - - - - - + + - - - - - + + - - - - - +
|EX-1 C3 | |EX-5 C11| |EX-9 C19|
|EX-2 C4 | |EX-6 C12| | |
+ - - - - - + + - - - - - + + - - - - - +
|EX-2 C5 | |EX-6 C13| | |
+ - - - - - + + - - - - - + + - - - - - +
|EX-3 C6 | |EX-7 C14| | |
+ - - - - - + + - - - - - + + - - - - - +
|EX-3 C7 | |EX-7 C15| | |
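The kernel traces further down name logical CPUs (123 and 97), while the injection targets a specific core, so it helps to map CPU numbers back to chip and core IDs. The following is a minimal sketch that parses the `CHIP ID: ... CORE ID: ... CPUs: ...` lines from the listing above; the function name `cpu_to_core` is hypothetical, and the line format is assumed to be exactly what this particular `probe_cpus.sh -L` run printed.

```python
import re

# A few lines copied from the probe_cpus.sh -L output.
SAMPLE = """\
CHIP ID: 8 CORE ID: 5 THREADS: 4 CPUs: 92 93 94 95
CHIP ID: 8 CORE ID: 6 THREADS: 4 CPUs: 96 97 98 99
CHIP ID: 8 CORE ID: 14 THREADS: 4 CPUs: 120 121 122 123
"""

LINE_RE = re.compile(
    r"CHIP ID:\s*(\d+)\s+CORE ID:\s*(\d+)\s+THREADS:\s*(\d+)\s+CPUs:\s*([\d ]+)"
)


def cpu_to_core(listing):
    """Map each logical CPU number to its (chip_id, core_id) pair."""
    mapping = {}
    for line in listing.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip separators and layout lines
        chip, core = int(m.group(1)), int(m.group(2))
        for cpu in m.group(4).split():
            mapping[int(cpu)] = (chip, core)
    return mapping


cores = cpu_to_core(SAMPLE)
print(cores[97])   # CPU that raised the set_task_cpu warning -> (8, 6)
print(cores[123])  # CPU that took the unrecovered machine check -> (8, 14)
```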
root@ltc-wspoon8:~# ./statedisable.sh
./statedisable.sh: line 10: /sys/devices/
./statedisable.sh: line 11: /sys/devices/
root@ltc-wspoon8:~# ./run_workload.sh
root@ltc-wspoon8:~# ./scom_addr_p9.sh 0x1001080c 7
EQ[ 1]: 0x1101080c
EX[ 3]: 0x11010c0c
C[ 7]: 0x3701080c
root@ltc-wspoon8:~# ./skiboot/
0000000000000000
root@ltc-wspoon8:~# ./skiboot/
0c00000000000000
We see the following call traces in the kernel log; the MCE recovered messages that we expected are missing.
Ubuntu 18.04 LTS ltc-wspoon8 hvc0
ltc-wspoon8 login: [ 191.741142] Severe Machine check interrupt [Not recovered]
[ 191.741160] NIP [c000000000181b08]: osq_lock+0xb8/0x210
[ 191.741161] Initiator: CPU
[ 191.741163] Error type: UE [Load/Store]
[ 191.741166] opal: Hardware platform error: Unrecoverable Machine Check exception
[ 191.741172] CPU: 123 PID: 11888 Comm: find Tainted: G M 4.15.0-20-generic #21-Ubuntu
[ 191.741174] NIP: c000000000181b08 LR: c000000000cfa740 CTR: c000000000497f90
[ 191.741177] REGS: c000000007963d80 TRAP: 0200 Tainted: G M (4.15.0-20-generic)
[ 191.741178] MSR: 9000000000209033 <SF,HV,
[ 191.741188] CFAR: c000000000181b54 DAR: 00002018faf69194 DSISR: 00008000 SOFTE: 1
[ 191.741188] GPR00: c000000000cfa740 c000201857d47a30 c0000000016eae00 c0000000015c6b2c
[ 191.741188] GPR04: 0000000000000000 0000000000000000 c0000000017807c0 c000000007a20000
[ 191.741188] GPR08: c0002018faf69180 c0002018fb5e9180 c0002018faf69180 0000000000000000
[ 191.741188] GPR12: 0000000084002888 c000000007a74900 00000d02693c2b80 0000000000000000
[ 191.741188] GPR16: 0000000000000000 ffffffffffffff9c 00007fffc6e73f68 00000d02693e9510
[ 191.741188] GPR20: 0000000000000001 0000000000000000 fffffffffffffff6 0000000000000000
[ 191.741188] GPR24: c000201857d47c90 c0002018cdad201c fffffffffffff000 0000000000000004
[ 191.741188] GPR28: 0000000000000002 c0000000015c6b2c 0000000000000001 c0000000015c6b20
[ 191.741219] NIP [c000000000181b08] osq_lock+0xb8/0x210
[ 191.741224] LR [c000000000cfa740] __mutex_
[ 191.741225] Call Trace:
[ 191.741229] [c000201857d47a30] [c000000000cfa338] __mutex_
[ 191.741234] [c000201857d47ac0] [c000000000497fe0] kernfs_
[ 191.741238] [c000201857d47b00] [c0000000003e43f4] __inode_
[ 191.741241] [c000201857d47b50] [c0000000003e8bcc] link_path_
[ 191.741243] [c000201857d47bf0] [c0000000003eacbc] path_openat+
[ 191.741247] [c000201857d47c70] [c0000000003ec570] do_filp_
[ 191.741253] [c000201857d47da0] [c0000000003cfae8] do_sys_
[ 191.741257] [c000201857d47e30] [c00000000000b184] system_
[ 191.741259] Instruction dump:
[ 191.741261] 81490010 2faa0000 409e0160 782a0464 e94a0080 714a0004 40820068 3cc20009
[ 191.741267] 38c659c0 60420000 e9490008 e8e60000 <814a0014> 394affff 7d4a07b4 1d4a0b00
[ 191.743669] Severe Machine check interrupt [Recovered]
[ 191.743706] NIP [c000000000181b3c]: osq_lock+0xec/0x210
[ 191.743740] Initiator: CPU
[ 191.743766] Error type: UE [Load/Store]
[ 191.743811] WARNING: CPU: 97 PID: 11965 at /build/
[ 191.743885] Modules linked in: binfmt_misc ofpart cmdlinepart idt_89hpesx at24 opal_prd powernv_flash ipmi_powernv ipmi_devintf mtd vmx_crypto uio_pdrv_genirq ipmi_msghandler uio ibmpowernv sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_
[ 191.744292] CPU: 97 PID: 11965 Comm: find Tainted: G M 4.15.0-20-generic #21-Ubuntu
[ 191.744350] NIP: c00000000014d6e0 LR: c00000000014e30c CTR: c00000000015a240
[ 191.744401] REGS: c00020185d9eb1e0 TRAP: 0700 Tainted: G M (4.15.0-20-generic)
[ 191.744458] MSR: 9000000000029033 <SF,HV,
[ 191.744516] CFAR: c00000000014d54c SOFTE: 0
[ 191.744516] GPR00: c00000000014e30c c00020185d9eb460 c0000000016eae00 c000001f14647300
[ 191.744516] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 191.744516] GPR08: c000000001721ee0 0000000000000000 0000000000000000 9000000000001003
[ 191.744516] GPR12: 0000000028008224 c000000007a62b00 000003b558292b80 0000000000000000
[ 191.744516] GPR16: 0000000000000000 ffffffffffffff9c 00007fffdde92a38 000003b5582bbef0
[ 191.744516] GPR20: 0000000000000001 0000000000000000 fffffffffffffff6 c00020185d9eb5e0
[ 191.744516] GPR24: c000001f14647728 c00000000171dd78 c0000000011d8580 0000000000000000
[ 191.744516] GPR28: 0000000000000004 0000000000000000 0000000000000000 c000001f14647300
[ 191.749630] NIP [c00000000014d6e0] set_task_
[ 191.749709] LR [c00000000014e30c] try_to_
[ 191.749804] Call Trace:
[ 191.749846] [c00020185d9eb460] [c0000000011d8580] runqueues+0x0/0xc00 (unreliable)
[ 191.749943] [c00020185d9eb4a0] [c00000000014e30c] try_to_
[ 191.750072] [c00020185d9eb520] [c0000000001725d8] autoremove_
[ 191.750199] [c00020185d9eb550] [c000000000171b60] __wake_
[ 191.750316] [c00020185d9eb5c0] [c000000000171d4c] __wake_
[ 191.750444] [c00020185d9eb650] [c00000000018ea40] wake_up_
[ 191.750573] [c00020185d9eb680] [c000000000295d10] irq_work_
[ 191.750713] [c00020185d9eb6c0] [c000000000024ab4] __timer_
[ 191.750841] [c00020185d9eb710] [c000000000024d08] timer_interrupt
[ 191.750949] [c00020185d9eb740] [c000000000009014] decrementer_
[ 191.751079] --- interrupt: 901 at osq_lock+0xec/0x210
[ 191.751079] LR = __mutex_
[ 191.751255] [c00020185d9eba30] [c000000000cfa338] __mutex_
[ 191.751402] [c00020185d9ebac0] [c000000000497fe0] kernfs_
[ 191.751530] [c00020185d9ebb00] [c0000000003e43f4] __inode_
[ 191.751658] [c00020185d9ebb50] [c0000000003e8bcc] link_path_
[ 191.751785] [c00020185d9ebbf0] [c0000000003eacbc] path_openat+
[ 191.751894] [c00020185d9ebc70] [c0000000003ec570] do_filp_
[ 191.752003] [c00020185d9ebda0] [c0000000003cfae8] do_sys_
[ 191.752112] [c00020185d9ebe30] [c00000000000b184] system_
[ 191.752229] Instruction dump:
[ 191.752299] 7faa3670 7d4a0194 57a706be 7d4a07b4 794a1f24 7d28502a 7d293c36 71290001
[ 191.752441] 4082fe80 60000000 60000000 60420000 <0fe00000> 4bfffe6c 60000000 60420000
[ 191.752584] ---[ end trace 032f502244013ba3 ]---
[ 309.237017153,0] OPAL: Reboot requested due to Platform error.
[ 309.237089038,3] OPAL: Reboot requested due to Platform error.
[ 309.237145569,5] Software initiated checkstop disabled.
[ 309.237200666,5] OPAL: Reboot request...
[ 309.247531874,5] Unable to log error
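When repeating this test, the recovery status of each machine check can be pulled out of a captured console log rather than read by eye. The sketch below assumes the message wording seen in this log (`Severe Machine check interrupt [Recovered]` or `[Not recovered]`); other severities or kernel versions may word the line differently, and the function name `mce_events` is made up for illustration.

```python
import re

# Excerpt of the console log, with timestamps and messages copied verbatim.
LOG = """\
[ 191.741142] Severe Machine check interrupt [Not recovered]
[ 191.741160] NIP [c000000000181b08]: osq_lock+0xb8/0x210
[ 191.741163] Error type: UE [Load/Store]
[ 191.743669] Severe Machine check interrupt [Recovered]
[ 191.743706] NIP [c000000000181b3c]: osq_lock+0xec/0x210
[ 191.743766] Error type: UE [Load/Store]
"""

MCE_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+Severe Machine check interrupt \[(Recovered|Not recovered)\]"
)


def mce_events(log):
    """Return (timestamp, recovered) for every severe machine check line."""
    return [(float(ts), status == "Recovered") for ts, status in MCE_RE.findall(log)]


events = mce_events(LOG)
unrecovered = [ts for ts, recovered in events if not recovered]
print(events)
print("unrecovered at:", unrecovered)
```

A run that behaves as expected would leave `unrecovered` empty.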
== Comment: #1 - PAVAMAN SUBRAMANIYAM <> - 2018-04-25 02:03:31 ==
I discussed this bug with Mahesh, and he suggested trying the patch that has been posted upstream at the link below:
http://
== Comment: #8 - PAVAMAN SUBRAMANIYAM <> - 2018-06-01 03:09:16 ==
Can we have the patch http://
tags: | added: architecture-ppc64le bugnameltc-167176 severity-high targetmilestone-inin1804 |
Changed in ubuntu: | |
assignee: | nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) |
affects: | ubuntu → linux (Ubuntu) |
tags: | added: p9 triage-g |
Changed in ubuntu-power-systems: | |
importance: | Undecided → High |
assignee: | nobody → Canonical Kernel Team (canonical-kernel-team) |
Changed in linux (Ubuntu): | |
status: | New → In Progress |
importance: | Undecided → High |
assignee: | Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury) |
Changed in linux (Ubuntu Bionic): | |
status: | New → In Progress |
importance: | Undecided → High |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
Changed in ubuntu-power-systems: | |
status: | New → In Progress |
no longer affects: | linux (Ubuntu Cosmic) |
Changed in linux (Ubuntu Bionic): | |
status: | In Progress → Fix Committed |
Changed in ubuntu-power-systems: | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu): | |
status: | In Progress → Fix Committed |
tags: | added: cscc |
I built a test kernel with commit 75ecfb49516c53. The test kernel can be downloaded from: kernel.ubuntu.com/~jsalisbury/lp1774964
http://
Can you test this kernel and see if it resolves this bug?
Note about installing test kernels:
• If the test kernel is prior to 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.
Thanks in advance!
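The package rule in the note above can be written down as a small helper. This is a sketch only: the function name `packages_for_test_kernel` is hypothetical, it returns package name prefixes rather than full .deb file names, and it assumes that the third package for 4.15 and newer is `linux-image-unsigned`, reading the stray "unsigned .deb packages." fragment in the note as the tail of the second bullet.

```python
def packages_for_test_kernel(version):
    """Return the .deb package name prefixes to install for a test kernel.

    `version` is a string such as "4.15.0-20-generic"; only the leading
    major.minor pair is examined.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) < (4, 15):
        return ["linux-image", "linux-image-extra"]
    # Assumption: "linux-image-unsigned" completes the truncated name in the note.
    return ["linux-modules", "linux-modules-extra", "linux-image-unsigned"]


print(packages_for_test_kernel("4.15.0-20-generic"))
```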