[LTCTest][Opal][OP820] Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1] while executing Frozen PE error injection
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | High | Canonical Kernel Team |
Xenial | Fix Released | Undecided | Tim Gardner |
Bug Description
== Comment: #0 - PAVAMAN SUBRAMANIYAM <email address hidden> - 2016-07-13 01:28:56 ==
---Problem Description---
Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1]
---uname output---
Linux ltc-garri2 4.4.0-30-generic #49-Ubuntu SMP Fri Jul 1 10:00:36 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
---Additional Hardware Info---
root@ltc-garri2:~# lspci
0000:00:00.0 PCI bridge: IBM Device 03dc
0000:01:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
0001:00:00.0 PCI bridge: IBM Device 03dc
0002:00:00.0 PCI bridge: IBM Device 03dc
0002:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0003:00:00.0 PCI bridge: IBM Device 03dc
0004:00:00.0 PCI bridge: IBM Device 03dc
0005:00:00.0 PCI bridge: IBM Device 03dc
0005:01:00.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:01.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:02.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:03.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:04.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:03:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0005:04:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
0005:05:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
0005:06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
0005:07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
0005:07:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
0006:00:00.0 PCI bridge: IBM Device 03dc
0006:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0007:00:00.0 PCI bridge: IBM Device 03dc
0008:00:00.0 Bridge: IBM Device 04ea
0008:00:00.1 Bridge: IBM Device 04ea
0008:00:01.0 Bridge: IBM Device 04ea
0008:00:01.1 Bridge: IBM Device 04ea
0009:00:00.0 Bridge: IBM Device 04ea
0009:00:00.1 Bridge: IBM Device 04ea
0009:00:01.0 Bridge: IBM Device 04ea
0009:00:01.1 Bridge: IBM Device 04ea
Machine Type = P8
---Debugger---
A debugger is not configured
---Steps to Reproduce---
Install Ubuntu 16.04.1 on a P8 OpenPOWER 8335-GTB machine.
Then execute the Frozen PE error injection tests as shown below:
root@ltc-garri2:~# lspci | grep -i 0004:00:00.0
0004:00:00.0 PCI bridge: IBM Device 03dc
root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
eeh_slot_resets=0
root@ltc-garri2:~# echo 0:0:4:0:0 > /sys/kernel/
0004:00:00.0 0604: 1014:03dc
0
The kernel immediately crashes with an Oops message.
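For anyone re-running these steps, the counter check can be scripted. The following is a minimal sketch, not part of the original test procedure. It assumes that /proc/powerpc/eeh contains `name=value` lines like the `eeh_slot_resets=0` line shown above; the helper names `read_eeh_counters` and `slot_resets` are made up for illustration.

```python
def read_eeh_counters(text):
    """Parse 'name=value' lines (e.g. 'eeh_slot_resets=0') into a dict.

    Lines without '=' or with a non-numeric value are skipped.
    """
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


def slot_resets(path="/proc/powerpc/eeh"):
    """Return the eeh_slot_resets counter, or None if it is not present."""
    with open(path) as f:
        return read_eeh_counters(f.read()).get("eeh_slot_resets")


if __name__ == "__main__":
    # Sample input built from the last line shown in the transcript above.
    print(read_eeh_counters("eeh_slot_resets=0\n"))
```

On a healthy kernel, calling `slot_resets()` before and after the injection would be expected to show the counter increasing once recovery completes; on the affected kernel the machine crashes before that can be observed.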
Contact Information = <email address hidden>
Stack trace output:
[ 289.297946] Call Trace:
[ 289.297969] [c000000feeb8b9e0] [c000000000083c78] pnv_eeh_
[ 289.298042] [c000000feeb8ba60] [c000000000038250] eeh_reset_
[ 289.298105] [c000000feeb8bb00] [c000000000af444c] eeh_reset_
[ 289.298165] [c000000feeb8bba0] [c00000000003c520] eeh_handle_
[ 289.298234] [c000000feeb8bc20] [c00000000003c9c4] eeh_handle_
[ 289.298304] [c000000feeb8bcd0] [c00000000003cd88] eeh_event_
[ 289.298374] [c000000feeb8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 289.298434] [c000000feeb8be30] [c000000000009538] ret_from_
[ 289.298501] Instruction dump:
[ 289.298531] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
[ 289.298630] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010
Oops output:
[ 289.294622] EEH: Frozen PE#0 on PHB#4 detected
[ 289.294785] EEH: PE location: N/A, PHB location: N/A
[ 289.295598] EEH: This PCI device has failed 1 times in the last hour
[ 289.295600] EEH: Notify device drivers to shutdown
[ 289.295605] EEH: Collect temporary log
[ 289.295632] EEH: of node=0004:00:00:0
[ 289.295635] EEH: PCI device/vendor: 03dc1014
[ 289.295638] EEH: PCI cmd/status register: 00100106
[ 289.295641] EEH: Bridge secondary status: 0000
[ 289.295644] EEH: Bridge control: 0002
[ 289.295645] EEH: PCI-E capabilities and status follow:
[ 289.295654] EEH: PCI-E 00: 00420010 00008002 00000040 00300103
[ 289.295661] EEH: PCI-E 10: 01010008 00000000 00000000 00010010
[ 289.295664] EEH: PCI-E 20: 00000000
[ 289.295665] EEH: PCI-E AER capability register set follows:
[ 289.295674] EEH: PCI-E AER 00: 14810001 00000000 0008d000 00000000
[ 289.295680] EEH: PCI-E AER 10: 00000000 00000000 000001e0 00000000
[ 289.295687] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 289.295690] EEH: PCI-E AER 30: 00000000 00000000
[ 289.295693] PHB3 PHB#4 Diag-data (Version: 1)
[ 289.295695] brdgCtl: 00000002
[ 289.295697] UtlSts: 00080000 00000000 00000000
[ 289.295699] RootSts: 00000040 00000000 01010008 00100102 00000000
[ 289.295701] PhbSts: 0000001c00000000 0000001c00000000
[ 289.295704] Lem: 0000000000100000 42498e367f502eae 0000000000000000
[ 289.295706] InAErr: 4000000000000000 4000000000000000 0202000000000000 0000000000000000
[ 289.295708] PE[ 0] A/B: 8440002b00000000 8000000000000000
[ 289.295711] EEH: Reset with hotplug activity
[ 289.295726] pci_bus 0004:01: busn_res: [bus 01] is released
[ 289.295868] Unable to handle kernel paging request for data at address 0x00000010
[ 289.295937] Faulting instruction address: 0xc000000000083c7c
[ 289.295997] Oops: Kernel access of bad area, sig: 11 [#1]
[ 289.296043] SMP NR_CPUS=2048 NUMA PowerNV
[ 289.296098] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables x_tables ipmi_devintf input_leds joydev mac_hid hid_generic usbhid hid nvidia(POE) opal_prd ofpart cmdlinepart ibmpowernv at24 powernv_flash uio_pdrv_genirq ipmi_powernv mtd ipmi_msghandler powernv_rng uio autofs4 uas usb_storage ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci mlx5_core
[ 289.296657] CPU: 1 PID: 651 Comm: eehd Tainted: P OE 4.4.0-30-generic #49-Ubuntu
[ 289.296726] task: c000000feeb02a20 ti: c000000feeb88000 task.ti: c000000feeb88000
[ 289.296787] NIP: c000000000083c7c LR: c000000000083c78 CTR: c000000000083c20
[ 289.296848] REGS: c000000feeb8b760 TRAP: 0300 Tainted: P OE (4.4.0-30-generic)
[ 289.296915] MSR: 9000000100009033 <SF,HV,
[ 289.297065] CFAR: c000000000008468 DAR: 0000000000000010 DSISR: 40000000 SOFTE: 1
[ 289.297867] NIP [c000000000083c7c] pnv_eeh_
[ 289.297907] LR [c000000000083c78] pnv_eeh_
[ 289.297946] Call Trace:
[ 289.297969] [c000000feeb8b9e0] [c000000000083c78] pnv_eeh_
[ 289.298042] [c000000feeb8ba60] [c000000000038250] eeh_reset_
[ 289.298105] [c000000feeb8bb00] [c000000000af444c] eeh_reset_
[ 289.298165] [c000000feeb8bba0] [c00000000003c520] eeh_handle_
[ 289.298234] [c000000feeb8bc20] [c00000000003c9c4] eeh_handle_
[ 289.298304] [c000000feeb8bcd0] [c00000000003cd88] eeh_event_
[ 289.298374] [c000000feeb8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 289.298434] [c000000feeb8be30] [c000000000009538] ret_from_
[ 289.298501] Instruction dump:
[ 289.298531] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
[ 289.298630] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010
[ 289.298731] ---[ end trace 393da961db41eff1 ]---
[ 289.452447]
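A note on reading the dump above: the word marked `<e9230010>` in the instruction dump is the instruction at NIP, and DAR holds the faulting data address 0x10. The sketch below decodes that word as a PowerPC DS-form load. This is my own reading of the log, not something stated by the reporter, and the helper name `decode_ds_load` is hypothetical.

```python
def decode_ds_load(word):
    """Decode a PowerPC DS-form load with primary opcode 58 (ld, ldu, lwa).

    Returns (mnemonic, rt, ra, displacement).
    """
    opcode = (word >> 26) & 0x3F
    if opcode != 58:
        raise ValueError("not a DS-form load (primary opcode %d)" % opcode)
    rt = (word >> 21) & 0x1F      # target register
    ra = (word >> 16) & 0x1F      # base address register
    xo = word & 0x3               # extended opcode selects ld/ldu/lwa
    disp = word & 0xFFFC          # DS field with the two low bits cleared
    if disp & 0x8000:             # sign-extend the 16-bit displacement
        disp -= 0x10000
    mnemonic = {0: "ld", 1: "ldu", 2: "lwa"}.get(xo)
    if mnemonic is None:
        raise ValueError("reserved extended opcode %d" % xo)
    return mnemonic, rt, ra, disp


if __name__ == "__main__":
    mnemonic, rt, ra, disp = decode_ds_load(0xE9230010)
    dar = 0x10  # DAR value from the register dump above
    print("%s r%d,%d(r%d)" % (mnemonic, rt, disp, ra))
    print("implied base register value: 0x%x" % (dar - disp))
```

The word decodes to `ld r9,16(r3)`, so with DAR = 0x10 the base register r3 must have been 0, which points to a NULL pointer dereference at offset 0x10. The word two slots earlier in the dump, `4bfb70c5`, is a branch-and-link, so r3 was most likely the NULL return value of that call; which function that is cannot be told from the truncated symbol names in this log.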
System Dump Info:
The system is not configured to capture a system dump.
*Additional Instructions for <email address hidden>:
-Post a private note with access information to the machine that the bug is occurring on.
-Attach sysctl -a output to the bug.
== Comment: #2 - Guo Wen Shan <email address hidden> - 2016-07-15 09:42:09 ==
The following two patches are needed:
https:/
("powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()")
https:/
("powerpc/eeh: Fix invalid cached PE primary bus")
tags: | added: architecture-ppc64le bugnameltc-143706 severity-critical targetmilestone-inin16041 |
Changed in ubuntu: | |
assignee: | nobody → Taco Screen team (taco-screen-team) |
affects: | ubuntu → kernel-package (Ubuntu) |
affects: | kernel-package (Ubuntu) → linux (Ubuntu) |
Changed in linux (Ubuntu): | |
assignee: | Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team) |
importance: | Undecided → High |
status: | New → Triaged |
affects: | linux (Ubuntu) → kernel-package (Ubuntu) |
Changed in kernel-package (Ubuntu Xenial): | |
status: | In Progress → Fix Committed |
affects: | kernel-package (Ubuntu) → linux (Ubuntu) |
------- Comment From <email address hidden> 2016-07-18 09:33 EDT-------
Hello Canonical,
Could you also please target this bug to 16.10 in addition to 16.04.1?
Thanks, Gary