[LTCTest][Opal][OP820] Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1] while executing Frozen PE error injection

Bug #1604420 reported by bugproxy
Affects: kernel-package (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Canonical Kernel Team
Milestone: none listed

Bug Description

== Comment: #0 - PAVAMAN SUBRAMANIYAM <email address hidden> - 2016-07-18 04:18:33 ==
---Problem Description---
Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1]

Contact Information = <email address hidden>

---uname output---
Linux ltc-garri2 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:05:18 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
root@ltc-garri2:~# lspci
0000:00:00.0 PCI bridge: IBM Device 03dc
0000:01:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
0001:00:00.0 PCI bridge: IBM Device 03dc
0002:00:00.0 PCI bridge: IBM Device 03dc
0002:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0003:00:00.0 PCI bridge: IBM Device 03dc
0004:00:00.0 PCI bridge: IBM Device 03dc
0005:00:00.0 PCI bridge: IBM Device 03dc
0005:01:00.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:01.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:02.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:03.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:04.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:03:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0005:04:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
0005:05:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
0005:06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
0005:07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
0005:07:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
0006:00:00.0 PCI bridge: IBM Device 03dc
0006:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0007:00:00.0 PCI bridge: IBM Device 03dc
0008:00:00.0 Bridge: IBM Device 04ea
0008:00:00.1 Bridge: IBM Device 04ea
0008:00:01.0 Bridge: IBM Device 04ea
0008:00:01.1 Bridge: IBM Device 04ea
0009:00:00.0 Bridge: IBM Device 04ea
0009:00:00.1 Bridge: IBM Device 04ea
0009:00:01.0 Bridge: IBM Device 04ea
0009:00:01.1 Bridge: IBM Device 04ea

Machine Type = P8

---Debugger---
A debugger is not configured

---Steps to Reproduce---
Install Ubuntu 16.10 on P8 OpenPOWER 8335-GTB hardware.
Then execute the Frozen PE error injection tests as shown below:

root@ltc-garri2:~# lspci | grep -i 0004:00:00.0
0004:00:00.0 PCI bridge: IBM Device 03dc

root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
eeh_slot_resets=0

root@ltc-garri2:~# echo 0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct && lspci -ns 0004:00:00.0; echo $?
0004:00:00.0 0604: 1014:03dc
0

The kernel crashes immediately with an Oops message.

Ubuntu Yakkety Yak (development branch) ltc-garri2 hvc0

ltc-garri2 login: [ 3231.801472] EEH: Frozen PE#0 on PHB#4 detected
[ 3231.801639] EEH: PE location: N/A, PHB location: N/A
[ 3231.802772] Unable to handle kernel paging request for data at address 0x00000010
[ 3231.802841] Faulting instruction address: 0xc000000000083c7c
[1718257727455,3] OPAL: Trying a CPU re-init with flags: 0x1
[1719041828975,3] OPAL: Trying a CPU re-init with flags: 0x2
[ 3231.802898] Oops: Kernel access of bad area, sig: 11 [#1]
[ 3231.802944] SMP NR_CPUS=2048 NUMA PowerNV
[ 3231.802994] Modules linked in: ipmi_devintf ip6table_filter ip6_tables iptable_fi
[1721485466204,3] PCI-SLOT-0000000000000000 Invalid state 00000000
[1722574789913,3] PCI-SLOT-0000000000000001 Invalid state 00000000
[1725814514431,3] PCI-SLOT-0000000000000002 Invalid state 00000000
[1726903821324,3] PCI-SLOT-0000000000000003 Invalid state 00000000
[1727993145172,3] PCI-SLOT-0000000000000008 Invalid state 00000000
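
The reproduction steps above can be sketched as a small Python helper. This is only an illustration, not the test suite used by the reporter: the function names (`err_injct_path`, `slot_resets`, `inject_frozen_pe`) are hypothetical, the paths and the injection string `0:0:4:0:0` are copied verbatim from the commands in this report, and the meaning of the individual fields in that string is not interpreted here. Running the injection requires root on a PowerNV machine and is expected to freeze the PE.

```python
import subprocess
from pathlib import Path

# Paths as used in the report; assumed to exist only on PowerNV with debugfs mounted.
DEBUGFS_ROOT = Path("/sys/kernel/debug/powerpc")
EEH_STATS = Path("/proc/powerpc/eeh")


def err_injct_path(phb: int, root: Path = DEBUGFS_ROOT) -> Path:
    """Return the per-PHB injection file, e.g. PHB 4 -> .../PCI0004/err_injct."""
    return root / f"PCI{phb:04x}" / "err_injct"


def slot_resets(stats_text: str) -> int:
    """Extract the eeh_slot_resets counter from the text of /proc/powerpc/eeh."""
    for line in stats_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "eeh_slot_resets":
            return int(value)
    raise ValueError("eeh_slot_resets not found")


def inject_frozen_pe(phb: int, device: str, spec: str = "0:0:4:0:0") -> int:
    """Write the injection string, then read config space with lspci.

    Mirrors: echo 0:0:4:0:0 > .../err_injct && lspci -ns <device>
    Returns the lspci exit status. Needs root; do not run on a production box.
    """
    err_injct_path(phb).write_text(spec + "\n")
    return subprocess.run(["lspci", "-ns", device], check=False).returncode
```

On the affected machine the equivalent of the reported steps would be `slot_resets(EEH_STATS.read_text())` followed by `inject_frozen_pe(4, "0004:00:00.0")`.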

Stack trace output:
[ 3231.804789] Call Trace:
[ 3231.804812] [c000000feee8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
[ 3231.804882] [c000000feee8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 3231.804944] [c000000feee8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
[ 3231.805005] [c000000feee8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
[ 3231.805073] [c000000feee8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
[ 3231.805311] [c000000feee8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
[ 3231.805391] [c000000feee8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 3231.805452] [c000000feee8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4

Oops output:
[ 3231.801472] EEH: Frozen PE#0 on PHB#4 detected
[ 3231.801639] EEH: PE location: N/A, PHB location: N/A
[ 3231.802463] EEH: This PCI device has failed 1 times in the last hour
[ 3231.802465] EEH: Notify device drivers to shutdown
[ 3231.802469] EEH: Collect temporary log
[ 3231.802496] EEH: of node=0004:00:00:0
[ 3231.802499] EEH: PCI device/vendor: 03dc1014
[ 3231.802502] EEH: PCI cmd/status register: 00100106
[ 3231.802505] EEH: Bridge secondary status: 0000
[ 3231.802508] EEH: Bridge control: 0002
[ 3231.802509] EEH: PCI-E capabilities and status follow:
[ 3231.802518] EEH: PCI-E 00: 00420010 00008002 00000040 00300103
[ 3231.802525] EEH: PCI-E 10: 01010008 00000000 00000000 00010010
[ 3231.802527] EEH: PCI-E 20: 00000000
[ 3231.802529] EEH: PCI-E AER capability register set follows:
[ 3231.802537] EEH: PCI-E AER 00: 14810001 00000000 0008d000 00000000
[ 3231.802544] EEH: PCI-E AER 10: 00000000 00000000 000001e0 00000000
[ 3231.802551] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 3231.802554] EEH: PCI-E AER 30: 00000000 00000000
[ 3231.802556] PHB3 PHB#4 Diag-data (Version: 1)
[ 3231.802558] brdgCtl: 00000002
[ 3231.802560] UtlSts: 00080000 00000000 00000000
[ 3231.802562] RootSts: 00000040 00000000 01010008 00100102 00000000
[ 3231.802564] PhbSts: 0000001c00000000 0000001c00000000
[ 3231.802567] Lem: 0000000000100000 42498e367f502eae 0000000000000000
[ 3231.802569] InAErr: 4000000000000000 4000000000000000 0202000000000000 0000000000000000
[ 3231.802571] PE[ 0] A/B: 8440002b00000000 8000000000000000
[ 3231.802574] EEH: Reset with hotplug activity
[ 3231.802590] pci_bus 0004:01: busn_res: [bus 01] is released
[ 3231.802772] Unable to handle kernel paging request for data at address 0x00000010
[ 3231.802841] Faulting instruction address: 0xc000000000083c7c
[ 3231.802898] Oops: Kernel access of bad area, sig: 11 [#1]
[ 3231.802944] SMP NR_CPUS=2048 NUMA PowerNV
[ 3231.802994] Modules linked in: ipmi_devintf ip6table_filter ip6_tables iptable_filter ip_tables x_tables joydev input_leds mac_hid hid_generic usbhid hid at24 ofpart cmdlinepart powernv_flash ipmi_powernv uio_pdrv_genirq ipmi_msghandler mtd uio opal_prd powernv_rng ibmpowernv autofs4 uas usb_storage nouveau ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci mlx5_core
[ 3231.803524] CPU: 10 PID: 611 Comm: eehd Not tainted 4.4.0-31-generic #50-Ubuntu
[ 3231.803583] task: c000000feee02a20 ti: c000000feee88000 task.ti: c000000feee88000
[ 3231.803644] NIP: c000000000083c7c LR: c000000000083c78 CTR: c000000000083c20
[ 3231.803707] REGS: c000000feee8b760 TRAP: 0300 Not tainted (4.4.0-31-generic)
[ 3231.803765] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28008822 XER: 00000000
[ 3231.803915] CFAR: c000000000008468 DAR: 0000000000000010 DSISR: 40000000 SOFTE: 1
               GPR00: c000000000083c78 c000000feee8b9e0 c0000000015b5d00 0000000000000000
               GPR04: 0000000000000001 c000000feee8bac0 c000001e4ec732b0 0000000000000ff0
               GPR08: 0000000000000000 0000000000000000 0000000000000000 000000000000001b
               GPR12: c000000000083c20 c000000007b25f00 c0000000000e6318 c000001e4ecf0340
               GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
               GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000d42468
               GPR24: c000000000d42440 0000000000000100 c000000000036460 0000000000000000
               GPR28: c00000000161a3f0 0000000000000001 c000001fff766780 c000001e4ed44000
[ 3231.804708] NIP [c000000000083c7c] pnv_eeh_reset+0x5c/0x170
[ 3231.804749] LR [c000000000083c78] pnv_eeh_reset+0x58/0x170
[ 3231.804789] Call Trace:
[ 3231.804812] [c000000feee8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
[ 3231.804882] [c000000feee8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 3231.804944] [c000000feee8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
[ 3231.805005] [c000000feee8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
[ 3231.805073] [c000000feee8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
[ 3231.805311] [c000000feee8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
[ 3231.805391] [c000000feee8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 3231.805452] [c000000feee8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
[ 3231.805519] Instruction dump:
[ 3231.805550] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
[ 3231.805653] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010
[ 3231.811881] ---[ end trace e5268555486ccf38 ]---
[ 3231.890155]
[ 3231.890244] Sending IPI to other CPUs
[ 3231.891305] IPI complete
[ 3231.892479] kexec: waiting for cpu 1 (physical 17) to enter OPAL
[ 3231.894550] kexec: waiting for cpu 24 (physical 48) to enter OPAL
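
As a reading aid for the Oops above, the faulting instruction word marked `<e9230010>` in the instruction dump can be decoded by hand. The sketch below assumes the standard Power ISA DS-form encoding for primary opcode 58 (`ld`, `ldu`, `lwa`); the helper names are hypothetical and this is not output from any tool used in the report.

```python
def decode_ds_load(word: int) -> str:
    """Decode a Power ISA DS-form load (primary opcode 58) into assembly text."""
    opcode = word >> 26
    if opcode != 58:
        raise ValueError(f"not a DS-form load: primary opcode {opcode}")
    rt = (word >> 21) & 0x1F      # target register
    ra = (word >> 16) & 0x1F      # base register
    disp = word & 0xFFFC          # DS field, already scaled by 4
    if disp & 0x8000:             # sign-extend the 16-bit displacement
        disp -= 0x10000
    mnemonics = {0: "ld", 1: "ldu", 2: "lwa"}
    xo = word & 0x3
    if xo not in mnemonics:
        raise ValueError(f"reserved extended opcode {xo}")
    return f"{mnemonics[xo]} r{rt},{disp}(r{ra})"


def effective_address(base: int, disp: int) -> int:
    """Effective address of the load, truncated to 64 bits."""
    return (base + disp) & 0xFFFFFFFFFFFFFFFF
```

`decode_ds_load(0xE9230010)` gives `ld r9,16(r3)`. With GPR03 equal to 0 in the register dump, `effective_address(0, 16)` is 0x10, which matches the reported DAR and the "data at address 0x00000010" message. That is consistent with a NULL pointer being dereferenced at offset 0x10 inside pnv_eeh_reset.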

System Dump Info:
  The system is not configured to capture a system dump.

*Additional Instructions for <email address hidden>:
-Post a private note with access information to the machine that the bug is occurring on.
-Attach sysctl -a output to the bug.

== Comment: #1 - PAVAMAN SUBRAMANIYAM <email address hidden> - 2016-07-18 04:21:50 ==
The following two patches need to be incorporated to fix this issue.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cca0e542e02e48cce541a49c4046ec094ec27c1e
("powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()")

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3aa256b7258b3d19f8b44557cc64525a993b941
("powerpc/eeh: Fix invalid cached PE primary bus")

== Comment: #6 - Guo Wen Shan <email address hidden> - 2016-07-18 22:38:44 ==
Only one of the fixes (shown below) can be applied to Ubuntu Xenial. The other cannot be applied because it depends on the EEH support for SR-IOV, which isn't in Ubuntu Xenial yet.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3aa256b7258b3d19f8b44557cc64525a993b941
("powerpc/eeh: Fix invalid cached PE primary bus")

The above fix should also be backported to Ubuntu Xenial, as the EEH code in Ubuntu 16.04.1 and 16.10 should be almost the same. The backported fix is available from bug 143706. I can also attach it to this Bugzilla on request.

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-143838 severity-critical targetmilestone-inin1610
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → kernel-package (Ubuntu)
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-07-19 09:40 EDT-------
(In reply to comment #6)
> There is only one fix (as below) can be applied to ubuntu xenial. The other
> one can't be applied as it depends on the EEH support for SRIOV which isn't
> in ubuntu xenial yet.
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3aa256b7258b3d19f8b44557cc64525a993b941
> ("powerpc/eeh: Fix invalid cached PE primary bus")
>
> Also, the above fix should be backported to ubuntu xenial. As the EEH code
> between ubuntu 16.04.1 and 16.10 should be almost same. The backported fix
> can be got from bug 143706. I also can attach it to this bugzilla on request.

Canonical:
LTC bug 143706 referenced above is LP bug 1603449

Steve Langasek (vorlon)
affects: kernel-package (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
Tim Gardner (timg-tpi) wrote :

Marking this as a duplicate of bug 1603449.

bugproxy (bugproxy)
affects: linux (Ubuntu) → kernel-package (Ubuntu)
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-07-21 09:59 EDT-------
*** This bug has been marked as a duplicate of bug 143706 ***
