[LTCTest][Opal][OP820] Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1] while executing Frozen PE error injection

Bug #1603449 reported by bugproxy
This bug affects 1 person
Affects          Status         Importance   Assigned to
linux (Ubuntu)   Fix Released   High         Canonical Kernel Team
Xenial           Fix Released   Undecided    Tim Gardner

Bug Description

== Comment: #0 - PAVAMAN SUBRAMANIYAM <email address hidden> - 2016-07-13 01:28:56 ==
---Problem Description---
Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1]

---uname output---
Linux ltc-garri2 4.4.0-30-generic #49-Ubuntu SMP Fri Jul 1 10:00:36 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
root@ltc-garri2:~# lspci
0000:00:00.0 PCI bridge: IBM Device 03dc
0000:01:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
0001:00:00.0 PCI bridge: IBM Device 03dc
0002:00:00.0 PCI bridge: IBM Device 03dc
0002:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0003:00:00.0 PCI bridge: IBM Device 03dc
0004:00:00.0 PCI bridge: IBM Device 03dc
0005:00:00.0 PCI bridge: IBM Device 03dc
0005:01:00.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:01.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:02.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:03.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:04.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:03:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0005:04:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
0005:05:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
0005:06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
0005:07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
0005:07:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
0006:00:00.0 PCI bridge: IBM Device 03dc
0006:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0007:00:00.0 PCI bridge: IBM Device 03dc
0008:00:00.0 Bridge: IBM Device 04ea
0008:00:00.1 Bridge: IBM Device 04ea
0008:00:01.0 Bridge: IBM Device 04ea
0008:00:01.1 Bridge: IBM Device 04ea
0009:00:00.0 Bridge: IBM Device 04ea
0009:00:00.1 Bridge: IBM Device 04ea
0009:00:01.0 Bridge: IBM Device 04ea
0009:00:01.1 Bridge: IBM Device 04ea

Machine Type = P8

---Debugger---
A debugger is not configured

---Steps to Reproduce---
Install Ubuntu 16.04.1 on a P8 Open Power 8335-GTB machine.
Then execute the Frozen PE error injection test as shown below:

root@ltc-garri2:~# lspci | grep -i 0004:00:00.0
0004:00:00.0 PCI bridge: IBM Device 03dc
root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
eeh_slot_resets=0
root@ltc-garri2:~# echo 0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct && lspci -ns 0004:00:00.0; echo $?
0004:00:00.0 0604: 1014:03dc
0

The kernel crashes immediately with an Oops message.
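
The reproduction steps can be sketched as a pair of small shell helpers. This is a minimal sketch, not a tool used by the reporters: the names err_injct_path and print_inject_cmd are hypothetical, the debugfs directory name is assumed to be "PCI" plus the PCI domain as seen in this report, and the injection string 0:0:4:0:0 is copied verbatim from the steps above without interpreting its fields. The second helper only prints the command, so nothing is injected by accident.

```shell
# Sketch only: helper names are hypothetical, not part of any EEH tool.

# Print the debugfs error-injection file for a PCI address such as
# 0004:00:00.0, using the PCI<domain> directory naming seen in this report.
err_injct_path() {
    domain=${1%%:*}    # text before the first colon, e.g. "0004"
    printf '/sys/kernel/debug/powerpc/PCI%s/err_injct\n' "$domain"
}

# Print (do not run) the injection command from the reproduction steps.
# The string 0:0:4:0:0 is copied verbatim from this report.
print_inject_cmd() {
    printf 'echo 0:0:4:0:0 > %s && lspci -ns %s\n' \
        "$(err_injct_path "$1")" "$1"
}
```

For example, print_inject_cmd 0004:00:00.0 prints the command line shown in the steps above, which can then be reviewed before running it as root on a test machine.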

Contact Information = <email address hidden>

Stack trace output:
 [ 289.297946] Call Trace:
[ 289.297969] [c000000feeb8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
[ 289.298042] [c000000feeb8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 289.298105] [c000000feeb8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
[ 289.298165] [c000000feeb8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
[ 289.298234] [c000000feeb8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
[ 289.298304] [c000000feeb8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
[ 289.298374] [c000000feeb8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 289.298434] [c000000feeb8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
[ 289.298501] Instruction dump:
[ 289.298531] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
[ 289.298630] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010

Oops output:
 [ 289.294622] EEH: Frozen PE#0 on PHB#4 detected
[ 289.294785] EEH: PE location: N/A, PHB location: N/A
[ 289.295598] EEH: This PCI device has failed 1 times in the last hour
[ 289.295600] EEH: Notify device drivers to shutdown
[ 289.295605] EEH: Collect temporary log
[ 289.295632] EEH: of node=0004:00:00:0
[ 289.295635] EEH: PCI device/vendor: 03dc1014
[ 289.295638] EEH: PCI cmd/status register: 00100106
[ 289.295641] EEH: Bridge secondary status: 0000
[ 289.295644] EEH: Bridge control: 0002
[ 289.295645] EEH: PCI-E capabilities and status follow:
[ 289.295654] EEH: PCI-E 00: 00420010 00008002 00000040 00300103
[ 289.295661] EEH: PCI-E 10: 01010008 00000000 00000000 00010010
[ 289.295664] EEH: PCI-E 20: 00000000
[ 289.295665] EEH: PCI-E AER capability register set follows:
[ 289.295674] EEH: PCI-E AER 00: 14810001 00000000 0008d000 00000000
[ 289.295680] EEH: PCI-E AER 10: 00000000 00000000 000001e0 00000000
[ 289.295687] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 289.295690] EEH: PCI-E AER 30: 00000000 00000000
[ 289.295693] PHB3 PHB#4 Diag-data (Version: 1)
[ 289.295695] brdgCtl: 00000002
[ 289.295697] UtlSts: 00080000 00000000 00000000
[ 289.295699] RootSts: 00000040 00000000 01010008 00100102 00000000
[ 289.295701] PhbSts: 0000001c00000000 0000001c00000000
[ 289.295704] Lem: 0000000000100000 42498e367f502eae 0000000000000000
[ 289.295706] InAErr: 4000000000000000 4000000000000000 0202000000000000 0000000000000000
[ 289.295708] PE[ 0] A/B: 8440002b00000000 8000000000000000
[ 289.295711] EEH: Reset with hotplug activity
[ 289.295726] pci_bus 0004:01: busn_res: [bus 01] is released
[ 289.295868] Unable to handle kernel paging request for data at address 0x00000010
[ 289.295937] Faulting instruction address: 0xc000000000083c7c
[ 289.295997] Oops: Kernel access of bad area, sig: 11 [#1]
[ 289.296043] SMP NR_CPUS=2048 NUMA PowerNV
[ 289.296098] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables x_tables ipmi_devintf input_leds joydev mac_hid hid_generic usbhid hid nvidia(POE) opal_prd ofpart cmdlinepart ibmpowernv at24 powernv_flash uio_pdrv_genirq ipmi_powernv mtd ipmi_msghandler powernv_rng uio autofs4 uas usb_storage ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci mlx5_core
[ 289.296657] CPU: 1 PID: 651 Comm: eehd Tainted: P OE 4.4.0-30-generic #49-Ubuntu
[ 289.296726] task: c000000feeb02a20 ti: c000000feeb88000 task.ti: c000000feeb88000
[ 289.296787] NIP: c000000000083c7c LR: c000000000083c78 CTR: c000000000083c20
[ 289.296848] REGS: c000000feeb8b760 TRAP: 0300 Tainted: P OE (4.4.0-30-generic)
[ 289.296915] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28008822 XER: 00000000
[ 289.297065] CFAR: c000000000008468 DAR: 0000000000000010 DSISR: 40000000 SOFTE: 1
               GPR00: c000000000083c78 c000000feeb8b9e0 c0000000015b5d00 0000000000000000
               GPR04: 0000000000000001 c000000feeb8bac0 c000001e4e693540 0000000000000ff7
               GPR08: 0000000000000000 0000000000000000 0000000000000000 000000000000001c
               GPR12: c000000000083c20 c000000007b20980 c0000000000e6318 c000001e4e7a0340
               GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
               GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000d42468
               GPR24: c000000000d42440 0000000000000100 c000000000036460 0000000000000000
               GPR28: c00000000161a3f0 0000000000000001 c000001fff764480 c000001e4e744000
[ 289.297867] NIP [c000000000083c7c] pnv_eeh_reset+0x5c/0x170
[ 289.297907] LR [c000000000083c78] pnv_eeh_reset+0x58/0x170
[ 289.297946] Call Trace:
[ 289.297969] [c000000feeb8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
[ 289.298042] [c000000feeb8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 289.298105] [c000000feeb8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
[ 289.298165] [c000000feeb8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
[ 289.298234] [c000000feeb8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
[ 289.298304] [c000000feeb8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
[ 289.298374] [c000000feeb8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 289.298434] [c000000feeb8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
[ 289.298501] Instruction dump:
[ 289.298531] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
[ 289.298630] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010
[ 289.298731] ---[ end trace 393da961db41eff1 ]---
[ 289.452447]
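
A note on reading the oops above (an interpretation of the dump, not a statement from the reporters): the bracketed word <e9230010> in the instruction dump marks the instruction at NIP, GPR03 is 0, and DAR is 0x10. Decoded as a Power ISA DS-form load, that word is ld r9,16(r3), a load from 0 + 16, which is consistent with a NULL pointer being dereferenced in pnv_eeh_reset. A minimal shell sketch of the decode follows; decode_ld is a hypothetical helper and handles only the ld form.

```shell
# Decode a 64-bit load "ld RT,DS(RA)": primary opcode 58 with the low two
# bits equal to 00. Prints the mnemonic form, or returns 1 for any other
# instruction word. Argument: the instruction as 8 hex digits, no prefix.
decode_ld() {
    insn=$(( 0x$1 ))
    opcode=$(( (insn >> 26) & 0x3f ))   # top 6 bits
    xo=$(( insn & 0x3 ))                # low 2 bits select ld/ldu/lwa
    if [ "$opcode" -ne 58 ] || [ "$xo" -ne 0 ]; then
        return 1
    fi
    rt=$(( (insn >> 21) & 0x1f ))       # target register
    ra=$(( (insn >> 16) & 0x1f ))       # base register
    disp=$(( insn & 0xfffc ))           # DS field with two zero bits appended
    if [ "$disp" -ge 32768 ]; then      # sign-extend the 16-bit displacement
        disp=$(( disp - 65536 ))
    fi
    printf 'ld r%d,%d(r%d)\n' "$rt" "$disp" "$ra"
}
```

Running decode_ld e9230010 prints ld r9,16(r3), matching the faulting address 0x00000010 reported with r3 = 0.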

System Dump Info:
  The system is not configured to capture a system dump.

*Additional Instructions for <email address hidden>:
-Post a private note with access information to the machine that the bug is occurring on.
-Attach sysctl -a output to the bug.

== Comment: #2 - Guo Wen Shan <email address hidden> - 2016-07-15 09:42:09 ==
The following two patches are needed:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cca0e542e02e48cce541a49c4046ec094ec27c1e
("powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()")

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3aa256b7258b3d19f8b44557cc64525a993b941
("powerpc/eeh: Fix invalid cached PE primary bus")

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-143706 severity-critical targetmilestone-inin16041
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → kernel-package (Ubuntu)
affects: kernel-package (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-07-18 09:33 EDT-------
Hello Canonical,
Could you also please target this bug to 16.10 in addition to 16.04.1?
Thanks, Gary

Tim Gardner (timg-tpi) wrote :

Guo Wen Shan - how about if you take a stab at the backports for these 2 patches, 'cause I don't think they make sense for a 4.4 kernel.

Changed in linux (Ubuntu Xenial):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
bugproxy (bugproxy) wrote : Backported fix to Ubuntu-4.4.0-31.50

------- Comment (attachment only) From <email address hidden> 2016-07-18 21:15 EDT-------

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-07-18 21:21 EDT-------
Yeah, only one patch should be backported, and it should fix the kernel crash. The patch has been backported to Ubuntu-4.4.0-31.50 and is attached. Note that I checked out the base kernel code from the git repo below:

git://kernel.ubuntu.com/ubuntu/ubuntu-xenial.git (branch: master)

The other patch (linked below) can't be backported to Ubuntu 4.4.0 yet, as the fix depends on EEH support for SRIOV, which isn't there. Let's backport it when needed.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cca0e542e02e48cce541a49c4046ec094ec27c1e ..... ("powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()")

Tim Gardner (timg-tpi) wrote :
bugproxy (bugproxy)
affects: linux (Ubuntu) → kernel-package (Ubuntu)
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-07-21 09:59 EDT-------
*** Bug 143838 has been marked as a duplicate of this bug. ***

Changed in kernel-package (Ubuntu Xenial):
status: In Progress → Fix Committed
Stefan Bader (smb)
affects: kernel-package (Ubuntu) → linux (Ubuntu)
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-08-17 02:35 EDT-------
I have installed the latest Ubuntu 16.04.1 kernel and executed the test again.

root@ltc-garri2:~# uname -a
Linux ltc-garri2 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:00:57 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
root@ltc-garri2:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.1 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.1 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

root@ltc-garri2:~# lspci | grep -i 0001:00:00.0
0001:00:00.0 PCI bridge: IBM Device 03dc
root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
eeh_slot_resets=0

root@ltc-garri2:~# lspci | grep -i 0001:00:00.0
0001:00:00.0 PCI bridge: IBM Device 03dc
root@ltc-garri2:~# echo 0:0:1:0:0 > /sys/kernel/debug/powerpc/PCI0001/err_injct && lspci -ns 0001:00:00.0; echo $?
0001:00:00.0 0604: 1014:03dc
0

root@ltc-garri2:~# lspci | grep -i 0008:00:00.0
0008:00:00.0 PCI bridge: IBM Device 03dc
root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
eeh_slot_resets=0
root@ltc-garri2:~# echo 0:0:1:0:0 > /sys/kernel/debug/powerpc/PCI0008/err_injct && lspci -ns 0008:00:00.0; echo $?
0008:00:00.0 0604: 1014:03dc
0

The machine did not crash with a kernel panic, and the issue is resolved with the latest kernel fixes.
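
A verification run like the one above can be checked mechanically. The following is a minimal sketch under stated assumptions: log_has_oops and eeh_slot_resets are hypothetical helper names, the two patterns are the crash messages quoted in the original oops in this report, and the counter format is the eeh_slot_resets=N line shown by /proc/powerpc/eeh. Both helpers read from stdin so they can be tried on saved logs.

```shell
# Read a kernel log on stdin; succeed if it contains either of the crash
# messages quoted in this report.
log_has_oops() {
    grep -q -e 'Oops: Kernel access of bad area' \
            -e 'Unable to handle kernel paging request'
}

# Read /proc/powerpc/eeh-style text on stdin and print the value of the
# eeh_slot_resets counter (prints nothing if the line is absent).
eeh_slot_resets() {
    sed -n 's/^eeh_slot_resets=\([0-9][0-9]*\)$/\1/p'
}
```

Typical use after an injection would be dmesg | log_has_oops to look for the crash and eeh_slot_resets < /proc/powerpc/eeh to read the counter again.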

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-36.55

---------------
linux (4.4.0-36.55) xenial; urgency=low

  [ Stefan Bader ]

  * Release Tracking Bug
    - LP: #1612305

  * I2C touchpad does not work on AMD platform (LP: #1612006)
    - SAUCE: pinctrl/amd: Remove the default de-bounce time

  * CVE-2016-5696
    - tcp: make challenge acks less predictable

linux (4.4.0-35.54) xenial; urgency=low

  [ Stefan Bader ]

  * Release Tracking Bug
    - LP: #1611215

  * [i915_bpo] Sync with v4.7 (LP: #1609742)
    - SAUCE: i915_bpo: Sync with v4.7

  * s390/cio: fix reset of channel measurement block (LP: #1609415)
    - s390/cio: allow to reset channel measurement block

  * in Ubuntu16.10: Hit on Call traces and system goes down when transactional
    memory tests are running in 32TB Brazos system (LP: #1606786)
    - powerpc/tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0
    - powerpc/tm: Fix stack pointer corruption in __tm_recheckpoint()

  * Power Menu does not display after press the Power Button (LP: #1609204)
    - intel-vbtn: new driver for Intel Virtual Button
    - [config] enable CONFIG_INTEL_VBTN=m

  * OptiPlex 7450 AIO hangs when rebooting (LP: #1608762)
    - x86/reboot: Add Dell Optiplex 7450 AIO reboot quirk

  * virtualbox+usb 3.0 breaks boot, -28 kernel works (LP: #1604058)
    - SAUCE: xhci: Fix soft lockup in xhci_pci_probe path when XHCI_STATE_HALTED

  * linux-kernel: Freeing IRQ from IRQ context (LP: #1597908)
    - block: defer timeouts to a workqueue

  * Tunnel offload indications not stripped from encapsulated packets, causing
    performance overhead (LP: #1602755)
    - tunnels: Remove encapsulation offloads on decap.

  * lm-sensors is throwing "ERROR: Can't get value of subfeature temp1_input:
    I/O error" for be2net driver (LP: #1607387)
    - be2net: perform temperature query in adapter regardless of its interface
      state

  * Dell dock MAC Address pass through doesn't work in Ubuntu (LP: #1579984)
    - r8152: Add support for setting pass through MAC address on RTL8153-AD

  * vmxnet3 LRO IPv6 performance issues (stalling TCP) (LP: #1605494)
    - Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets

  * ISST-LTE:pVM:monklp5:Ubuntu16.04.1:system crashed at
    lpfc_sli4_scmd_to_wqidx_distr (LP: #1597974)
    - SAUCE: lpfc: fix oops in lpfc_sli4_scmd_to_wqidx_distr() from
      lpfc_send_taskmgmt()

  * Backport cxlflash shutdown patch to Xenial SRU (LP: #1605405)
    - SAUCE: cxlflash: Verify problem state area is mapped before notifying
      shutdown

  * Xenial update to v4.4.16 stable release (LP: #1607404)
    - mac80211: fix fast_tx header alignment
    - mac80211: mesh: flush mesh paths unconditionally
    - mac80211_hwsim: Add missing check for HWSIM_ATTR_SIGNAL
    - mac80211: Fix mesh estab_plinks counting in STA removal case
    - EDAC, sb_edac: Fix rank lookup on Broadwell
    - IB/cm: Fix a recently introduced locking bug
    - IB/mlx4: Properly initialize GRH TClass and FlowLabel in AHs
    - powerpc/pseries: Fix IBM_ARCH_VEC_NRCORES_OFFSET since POWER8NVL was added
    - powerpc/tm: Always reclaim in start_thread() for exec() class syscalls
    - usb: dwc2: fix reg...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
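
To check whether a Xenial machine runs a kernel at or beyond the release named above, the ABI number in the uname -r string can be compared against 36. This is a minimal sketch with hypothetical helper names (xenial_abi, has_eeh_fix). It assumes a 4.4.0-N-flavour version string, and it treats 4.4.0-36 as the threshold only because that is the package version this report names; the truncated changelog does not show which upload first carried the backport.

```shell
# Print the ABI number from a Xenial kernel release string such as
# "4.4.0-36-generic". Returns 1 for anything outside the 4.4.0 series.
xenial_abi() {
    case "$1" in
        4.4.0-*)
            rest=${1#4.4.0-}                     # "36-generic"
            printf '%s\n' "${rest%%[!0-9]*}"     # keep leading digits: "36"
            ;;
        *)
            return 1
            ;;
    esac
}

# Succeed if the release is 4.4.0-36 or later within the 4.4.0 series.
# Returns 2 when the string is not a recognisable 4.4.0-N release.
has_eeh_fix() {
    abi=$(xenial_abi "$1") || return 2
    [ -n "$abi" ] || return 2
    [ "$abi" -ge 36 ]
}
```

For example, has_eeh_fix "$(uname -r)" succeeds on the 4.4.0-36-generic kernel used for verification and fails on the 4.4.0-30-generic kernel from the original report.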
Seth Forshee (sforshee) wrote :

Setting development task to fix released since the fix has been upstream since 4.7-rc6.

Changed in linux (Ubuntu):
status: Triaged → Fix Released